feat: process event embeddings in realtime #835

soyacz · 2025-10-31T15:15:37Z

When events arrive in Argus, we want to generate embeddings as soon as possible so that similar or duplicate events can be viewed in close to real time. To do that, we're going to use ScyllaDB Vector Search capabilities.

This PR introduces a new 'similar_events_processor'. How it works: when a new event arrives, its pk+ck values are stored in a new table, 'unprocessed_events'. The new similar events processor queries this table and processes the entries at 1s intervals (close to real time). Once an event is processed, its embeddings are saved and the processed keys are deleted from 'unprocessed_events'.
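The flow described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the PR's implementation: `EventKey`, `drain_once`, `run`, the dict/set stand-ins for the tables, and the `embed` callable are all hypothetical names, while the real processor reads 'unprocessed_events' from ScyllaDB and uses an embedding model.

```python
import time
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical stand-in for an event's pk+ck values (e.g. run id + timestamp).
EventKey = Tuple[str, int]
Embedding = List[float]


def drain_once(
    unprocessed: Set[EventKey],
    events: Dict[EventKey, str],
    embeddings: Dict[EventKey, Embedding],
    embed: Callable[[str], Embedding],
) -> int:
    """Process every key currently queued and return how many were handled."""
    batch = list(unprocessed)  # snapshot, so new arrivals wait for the next tick
    for key in batch:
        message = events.get(key)
        if message is not None:
            embeddings[key] = embed(message)  # save the embedding first
        # Then prune the key; keys whose event is missing are dropped as well
        # (an assumption of this sketch, not something the PR states).
        unprocessed.discard(key)
    return len(batch)


def run(drain: Callable[[], int], ticks: int, interval_s: float = 1.0) -> int:
    """Call `drain` once per `interval_s` seconds for `ticks` iterations."""
    total = 0
    for _ in range(ticks):
        total += drain()
        time.sleep(interval_s)
    return total
```

In this sketch the embedding is written before the key is removed, so an interruption between the two steps leaves the key queued and the event is reprocessed rather than lost.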

This PR does not find similar/duplicate events yet; it only calculates embeddings.

refs: https://github.com/scylladb/qa-tasks/issues/1967

soyacz · 2025-10-31T15:16:06Z

Don't merge until the Argus DB is upgraded to 2025.4.x.

Copilot

Pull Request Overview

This PR introduces AI-powered event similarity processing capabilities to Argus using vector embeddings. The main purpose is to enable similarity search and clustering of SCT test events (ERROR and CRITICAL severities) by implementing an event processing pipeline that generates embeddings and stores them in ScyllaDB's vector store.
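To make the idea of embedding-based similarity concrete, here is a small brute-force sketch in plain Python. It is only an illustration: `cosine_similarity` and `most_similar` are hypothetical helpers, the three-dimensional vectors in the usage note are made up, and a vector store would normally answer such queries with an approximate (ANN) index rather than a full scan.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query: List[float],
    stored: Dict[str, List[float]],
    limit: int = 3,
) -> List[Tuple[str, float]]:
    """Score every stored embedding against `query` and return the top `limit`."""
    scored = [(event_id, cosine_similarity(query, vec)) for event_id, vec in stored.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]
```

For example, with stored vectors `{"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0], "c": [0.9, 0.1, 0.0]}` and the query `[1.0, 0.0, 0.0]`, `most_similar(..., limit=2)` ranks `"a"` first and `"c"` second.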

Key changes:

Implementation of EventSimilarityProcessorV2 with BGE-Small-EN embedding model for processing test events
Addition of new database tables (sct_unprocessed_events, sct_error_event_embedding, sct_critical_event_embedding) with vector search capabilities
Integration of ScyllaDB vector store service and upgrade to ScyllaDB 2025.4.0-rc2 for native vector support

Reviewed Changes

Copilot reviewed 15 out of 18 changed files in this pull request and generated 6 comments.

File	Description
pyproject.toml	Added lz4 and chromadb dependencies; upgraded scylla-driver to 3.29.4; added new 'ai' optional dependency group
dev-db/docker-compose.yaml	Upgraded ScyllaDB to 2025.4.0-rc2 and added vector-store service for AI capabilities
dev-db/alpha-config/scylla.yaml	Configured vector_store_uri to connect ScyllaDB to the vector store service
argusAI/uv.lock	Updated lock file with new dependencies and scylla-driver upgrade
argusAI/pyproject.toml	Removed (replaced by root pyproject.toml configuration)
argusAI/event_similarity_processor_v2.py	Core implementation of event processor with embedding generation and batch processing
argusAI/tests/test_event_similarity_processor_v2.py	Comprehensive unit tests for the event processor
argus/backend/tests/sct_events/test_event_embedding_integration.py	Integration tests for end-to-end event processing flow
argus/backend/tests/sct_events/conftest.py	Test fixtures including embedding model stub to avoid network calls
argus/backend/tests/conftest.py	Upgraded test database container to ScyllaDB 2025.4.0-rc1
argus/backend/plugins/sct/testrun.py	Added SCTUnprocessedEvent model; removed unused uuid4 import
argus/backend/plugins/sct/service.py	Added logic to create unprocessed event entries for ERROR/CRITICAL events
argus/backend/plugins/sct/plugin.py	Registered SCTUnprocessedEvent model in plugin
argus/backend/models/web.py	Imported and registered new embedding models
argus/backend/models/argus_ai.py	Implemented Vector column type and new embedding tables with ANN index creation
argus/backend/db.py	Fixed session usage in _sync_additional_rules to support index creation

k0machi · 2025-11-12T10:27:04Z

Could you separate the changes into several commits? Something akin to dependencies updates, actual changes and tests for those changes. It's quite a large PR to go through.

soyacz · 2025-11-18T15:49:57Z

@k0machi I've separated it a bit, but most of the things must go together. Please review.

k0machi

Some changes needed in the processor and maybe additional documentation.

k0machi · 2025-11-20T08:05:07Z

dev-db/docker-compose.yaml

+      - ./scylla:/var/lib/scylla:rw
      - ./alpha-config:/etc/scylla:rw
+      - ./alpha-config.d:/etc/scylla.d:rw
+  vector-store:


Is this now a critical component of argus? This should probably be mentioned in the commit/readme and docs should be updated.

Not critical; it is only needed for "ANN" queries, which are not here yet.

soyacz · 2025-11-21T14:31:36Z

I moved it to 'draft' because we cannot merge it until the Argus DB is upgraded to 2025.4 (the vector field requires it).
I'll continue working on this PR, adding the similar events search.

soyacz requested review from Copilot, fruch and k0machi October 31, 2025 15:15

Copilot AI reviewed Oct 31, 2025

soyacz force-pushed the generate-embeddings-for-new-events branch 2 times, most recently from 3dc18be to efcda81 Compare October 31, 2025 16:47

soyacz force-pushed the generate-embeddings-for-new-events branch 2 times, most recently from 7b9a1ff to f650b73 Compare November 18, 2025 15:49

k0machi requested changes Nov 20, 2025

soyacz added 2 commits November 21, 2025 11:38

feat: add new events to unprocessed queue

5f26326

We want to process event embeddings as they come. To avoid querying all events to find the unprocessed ones, store their keys in a new table, which is read by the similar events processor and pruned once an event is processed.

soyacz force-pushed the generate-embeddings-for-new-events branch from f650b73 to d7a53a5 Compare November 21, 2025 12:27

soyacz marked this pull request as draft November 21, 2025 14:30

feat: find similar events (new events system)

f2005b3

Similar events system for the new SCT events approach. It looks as it did before (without showing the number of similar events); attaching an issue to an event is not supported yet. refs: scylladb/qa-tasks#1967

soyacz mentioned this pull request Nov 26, 2025

Start using Scylladb Vector Store instead of Chromadb #808

Open
