Conversation

@soyacz
Collaborator

@soyacz soyacz commented Oct 31, 2025

When events arrive at Argus, we want to generate embeddings as soon as possible, so that similar/duplicate events can be viewed in close to real time. To do that, we're going to use ScyllaDB Vector Search capabilities.

This PR introduces a new 'similar_events_processor'. How it works: when a new event arrives, its pk+ck values are stored in a new table, 'unprocessed_events', which the new similar events processor queries and processes at 1s intervals (close to real time). Once an event is processed, its embeddings are saved and the processed keys are deleted from 'unprocessed_events'.

This PR does not find similar/duplicate events yet; it only calculates embeddings.
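
For illustration, the flow described above can be sketched roughly as follows. This is a minimal in-memory sketch, not the code in this PR: `InMemoryStore`, `toy_embed`, `process_pending` and `run_forever` are hypothetical names, plain Python containers stand in for the ScyllaDB tables, and the toy embedding stands in for the real model. The ERROR/CRITICAL filter is an assumption about which events get queued.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

# (partition key, clustering key) of an event; a hypothetical simplification.
EventKey = tuple[str, str]


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for the event, 'unprocessed_events' and embedding tables."""
    events: dict[EventKey, str] = field(default_factory=dict)
    unprocessed: set[EventKey] = field(default_factory=set)
    embeddings: dict[EventKey, list[float]] = field(default_factory=dict)

    def add_event(self, key: EventKey, severity: str, message: str) -> None:
        self.events[key] = message
        # Assumption: only ERROR and CRITICAL events are queued for embedding.
        if severity in {"ERROR", "CRITICAL"}:
            self.unprocessed.add(key)


def toy_embed(text: str) -> list[float]:
    """Placeholder embedding; a real model would return a dense semantic vector."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def process_pending(store: InMemoryStore,
                    embed: Callable[[str], list[float]],
                    batch_size: int = 100) -> int:
    """Embed one batch of queued events and prune their keys. Returns the batch size."""
    batch = sorted(store.unprocessed)[:batch_size]
    for key in batch:
        store.embeddings[key] = embed(store.events[key])
        # Prune the key only after the embedding has been saved.
        store.unprocessed.discard(key)
    return len(batch)


def run_forever(store: InMemoryStore,
                embed: Callable[[str], list[float]],
                interval_s: float = 1.0) -> None:
    """Poll the queue at a fixed interval (1s in the description above)."""
    while True:
        process_pending(store, embed)
        time.sleep(interval_s)
```

Because only keys are queued, the processor never has to scan the full events table to find work; it reads the small queue, embeds, and deletes what it handled.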

refs: https://github.com/scylladb/qa-tasks/issues/1967

@soyacz soyacz requested review from Copilot, fruch and k0machi October 31, 2025 15:15
@soyacz
Collaborator Author

soyacz commented Oct 31, 2025

Don't merge until Argus DB is upgraded to 2025.4.x.


Copilot AI left a comment


Pull Request Overview

This PR introduces AI-powered event similarity processing capabilities to Argus using vector embeddings. The main purpose is to enable similarity search and clustering of SCT test events (ERROR and CRITICAL severities) by implementing an event processing pipeline that generates embeddings and stores them in ScyllaDB's vector store.

Key changes:

  • Implementation of EventSimilarityProcessorV2 with BGE-Small-EN embedding model for processing test events
  • Addition of new database tables (sct_unprocessed_events, sct_error_event_embedding, sct_critical_event_embedding) with vector search capabilities
  • Integration of ScyllaDB vector store service and upgrade to ScyllaDB 2025.4.0-rc2 for native vector support
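
To make "similarity search" over embeddings concrete, here is a small pure-Python sketch of exact nearest-neighbour search by cosine similarity, which an ANN (approximate nearest neighbour) index approximates at scale. It is illustrative only: the function names and toy vectors are hypothetical, and it does not use ScyllaDB's vector store or the embedding model from this PR.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query: list[float],
                 candidates: dict[str, list[float]],
                 top_k: int = 3) -> list[tuple[str, float]]:
    """Score every candidate against the query and return the top_k best matches."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

The exhaustive scan costs one comparison per stored embedding, which is why a dedicated index becomes attractive as the number of events grows.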

Reviewed Changes

Copilot reviewed 15 out of 18 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| pyproject.toml | Added lz4 and chromadb dependencies; upgraded scylla-driver to 3.29.4; added new 'ai' optional dependency group |
| dev-db/docker-compose.yaml | Upgraded ScyllaDB to 2025.4.0-rc2 and added vector-store service for AI capabilities |
| dev-db/alpha-config/scylla.yaml | Configured vector_store_uri to connect ScyllaDB to the vector store service |
| argusAI/uv.lock | Updated lock file with new dependencies and scylla-driver upgrade |
| argusAI/pyproject.toml | Removed (replaced by root pyproject.toml configuration) |
| argusAI/event_similarity_processor_v2.py | Core implementation of event processor with embedding generation and batch processing |
| argusAI/tests/test_event_similarity_processor_v2.py | Comprehensive unit tests for the event processor |
| argus/backend/tests/sct_events/test_event_embedding_integration.py | Integration tests for end-to-end event processing flow |
| argus/backend/tests/sct_events/conftest.py | Test fixtures including embedding model stub to avoid network calls |
| argus/backend/tests/conftest.py | Upgraded test database container to ScyllaDB 2025.4.0-rc1 |
| argus/backend/plugins/sct/testrun.py | Added SCTUnprocessedEvent model; removed unused uuid4 import |
| argus/backend/plugins/sct/service.py | Added logic to create unprocessed event entries for ERROR/CRITICAL events |
| argus/backend/plugins/sct/plugin.py | Registered SCTUnprocessedEvent model in plugin |
| argus/backend/models/web.py | Imported and registered new embedding models |
| argus/backend/models/argus_ai.py | Implemented Vector column type and new embedding tables with ANN index creation |
| argus/backend/db.py | Fixed session usage in _sync_additional_rules to support index creation |


@soyacz soyacz force-pushed the generate-embeddings-for-new-events branch 2 times, most recently from 3dc18be to efcda81 Compare October 31, 2025 16:47
@k0machi
Collaborator

k0machi commented Nov 12, 2025

Could you separate the changes into several commits? Something akin to dependencies updates, actual changes and tests for those changes. It's quite a large PR to go through.

@soyacz soyacz force-pushed the generate-embeddings-for-new-events branch 2 times, most recently from 7b9a1ff to f650b73 Compare November 18, 2025 15:49
@soyacz
Collaborator Author

soyacz commented Nov 18, 2025

@k0machi I've separated it a bit, but most of the things must go together. Please review.

Collaborator

@k0machi k0machi left a comment


Some changes needed in the processor and maybe additional documentation.

- ./scylla:/var/lib/scylla:rw
- ./alpha-config:/etc/scylla:rw
- ./alpha-config.d:/etc/scylla.d:rw
vector-store:
Collaborator


Is this now a critical component of argus? This should probably be mentioned in the commit/readme and docs should be updated.

Collaborator Author


Not critical; it is only needed when using "ANN" queries, which are not here yet.

We want to process event embeddings as events come in.
To avoid querying all events to find the unprocessed ones,
store them in a new table, which is read by the similar events processor
and pruned once an event is processed.
When events arrive at Argus, we want to generate embeddings as soon as
possible, so that similar/duplicate events can be viewed in close to
real time. To do that, we're going to use ScyllaDB Vector Search
capabilities.

This PR introduces a new 'similar_events_processor'. How it works:
when a new event arrives, its pk+ck values are stored in a new table,
'unprocessed_events', which the new similar events processor queries
and processes at 1s intervals (close to real time). Once an event is
processed, its embeddings are saved and the processed keys are deleted
from 'unprocessed_events'.

This PR does not find similar/duplicate events yet; it only calculates
embeddings.

refs: scylladb/qa-tasks#1967
@soyacz soyacz force-pushed the generate-embeddings-for-new-events branch from f650b73 to d7a53a5 Compare November 21, 2025 12:27
@soyacz soyacz marked this pull request as draft November 21, 2025 14:30
@soyacz
Collaborator Author

soyacz commented Nov 21, 2025

I moved it to 'draft' because we cannot merge it until Argus DB is upgraded to 2025.4 (the vector field requires it).
I'll continue working on this PR by adding similar events search.

Similar events system for the new SCT events approach.
It looks the same as before (without showing the number
of similar events); attaching an issue to an event is not supported yet.

refs: scylladb/qa-tasks#1967


2 participants