feat: process event embeddings in realtime #835
base: master
Don't merge until Argus DB is upgraded to 2025.4.x.
Pull Request Overview
This PR introduces AI-powered event similarity processing capabilities to Argus using vector embeddings. The main purpose is to enable similarity search and clustering of SCT test events (ERROR and CRITICAL severities) by implementing an event processing pipeline that generates embeddings and stores them in ScyllaDB's vector store.
Key changes:
- Implementation of EventSimilarityProcessorV2 with BGE-Small-EN embedding model for processing test events
- Addition of new database tables (sct_unprocessed_events, sct_error_event_embedding, sct_critical_event_embedding) with vector search capabilities
- Integration of ScyllaDB vector store service and upgrade to ScyllaDB 2025.4.0-rc2 for native vector support
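For readers unfamiliar with embedding-based similarity, here is a minimal sketch of how two event embeddings could be compared. It assumes cosine similarity as the metric, which is common for BGE-style models but is not stated in this PR, and it uses toy 3-dimensional vectors rather than real model output. The function and variable names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (assumption: illustrative only).
error_a = [1.0, 0.0, 1.0]
error_b = [2.0, 0.0, 2.0]  # same direction as error_a
error_c = [0.0, 1.0, 0.0]  # orthogonal to error_a
```

Under this metric, events whose embeddings point in the same direction score close to 1, and unrelated events score close to 0.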
Reviewed Changes
Copilot reviewed 15 out of 18 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| pyproject.toml | Added lz4 and chromadb dependencies; upgraded scylla-driver to 3.29.4; added new 'ai' optional dependency group |
| dev-db/docker-compose.yaml | Upgraded ScyllaDB to 2025.4.0-rc2 and added vector-store service for AI capabilities |
| dev-db/alpha-config/scylla.yaml | Configured vector_store_uri to connect ScyllaDB to the vector store service |
| argusAI/uv.lock | Updated lock file with new dependencies and scylla-driver upgrade |
| argusAI/pyproject.toml | Removed (replaced by root pyproject.toml configuration) |
| argusAI/event_similarity_processor_v2.py | Core implementation of event processor with embedding generation and batch processing |
| argusAI/tests/test_event_similarity_processor_v2.py | Comprehensive unit tests for the event processor |
| argus/backend/tests/sct_events/test_event_embedding_integration.py | Integration tests for end-to-end event processing flow |
| argus/backend/tests/sct_events/conftest.py | Test fixtures including embedding model stub to avoid network calls |
| argus/backend/tests/conftest.py | Upgraded test database container to ScyllaDB 2025.4.0-rc1 |
| argus/backend/plugins/sct/testrun.py | Added SCTUnprocessedEvent model; removed unused uuid4 import |
| argus/backend/plugins/sct/service.py | Added logic to create unprocessed event entries for ERROR/CRITICAL events |
| argus/backend/plugins/sct/plugin.py | Registered SCTUnprocessedEvent model in plugin |
| argus/backend/models/web.py | Imported and registered new embedding models |
| argus/backend/models/argus_ai.py | Implemented Vector column type and new embedding tables with ANN index creation |
| argus/backend/db.py | Fixed session usage in _sync_additional_rules to support index creation |
3dc18be to efcda81
Could you separate the changes into several commits? Something akin to dependencies updates, actual changes and tests for those changes. It's quite a large PR to go through.
7b9a1ff to f650b73
@k0machi I've separated it a bit, but most of the things must go together. Please review.
k0machi
left a comment
Some changes needed in the processor and maybe additional documentation.
- ./scylla:/var/lib/scylla:rw
- ./alpha-config:/etc/scylla:rw
- ./alpha-config.d:/etc/scylla.d:rw
vector-store:
Is this now a critical component of argus? This should probably be mentioned in the commit/readme and docs should be updated.
Not critical; it is only needed for "ANN" queries, which are not here yet.
We want to process event embeddings as they come in. To avoid querying all events to find the unprocessed ones, store them in a new table, which the similar events processor reads and prunes once an event is processed.
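The write path described above could be sketched roughly as follows. This is an in-memory illustration, not the code from this PR: the `EventKey` fields, the `InMemoryStore` class, and `submit_event` are hypothetical stand-ins for the real tables and service logic, and the actual key columns are not shown in this conversation.

```python
from dataclasses import dataclass, field

# Severities that get embeddings, per the PR overview.
EMBEDDED_SEVERITIES = {"ERROR", "CRITICAL"}


@dataclass(frozen=True)
class EventKey:
    """Hypothetical primary key; the real pk+ck columns may differ."""
    run_id: str
    severity: str
    ts: float


@dataclass
class InMemoryStore:
    """Stand-in for the events table and the unprocessed-events table."""
    events: dict = field(default_factory=dict)
    unprocessed: set = field(default_factory=set)


def submit_event(store: InMemoryStore, run_id: str, severity: str,
                 ts: float, message: str) -> EventKey:
    """Store the event and, for ERROR/CRITICAL, record its key for later embedding."""
    key = EventKey(run_id, severity, ts)
    store.events[key] = message
    if severity in EMBEDDED_SEVERITIES:
        # Only the key is queued, so the processor never scans the full events table.
        store.unprocessed.add(key)
    return key
```

Keeping only keys in the side table keeps it small, and the processor can look up the full event when it is ready to embed it.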
When events arrive at Argus, we want to generate embeddings as soon as possible so that similar/duplicate events can be viewed in close to real time. To do that, we're going to use ScyllaDB Vector Search capabilities.
This PR introduces a new 'similar_events_processor'. How it works: when a new event arrives, its pk+ck values are stored in a new table, 'unprocessed_events', which the new similar events processor queries and processes at 1s intervals (close to real time). Once an event is processed, its embeddings are saved and the processed keys are deleted from 'unprocessed_events'.
This PR does not yet find similar/duplicate events; it only calculates embeddings.
refs: scylladb/qa-tasks#1967
f650b73 to d7a53a5
I moved it to 'draft' because we cannot merge it until Argus DB is upgraded to 2025.4 (the vector field requires it).
Similar events system for the new SCT events approach. It looks the same as before (without showing the number of similar events); attaching an issue to an event is not supported yet. refs: scylladb/qa-tasks#1967
When events arrive at Argus, we want to generate embeddings as soon as possible so that similar/duplicate events can be viewed in close to real time. To do that, we're going to use ScyllaDB Vector Search capabilities.
This PR introduces a new 'similar_events_processor'. How it works: when a new event arrives, its pk+ck values are stored in a new table, 'unprocessed_events', which the new similar events processor queries and processes at 1s intervals (close to real time). Once an event is processed, its embeddings are saved and the processed keys are deleted from 'unprocessed_events'.
This PR does not yet find similar/duplicate events; it only calculates embeddings.
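The processing loop described above might look roughly like this. It is a minimal in-memory sketch under stated assumptions: `fake_embed` is a deterministic placeholder rather than the BGE model, plain dicts and sets stand in for the ScyllaDB tables, and `process_once` and `run` are hypothetical names, not the PR's API.

```python
import time

# Hypothetical key shape standing in for the real pk+ck values.
Key = tuple[str, str, float]


def fake_embed(text: str) -> list[float]:
    """Deterministic placeholder; NOT a real embedding model."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def process_once(unprocessed: set, events: dict, embeddings: dict,
                 embed=fake_embed) -> int:
    """Embed every queued event, then prune its key. Returns the batch size."""
    batch = list(unprocessed)
    for key in batch:
        message = events.get(key)
        if message is not None:
            # Save the embedding before deleting the key, so a crash between
            # the two steps causes a retry rather than a lost event.
            embeddings[key] = embed(message)
        # In this sketch, keys whose event row is missing are dropped too.
        unprocessed.discard(key)
    return len(batch)


def run(unprocessed: set, events: dict, embeddings: dict,
        interval: float = 1.0, max_iterations=None, sleep=time.sleep) -> None:
    """Poll the unprocessed set every `interval` seconds (1s in the PR text)."""
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        process_once(unprocessed, events, embeddings)
        sleep(interval)
        iterations += 1
```

Passing `sleep` and `max_iterations` as parameters is a choice made for this sketch so the loop can be exercised in tests without waiting; a real service would likely run until stopped.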
refs: https://github.com/scylladb/qa-tasks/issues/1967