This commit adds Apache Doris as a new vector database option for Dify's RAG system.
Features:
- Vector similarity search using cosine distance
- Full-text search with BM25 scoring and inverted indexes
- Hybrid search combining vector and text search
- High-performance bulk data loading via StreamLoad
- Connection pooling for efficient resource management
- Support for multi-tenant isolation
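To illustrate the first and third features, here is a minimal sketch of cosine distance and of one possible way to merge vector and text results. The function names, the weighted-sum fusion, and the assumption that both score maps are already normalized to [0, 1] are illustrative only; the commit message does not say how DorisVector combines the two result sets.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def hybrid_rank(vector_scores, text_scores, vector_weight=0.7):
    """Merge two {doc_id: score} maps with a weighted sum (hypothetical).

    Both maps are assumed to hold scores in [0, 1] where higher is better.
    A document missing from one map contributes 0 for that side.
    """
    doc_ids = set(vector_scores) | set(text_scores)
    combined = {
        doc_id: vector_weight * vector_scores.get(doc_id, 0.0)
        + (1.0 - vector_weight) * text_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Best score first.
    return sorted(combined, key=combined.get, reverse=True)
```

Raw BM25 scores are unbounded, so a real implementation would need some normalization step before a weighted sum like this is meaningful.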
Components added:
- DorisVector: Main vector database implementation with cleaned code
- DorisConfig: Configuration model with validation
- DorisConnectionPool: Thread-safe connection management
- DorisVectorFactory: Factory for creating Doris instances
- DORIS_SETUP.md: Complete setup guide in English
- Add type annotations to fix mypy errors (pool config dict, params lists)
- Add USE database statement in _get_cursor to ensure database context
- Add _wait_for_table_normal_state method to wait for schema changes before creating text index (fixes index creation race condition)
- Extend Redis lock timeout to accommodate schema change waiting
- Update unit tests to account for new USE database statement
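The schema-change wait described above can be sketched as a polling loop with a deadline. This is not the PR's `_wait_for_table_normal_state`; the `fetch_state` callable, the `"NORMAL"` sentinel, and the default timeout and interval are assumptions made for illustration.

```python
import time
from typing import Callable


def wait_for_normal_state(
    fetch_state: Callable[[], str],
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll fetch_state() until it returns "NORMAL" or the timeout expires.

    fetch_state is a hypothetical callable that reports the table's
    current state as a string. Returns True if the table reached the
    normal state in time and False otherwise.
    """
    deadline = clock() + timeout
    while True:
        if fetch_state() == "NORMAL":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Any lock held around the index creation needs a timeout at least as long as the wait, which is presumably why the Redis lock timeout was extended in the same change.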
DeepSource reviewed changes in the commit range b76c8fa..413534b on this pull request. Below is the summary for the review, and you can see the individual issues we found as review comments.
Some issues found as part of this review are outside of the diff in this pull request and aren't shown in the inline review comments due to GitHub's API limitations. Please see the DeepSource dashboard for this PR to view those issues.
PR Report Card
Security
× 3 issues
Overall PR Quality
Focus Area: Security
Guidance: Fix the SQL injection risks from string-based query construction in api/core/rag/datasource/vdb/doris/doris_vector.py.
Using `hashlib.md5` exposes the code to collision attacks
The use of hashlib.md5 for hashing self.table_name is insecure due to MD5's susceptibility to collision attacks. Attackers could create different inputs yielding the same hash, enabling spoofing or impersonation risks.
Replace hashlib.md5 with stronger algorithms like hashlib.sha256 or hashlib.sha512 to ensure cryptographic security and collision resistance.
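A minimal sketch of the suggested replacement follows. The review does not show what the hash of `self.table_name` is used for, so the function name and the `doris_` prefix below are hypothetical.

```python
import hashlib


def hashed_table_key(table_name: str) -> str:
    """Derive a stable key from a table name using SHA-256.

    The "doris_" prefix and the purpose of the key are assumptions;
    only the swap from md5 to sha256 reflects the review suggestion.
    """
    digest = hashlib.sha256(table_name.encode("utf-8")).hexdigest()
    return f"doris_{digest}"
```

If the hash only serves as a non-security identifier, Python 3.9+ also accepts `hashlib.md5(data, usedforsecurity=False)`, which documents that intent, although a static analyzer may still flag it.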
String-based query with dynamic placeholders allows SQL injection
The code builds an SQL query string inside cur.execute using an f-string that interpolates the dynamic table name self.table_name and a placeholder string. If self.table_name or the placeholders come from user input or other untrusted sources, this can lead to SQL injection, allowing attackers to manipulate the query and compromise the database.
Replace dynamic string interpolation with parameterized queries and avoid including untrusted data like self.table_name directly in the SQL string. Use query parameters for conditions and ensure table name is sanitized or fixed.
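One way to follow this advice is sketched below: validate the table name against a strict pattern before quoting it, and pass every value as a bound parameter. The helper names and the `id` column are hypothetical, and the `%s` placeholder style is the one used by common MySQL-protocol drivers such as PyMySQL; which driver the PR uses is not stated here.

```python
import re

# Conservative allow-list for identifiers: letters, digits, underscore.
_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,63}")


def quote_identifier(name: str) -> str:
    """Return a backtick-quoted identifier, rejecting anything unusual."""
    if not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return f"`{name}`"


def build_delete_by_ids(table_name: str, ids: list) -> tuple:
    """Build (sql, params) for deleting rows by id (hypothetical helper).

    Only the validated identifier and generated "%s" markers are
    interpolated; the id values travel separately as parameters.
    """
    if not ids:
        raise ValueError("ids must not be empty")
    placeholders = ", ".join(["%s"] * len(ids))
    sql = f"DELETE FROM {quote_identifier(table_name)} WHERE id IN ({placeholders})"
    return sql, list(ids)
```

The result would be passed to the driver as `cur.execute(sql, params)`, so the values are never concatenated into the SQL text.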
String-based SQL query construction risks injection
Using string interpolation with self.table_name in SQL query construction can lead to SQL injection if these inputs can be manipulated, enabling attackers to execute arbitrary SQL commands. The risk specifically involves self.table_name and JSON path values which are inserted without parameterization.
Use parameterized SQL queries and avoid direct string insertion for table names and query fragments. Ensure any dynamic parts are strictly validated or use query builders that safely encode identifiers.
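For the JSON path values mentioned in this comment, a strict allow-list on the key is one possible form of validation. The sketch below is an assumption about how such a check might look; the function name and the `$.key` path shape are illustrative, not taken from doris_vector.py.

```python
import re

# Accept only simple metadata keys: letters, digits, underscore.
_METADATA_KEY_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def json_path_for_key(key: str) -> str:
    """Return a JSON path like "$.document_id" for a validated key.

    Keys containing quotes, spaces, or other punctuation are rejected,
    so the returned path cannot carry extra SQL text.
    """
    if not _METADATA_KEY_RE.fullmatch(key):
        raise ValueError(f"unsafe metadata key: {key!r}")
    return f"$.{key}"
```

Where the driver allows it, the resulting path can itself be passed as a bound parameter rather than interpolated into the query string.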
Unbound method without `@staticmethod` wastes memory
The method test_valid_config takes a self parameter but never uses the instance, so every call through an instance creates a bound method object it does not need, which adds minor runtime overhead.
Add the @staticmethod decorator to test_valid_config to indicate that it does not require an instance. This avoids the bound-method creation on each call.
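The suggested change has roughly the following shape. The class name and the config dictionary are placeholders; the real test body is not shown in the review.

```python
class TestDorisConfigSketch:
    """Hypothetical test class illustrating the @staticmethod suggestion."""

    @staticmethod
    def test_valid_config():
        # Placeholder values; the real test presumably builds a DorisConfig.
        config = {"host": "localhost", "port": 9030}
        assert config["host"]
        assert 0 < config["port"] < 65536
```

The method no longer takes `self`, and it can be called on either the class or an instance.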
Unused `pool` variable wastes memory and confuses
The variable pool is assigned DorisConnectionPool(self.config) but is never referenced afterwards, which wastes memory and can confuse developers about its intent.
Remove the unused pool variable, or rename it to _pool or _ if the assignment is kept deliberately.