Add Apache Doris vector store support #7

Open: jaffrey-deepsource wants to merge 3 commits into main from doris_vector

@jaffrey-deepsource

No description provided.

dataroaring and others added 3 commits January 13, 2026 08:41
This commit adds Apache Doris as a new vector database option for Dify's RAG system.

Features:
- Vector similarity search using cosine distance
- Full-text search with BM25 scoring and inverted indexes
- Hybrid search combining vector and text search
- High-performance bulk data loading via StreamLoad
- Connection pooling for efficient resource management
- Support for multi-tenant isolation

Components added:
- DorisVector: Main vector database implementation with cleaned code
- DorisConfig: Configuration model with validation
- DorisConnectionPool: Thread-safe connection management
- DorisVectorFactory: Factory for creating Doris instances
- DORIS_SETUP.md: Complete setup guide in English
- Add type annotations to fix mypy errors (pool config dict, params lists)
- Add USE database statement in _get_cursor to ensure database context
- Add _wait_for_table_normal_state method to wait for schema changes
  before creating text index (fixes index creation race condition)
- Extend Redis lock timeout to accommodate schema change waiting
- Update unit tests to account for new USE database statement
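
The schema-change wait mentioned in the commit notes above can be pictured as a bounded polling loop. The sketch below is only an illustration and is not the PR's `_wait_for_table_normal_state`: the `get_state` callable, the `"NORMAL"` state string, and the default timeout values are all assumptions made for the example.

```python
import time
from typing import Callable


def wait_for_normal_state(
    get_state: Callable[[], str],
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_state() until it reports "NORMAL" or the timeout elapses.

    get_state is a hypothetical callable that would query the table state;
    the "NORMAL" value is assumed here for illustration.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_state() == "NORMAL":
            return True  # safe to proceed, e.g. with index creation
        if time.monotonic() >= deadline:
            return False  # caller decides whether to raise or retry
        sleep(interval)
```

Any lock held around such a wait needs a timeout longer than the wait itself, which is presumably why the same commit extends the Redis lock timeout.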
@deepsource-development

deepsource-development Bot commented Feb 12, 2026

DeepSource Code Review

DeepSource reviewed changes in the commit range b76c8fa..413534b on this pull request. Below is the summary for the review, and you can see the individual issues we found as review comments.

For detailed review results, please see the PR on DeepSource ↗

Important

Some issues found as part of this review are outside of the diff in this pull request and aren't shown in the inline review comments due to GitHub's API limitations. Please see the DeepSource dashboard for this PR to view those issues.

PR Report Card

Focus Area: Security

Guidance: Fix the SQL injection risks from string-based query construction in api/core/rag/datasource/vdb/doris/doris_vector.py.

- Security × 3 issues
- Reliability × 1 issue
- Complexity × 2 issues
- Hygiene × 2 issues

Code Review Summary

| Analyzer | Summary | Details |
| --- | --- | --- |
| Python | 9 new issues detected. 1 existing issue fixed. | Review ↗ |
| Secrets | No new issues detected. | Review ↗ |
How are these analyzer statuses calculated?

Administrators can configure which issue categories are reported and cause analysis to be marked as failed when detected. This helps prevent bad and insecure code from being introduced in the codebase. If you're an administrator, you can modify this in the repository's settings.


💡 If you're a repository administrator, you can configure the quality gates from the settings.

# Table name format: embedding_ + collection_name
# collection_name already includes Vector_index_ prefix and _Node suffix from Dataset.gen_collection_name_by_id
self.table_name = f"embedding_{collection_name}"
self.index_hash = hashlib.md5(self.table_name.encode()).hexdigest()[:8]

Using `hashlib.md5` exposes the code to collision attacks


The use of hashlib.md5 for hashing self.table_name is insecure due to MD5's susceptibility to collision attacks. Attackers could create different inputs yielding the same hash, enabling spoofing or impersonation risks.

Replace hashlib.md5 with stronger algorithms like hashlib.sha256 or hashlib.sha512 to ensure cryptographic security and collision resistance.
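
A minimal sketch of the suggested replacement, assuming the hash only needs to be a short, deterministic suffix derived from the table name. The helper name `short_index_hash` is hypothetical and does not appear in the PR.

```python
import hashlib


def short_index_hash(table_name: str, length: int = 8) -> str:
    """Return the first `length` hex characters of the SHA-256 digest of table_name."""
    return hashlib.sha256(table_name.encode()).hexdigest()[:length]
```

Two caveats apply to this sketch. Truncating any digest to 8 hex characters leaves 32 bits, so the truncation, not the algorithm, bounds collision resistance. Switching algorithms would also change every name derived from `index_hash`, which matters if tables created with the MD5-based value already exist.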

with self._get_cursor() as cur:
    try:
        placeholders = ",".join(["%s"] * len(ids))
        cur.execute(f"DELETE FROM `{self.table_name}` WHERE id IN ({placeholders})", ids)

String-based query with dynamic placeholders allows SQL injection


The code builds the SQL string with an f-string that interpolates the dynamic table name self.table_name and a generated placeholder list before passing it to cur.execute. If self.table_name or placeholders comes from user input or untrusted sources, it can lead to SQL injection, allowing attackers to manipulate the query and compromise the database.

Replace dynamic string interpolation with parameterized queries and avoid including untrusted data like self.table_name directly in the SQL string. Use query parameters for conditions and ensure the table name is sanitized or fixed.
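
Since SQL drivers generally cannot bind identifiers as parameters, one way to follow this advice is to validate the table name against an allow-list pattern and keep the values parameterized. The following is a sketch under that assumption; `build_delete_by_ids` and the identifier pattern are hypothetical and are not part of the PR.

```python
import re
from typing import List, Sequence, Tuple

# Assumed allow-list: letters, digits, and underscores only.
_IDENTIFIER_RE = re.compile(r"[A-Za-z0-9_]+")


def build_delete_by_ids(table_name: str, ids: Sequence[str]) -> Tuple[str, List[str]]:
    """Return a DELETE statement and its parameters for the given ids."""
    if not _IDENTIFIER_RE.fullmatch(table_name):
        raise ValueError(f"unsafe table name: {table_name!r}")
    if not ids:
        raise ValueError("ids must not be empty")
    # Only "%s" markers are interpolated here; the id values stay in the params list.
    placeholders = ",".join(["%s"] * len(ids))
    sql = f"DELETE FROM `{table_name}` WHERE id IN ({placeholders})"
    return sql, list(ids)
```

The returned pair would be passed to the cursor as `cur.execute(sql, params)`, so the driver handles escaping of the id values.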

try:
    # Use JSON_EXTRACT for JSON field access
    cur.execute(
        f"DELETE FROM `{self.table_name}` WHERE JSON_EXTRACT(meta, %s) = %s",

String-based SQL query construction risks injection


Using string interpolation with self.table_name in SQL query construction can lead to SQL injection if these inputs can be manipulated, enabling attackers to execute arbitrary SQL commands. The risk specifically involves self.table_name and JSON path values which are inserted without parameterization.

Use parameterized SQL queries and avoid direct string insertion for table names and query fragments. Ensure any dynamic parts are strictly validated or use query builders that safely encode identifiers.
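
For the metadata delete, the same idea extends to the JSON path. The excerpt above is truncated, so the arguments actually passed to `cur.execute` are unknown; this sketch assumes a `$.key` path built from a validated metadata key. The helper `build_delete_by_metadata` is hypothetical.

```python
import re
from typing import List, Tuple

# Assumed allow-list for both the table name and the metadata key.
_SAFE_NAME_RE = re.compile(r"[A-Za-z0-9_]+")


def build_delete_by_metadata(table_name: str, key: str, value: str) -> Tuple[str, List[str]]:
    """Return a DELETE statement that matches rows by one metadata field."""
    for name in (table_name, key):
        if not _SAFE_NAME_RE.fullmatch(name):
            raise ValueError(f"unsafe name: {name!r}")
    sql = f"DELETE FROM `{table_name}` WHERE JSON_EXTRACT(meta, %s) = %s"
    # The JSON path and the comparison value are both bound as parameters.
    return sql, [f"$.{key}", value]
```

Validating the key before building the path keeps characters such as quotes or brackets from altering which field the path addresses.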

class TestDorisConfig(unittest.TestCase):
    """Tests for DorisConfig validation."""

    def test_valid_config(self):

Unbound method without `@staticmethod` wastes memory


The method test_valid_config is defined with a self parameter but does not use the instance context, leading Python to create a bound method for every instance which wastes memory and adds minor runtime overhead.

Add the @staticmethod decorator to test_valid_config to indicate it does not require an instance. This reduces memory and computational cost by avoiding bound method creation.
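
For `unittest` test methods, an alternative to the decorator is to have the method use `self` through the framework's assertion helpers, which also yields clearer failure messages. The sketch below illustrates that alternative only; it is not the PR's test, and `ExampleConfig` with its fields is a hypothetical stand-in for DorisConfig.

```python
import unittest
from dataclasses import dataclass


@dataclass
class ExampleConfig:
    """Hypothetical stand-in for DorisConfig; the fields are illustrative."""

    host: str
    port: int


class TestExampleConfig(unittest.TestCase):
    """Tests for ExampleConfig construction."""

    def test_valid_config(self):
        config = ExampleConfig(host="localhost", port=9030)
        # self.assert* ties the method to the TestCase instance.
        self.assertEqual(config.host, "localhost")
        self.assertEqual(config.port, 9030)
```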

mock_pool = MagicMock()
mock_pool_class.return_value = mock_pool

pool = DorisConnectionPool(self.config)

Unused `pool` variable wastes memory and confuses


The variable pool is assigned with DorisConnectionPool(self.config) but is never referenced later, which wastes memory and can confuse developers about its intent or usage.

Remove the unused pool variable, or rename it to _pool or _ if the assignment is intentionally unused and kept for clarity.
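
When an object is constructed only for its side effect, the call can stand alone and the test can assert on the mock instead. A small sketch of that pattern follows; `ExamplePool`, its `pool_factory` argument, and `check_pool_created` are hypothetical and do not reflect the real DorisConnectionPool signature.

```python
from unittest.mock import MagicMock


class ExamplePool:
    """Hypothetical stand-in for DorisConnectionPool."""

    def __init__(self, config: dict, pool_factory):
        # The underlying pool is created as a side effect of construction.
        self._pool = pool_factory(**config)


def check_pool_created() -> MagicMock:
    """Construct the pool without binding it, then verify the factory call."""
    factory = MagicMock()
    ExamplePool({"pool_size": 5}, factory)  # no unused local variable
    factory.assert_called_once_with(pool_size=5)
    return factory
```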

2 participants