fix: Use the identifier column as hash if available#9405
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughUpdates KB ingestion defaults to include an identifier column and modifies ingestion logic to derive page content (and document IDs) from identifier columns when present, otherwise from vectorized content columns. Starter project JSON is updated to reflect the new defaults and component code hash. Changes
Sequence Diagram(s)sequenceDiagram
participant UI as UI/Starter Project
participant KB as KBIngestion Component
participant Conv as _convert_df_to_data_objects
participant Embed as Embedder
participant Store as Vector Store
UI->>KB: Provide DataFrame + column_config (text + id)
KB->>Conv: Convert rows to data objects
alt Identifier columns present
Conv->>Conv: Build page_content from identifier cols
else No identifier columns
Conv->>Conv: Build page_content from vectorized content cols
end
Conv->>Embed: Embed page_content
Embed->>Store: Upsert vectors (IDs/hash from page_content)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
Status, Documentation and Community
|
There was a problem hiding this comment.
Actionable comments posted: 0
🔭 Outside diff range comments (3)
src/backend/base/langflow/components/data/kb_ingest.py (2)
182-201: Validate column config: prevent conflicts and enforce at least one vectorized columnEdge cases to guard:
- A column marked both vectorize and identifier (conflicting roles).
- No vectorized columns selected (would embed empty strings).
Add validation here to fail fast with actionable messages.
Apply this diff:
def _validate_column_config(self, df_source: pd.DataFrame) -> list[dict[str, Any]]: """Validate column configuration using Structured Output patterns.""" if not self.column_config: msg = "Column configuration cannot be empty" raise ValueError(msg) @@ - # Validate column names exist in DataFrame + # Validate column names exist in DataFrame and config semantics df_columns = set(df_source.columns) + seen_cols: set[str] = set() + vectorized_cols: list[str] = [] for config in config_list: col_name = config.get("column_name") if col_name not in df_columns and not self.silent_errors: msg = f"Column '{col_name}' not found in DataFrame. Available columns: {sorted(df_columns)}" self.log(f"Warning: {msg}") raise ValueError(msg) + if col_name in seen_cols: + msg = f"Duplicate column configuration for '{col_name}'. Please configure each column only once." + raise ValueError(msg) + seen_cols.add(col_name) + + # Normalize booleans that might come as strings + vectorize = config.get("vectorize") == "True" or config.get("vectorize") is True + identifier = config.get("identifier") == "True" or config.get("identifier") is True + + if vectorize and identifier: + msg = f"Column '{col_name}' cannot be both vectorized and an identifier." + raise ValueError(msg) + if vectorize: + vectorized_cols.append(col_name) - return config_list + if not vectorized_cols: + msg = "At least one column must be set to 'Vectorize' to generate embeddings." + raise ValueError(msg) + + return config_list
394-445: Clarify identifier vs content roles, strengthen hashing, and use O(1) duplicate checksMisleading naming, brittle ID construction, and O(N) duplicate checks in kb_ingest.py — apply the patch below.
- File: src/backend/base/langflow/components/data/kb_ingest.py (around lines 394–445)
- # Get all documents and their metadata - all_docs = chroma.get() - - # Extract all _id values from metadata - id_list = [metadata.get("_id") for metadata in all_docs["metadatas"] if metadata.get("_id")] + # Get all documents and their metadata (opt: use include=['metadatas'] if supported) + all_docs = chroma.get() + # Extract existing _id values and use a set for O(1) lookups + existing_ids = [metadata.get("_id") for metadata in all_docs.get("metadatas", []) if metadata.get("_id")] + id_set = set(existing_ids) @@ - for _, row in df_source.iterrows(): - # Build content text from identifier columns using list comprehension - identifier_parts = [str(row[col]) for col in content_cols if col in row and pd.notna(row[col])] - - # Join all parts into a single string - page_content = " ".join(identifier_parts) - - # Build metadata from NON-vectorized columns only (simple key-value pairs) - data_dict = { - "text": page_content, # Main content for vectorization - } - - # Add identifier columns if they exist - if identifier_cols: - identifier_parts = [str(row[col]) for col in identifier_cols if col in row and pd.notna(row[col])] - page_content = " ".join(identifier_parts) - - # Add metadata columns as simple key-value pairs - for col in df_source.columns: - if col not in content_cols and col in row and pd.notna(row[col]): - # Convert to simple types for Chroma metadata - value = row[col] - data_dict[col] = str(value) # Convert complex types to string - - # Hash the page_content for unique ID - page_content_hash = hashlib.sha256(page_content.encode()).hexdigest() - data_dict["_id"] = page_content_hash - - # If duplicates are disallowed, and hash exists, prevent adding this row - if not self.allow_duplicates and page_content_hash in id_list: - self.log(f"Skipping duplicate row with hash {page_content_hash}") - continue + for _, row in df_source.iterrows(): + # Build content text from vectorized content columns (this is what we embed) + content_parts = [str(row[col]) for col in content_cols if col in row and pd.notna(row[col])] + content_text = " ".join(content_parts) + + # Main content for vectorization (identifiers intentionally excluded) + data_dict = {"text": content_text} + + # Build identifier string (if any) solely for stable hashing/IDs. + # Include column names and a non-printable delimiter to avoid collisions. + if identifier_cols: + id_parts = [ + f"{col}={str(row[col])}" for col in identifier_cols if col in row and pd.notna(row[col]) + ] + id_string = "\x1F".join(id_parts) # unit separator + else: + id_string = "" + + # Add metadata columns as simple key-value pairs (exclude vectorized content cols) + for col in df_source.columns: + if col not in content_cols and col in row and pd.notna(row[col]): + data_dict[col] = str(row[col]) + + # Hash identifier string when present; otherwise fallback to content_text + id_basis = id_string if id_string else content_text + row_hash = hashlib.sha256(id_basis.encode("utf-8")).hexdigest() + data_dict["_id"] = row_hash + + # If duplicates are disallowed, and hash exists, prevent adding this row (O(1) check) + if not self.allow_duplicates and row_hash in id_set: + self.log(f"Skipping duplicate row with hash {row_hash}") + continueNotes:
- The change clarifies variable intent (content_text vs id_string), makes ID generation robust (column names + unit separator), and replaces O(N) membership tests with O(1) set lookups.
- Optional: if chroma.get supports include=["metadatas"], use it to avoid loading unnecessary data.
src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (1)
862-869: Starter project default column_config still missing the identifier rowThe node’s template "column_config.value" only includes the "text" row. This contradicts the new default in the component and will ship a project without the "id" identifier preconfigured.
Update the value to include the "id" identifier row.
Apply this diff:
- "value": [ - { - "column_name": "text", - "identifier": false, - "vectorize": true - } - ] + "value": [ + { + "column_name": "text", + "vectorize": true, + "identifier": false + }, + { + "column_name": "id", + "vectorize": false, + "identifier": true + } + ]
🧹 Nitpick comments (1)
src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (1)
798-799: Embedded component code updated — keep parity with kb_ingest.pyThe embedded code reflects the new identifier handling. If you accept the refactor in kb_ingest.py (content_text/id_string and robust hashing), mirror the same changes here to avoid divergence between the starter project and the installed component.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
src/backend/base/langflow/components/data/kb_ingest.py(2 hunks)src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json(2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
src/backend/base/langflow/components/**/*.py
📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)
src/backend/base/langflow/components/**/*.py: Add new backend components to the appropriate subdirectory under src/backend/base/langflow/components/
Implement async component methods using async def and await for asynchronous operations
Use asyncio.create_task for background work in async components and ensure proper cleanup on cancellation
Use asyncio.Queue for non-blocking queue operations in async components and handle timeouts appropriately
Files:
src/backend/base/langflow/components/data/kb_ingest.py
{src/backend/**/*.py,tests/**/*.py,Makefile}
📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)
{src/backend/**/*.py,tests/**/*.py,Makefile}: Run make format_backend to format Python code before linting or committing changes
Run make lint to perform linting checks on backend Python code
Files:
src/backend/base/langflow/components/data/kb_ingest.py
src/backend/**/components/**/*.py
📄 CodeRabbit Inference Engine (.cursor/rules/icons.mdc)
In your Python component class, set the
iconattribute to a string matching the frontend icon mapping exactly (case-sensitive).
Files:
src/backend/base/langflow/components/data/kb_ingest.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test-starter-projects
🔇 Additional comments (2)
src/backend/base/langflow/components/data/kb_ingest.py (1)
139-149: Good default: identifier column added and excluded from vectorizationAdding the "id" column with identifier=True and vectorize=False aligns with the intent to keep IDs out of embeddings. This will help deduplication and stable IDs.
src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (1)
705-707: Code hash updated — OKThe starter project points to the updated component code. Keep the embedded code string in sync with the actual component file when applying follow-up refactors.
|
perpect~! |
edwinjosechittilappilly
left a comment
There was a problem hiding this comment.
Isn the indentifier to be true for text by default.
Co-authored-by: Edwin Jose <edwin.jose@datastax.com>
|
* fix: Use the identifier column as hash if available * Update kb_ingest.py * [autofix.ci] apply automated fixes * Update src/backend/base/langflow/components/data/kb_ingest.py Co-authored-by: Edwin Jose <edwin.jose@datastax.com> * [autofix.ci] apply automated fixes * [autofix.ci] apply automated fixes --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com> Co-authored-by: Edwin Jose <edwin.jose@datastax.com>
* fix: Use the identifier column as hash if available * Update kb_ingest.py * [autofix.ci] apply automated fixes * Update src/backend/base/langflow/components/data/kb_ingest.py Co-authored-by: Edwin Jose <edwin.jose@datastax.com> * [autofix.ci] apply automated fixes * [autofix.ci] apply automated fixes --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com> Co-authored-by: Edwin Jose <edwin.jose@datastax.com>
* fix: Use the identifier column as hash if available * Update kb_ingest.py * [autofix.ci] apply automated fixes * Update src/backend/base/langflow/components/data/kb_ingest.py Co-authored-by: Edwin Jose <edwin.jose@datastax.com> * [autofix.ci] apply automated fixes * [autofix.ci] apply automated fixes --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com> Co-authored-by: Edwin Jose <edwin.jose@datastax.com>



This pull request updates the knowledge base ingestion process to improve how identifier columns are handled and refines the data object conversion logic. The most important changes are focused on the configuration of input columns and the row conversion logic, ensuring identifiers are properly separated from content intended for vectorization.
Input configuration improvements:
idin theNewKnowledgeBaseInputclass, marking it as an identifier and not for vectorization. This makes it easier to distinguish between content and unique identifiers during ingestion.Row conversion logic updates:
_convert_df_to_data_objectsmethod to explicitly separate identifier columns from content columns when building the main content text and metadata, ensuring identifiers are handled distinctly and not vectorized.Project configuration update:
code_hashin theKnowledge Ingestion.jsonstarter project to reflect the latest code changes.Summary by CodeRabbit
New Features
Chores