fix: Use the identifier column as hash if available by erichare · Pull Request #9405 · langflow-ai/langflow

erichare · 2025-08-15T00:48:45Z

This pull request updates the knowledge base ingestion process to improve how identifier columns are handled and refines the data object conversion logic. The most important changes are focused on the configuration of input columns and the row conversion logic, ensuring identifiers are properly separated from content intended for vectorization.

Input configuration improvements:

Added a new column configuration for id in the NewKnowledgeBaseInput class, marking it as an identifier and not for vectorization. This makes it easier to distinguish between content and unique identifiers during ingestion.
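To make the split between content and identifier columns concrete, here is a minimal sketch. It assumes the configuration is a list of dicts with the `column_name`, `vectorize`, and `identifier` keys that appear in this PR's starter project JSON. The names `DEFAULT_COLUMN_CONFIG`, `_is_true`, and `split_columns` are hypothetical and not taken from `kb_ingest.py`.

```python
from typing import Any

# Hypothetical default, mirroring the keys shown in the starter project JSON.
DEFAULT_COLUMN_CONFIG: list[dict[str, Any]] = [
    {"column_name": "text", "vectorize": True, "identifier": False},
    {"column_name": "id", "vectorize": False, "identifier": True},
]


def _is_true(value: Any) -> bool:
    """Accept both the boolean True and the string "True" as enabled."""
    return value is True or value == "True"


def split_columns(config: list[dict[str, Any]]) -> tuple[list[str], list[str]]:
    """Return (content_cols, identifier_cols) for a column configuration."""
    content_cols = [c["column_name"] for c in config if _is_true(c.get("vectorize"))]
    identifier_cols = [c["column_name"] for c in config if _is_true(c.get("identifier"))]
    return content_cols, identifier_cols


content_cols, identifier_cols = split_columns(DEFAULT_COLUMN_CONFIG)
print(content_cols, identifier_cols)
```

With the default shown here, only `text` would be treated as content for embedding and only `id` as an identifier.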

Row conversion logic updates:

Refactored the _convert_df_to_data_objects method to explicitly separate identifier columns from content columns when building the main content text and metadata, ensuring identifiers are handled distinctly and not vectorized.

Project configuration update:

Updated the code_hash in the Knowledge Ingestion.json starter project to reflect the latest code changes.

Summary by CodeRabbit

New Features
- Knowledge Ingestion now includes a default identifier column (“id”) in the column configuration.
- Ingestion prioritizes identifier columns for embeddings and duplicate detection when present, falling back to text content otherwise.
- UI supports identifier-aware setup out of the box for smoother configuration.
Chores
- Updated the Knowledge Ingestion starter project to include the new default column configuration and logic.

coderabbitai · 2025-08-15T00:48:54Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Updates KB ingestion defaults to include an identifier column and modifies ingestion logic to derive page content (and document IDs) from identifier columns when present, otherwise from vectorized content columns. Starter project JSON is updated to reflect the new defaults and component code hash.
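One possible reading of this behaviour, sketched without pandas: the embedded text comes from the vectorized columns, while the `_id` hash is taken from the identifier columns when any have a value and from the text otherwise. The function `row_to_data` and the use of a plain dict in place of a DataFrame row are assumptions made for illustration, not the component's actual code.

```python
import hashlib
from typing import Any


def row_to_data(
    row: dict[str, Any],
    content_cols: list[str],
    identifier_cols: list[str],
) -> dict[str, str]:
    """Build one data object; `row` is a plain dict standing in for a DataFrame row."""

    def join(cols: list[str]) -> str:
        # Skip missing or None values, similar in spirit to a pd.notna check.
        return " ".join(str(row[c]) for c in cols if row.get(c) is not None)

    content_text = join(content_cols)
    # Prefer identifiers as the hash basis; fall back to the content text.
    id_basis = join(identifier_cols) or content_text

    data = {"text": content_text}
    for col, value in row.items():
        if col not in content_cols and value is not None:
            data[col] = str(value)  # metadata kept as simple strings
    data["_id"] = hashlib.sha256(id_basis.encode("utf-8")).hexdigest()
    return data


first = row_to_data({"id": 1, "text": "hello"}, ["text"], ["id"])
second = row_to_data({"id": 1, "text": "changed"}, ["text"], ["id"])
print(first["_id"] == second["_id"])
```

Under this sketch, two rows that share an `id` produce the same `_id` even when their text differs, which is what lets duplicate detection key on the identifier.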

Changes

| Cohort / File(s) | Summary of changes |
| --- | --- |
| KB ingestion logic `src/backend/base/langflow/components/data/kb_ingest.py` | Added default column_config entry for `"id"` as identifier (not vectorized). Updated _convert_df_to_data_objects to build page_content from identifier columns when present; otherwise from vectorized content columns. Adjusts embedded text and ID/hash generation accordingly. |
| Starter project template `src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json` | Updated KBIngestion component block and code_hash. Default column_config now includes `"text"` (vectorize: true) and `"id"` (identifier: true, not vectorized). Mirrors ingestion logic change affecting content formation for embeddings and deduplication. |

Sequence Diagram(s)

sequenceDiagram
  participant UI as UI/Starter Project
  participant KB as KBIngestion Component
  participant Conv as _convert_df_to_data_objects
  participant Embed as Embedder
  participant Store as Vector Store

  UI->>KB: Provide DataFrame + column_config (text + id)
  KB->>Conv: Convert rows to data objects
  alt Identifier columns present
    Conv->>Conv: Build page_content from identifier cols
  else No identifier columns
    Conv->>Conv: Build page_content from vectorized content cols
  end
  Conv->>Embed: Embed page_content
  Embed->>Store: Upsert vectors (IDs/hash from page_content)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

feat: Add support for Ingestion and Retrieval of Knowledge Bases #9088: Similar modification to KB ingestion: page_content derived from identifier columns and default column_config updated.

Suggested labels

bug, size:L, lgtm

Suggested reviewers

deon-sanchez
edwinjosechittilappilly


coderabbitai

Actionable comments posted: 0

🔭 Outside diff range comments (3)

src/backend/base/langflow/components/data/kb_ingest.py (2)

182-201: Validate column config: prevent conflicts and enforce at least one vectorized column

Edge cases to guard:

A column marked both vectorize and identifier (conflicting roles).
No vectorized columns selected (would embed empty strings).

Add validation here to fail fast with actionable messages.

Apply this diff:

     def _validate_column_config(self, df_source: pd.DataFrame) -> list[dict[str, Any]]:
         """Validate column configuration using Structured Output patterns."""
         if not self.column_config:
             msg = "Column configuration cannot be empty"
             raise ValueError(msg)
@@
-        # Validate column names exist in DataFrame
+        # Validate column names exist in DataFrame and config semantics
         df_columns = set(df_source.columns)
+        seen_cols: set[str] = set()
+        vectorized_cols: list[str] = []
         for config in config_list:
             col_name = config.get("column_name")
             if col_name not in df_columns and not self.silent_errors:
                 msg = f"Column '{col_name}' not found in DataFrame. Available columns: {sorted(df_columns)}"
                 self.log(f"Warning: {msg}")
                 raise ValueError(msg)
+            if col_name in seen_cols:
+                msg = f"Duplicate column configuration for '{col_name}'. Please configure each column only once."
+                raise ValueError(msg)
+            seen_cols.add(col_name)
+
+            # Normalize booleans that might come as strings
+            vectorize = config.get("vectorize") == "True" or config.get("vectorize") is True
+            identifier = config.get("identifier") == "True" or config.get("identifier") is True
+
+            if vectorize and identifier:
+                msg = f"Column '{col_name}' cannot be both vectorized and an identifier."
+                raise ValueError(msg)
+            if vectorize:
+                vectorized_cols.append(col_name)
 
-        return config_list
+        if not vectorized_cols:
+            msg = "At least one column must be set to 'Vectorize' to generate embeddings."
+            raise ValueError(msg)
+
+        return config_list
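The checks proposed in the diff above can also be tried outside the component. The following is a standalone sketch under stated assumptions: `validate_column_config` is a hypothetical free function that takes the set of available column names directly, and it always raises rather than consulting `self.silent_errors`.

```python
from typing import Any


def _flag(config: dict[str, Any], key: str) -> bool:
    """Normalize a flag that may arrive as a bool or as the string "True"."""
    value = config.get(key)
    return value is True or value == "True"


def validate_column_config(
    config_list: list[dict[str, Any]],
    df_columns: set[str],
) -> list[dict[str, Any]]:
    """Fail fast on empty, unknown, duplicated, or conflicting column entries."""
    if not config_list:
        raise ValueError("Column configuration cannot be empty")

    seen: set[str] = set()
    vectorized: list[str] = []
    for config in config_list:
        name = config.get("column_name")
        if name not in df_columns:
            raise ValueError(f"Column '{name}' not found. Available: {sorted(df_columns)}")
        if name in seen:
            raise ValueError(f"Duplicate column configuration for '{name}'.")
        seen.add(name)

        vectorize = _flag(config, "vectorize")
        identifier = _flag(config, "identifier")
        if vectorize and identifier:
            raise ValueError(f"Column '{name}' cannot be both vectorized and an identifier.")
        if vectorize:
            vectorized.append(name)

    if not vectorized:
        raise ValueError("At least one column must be set to 'Vectorize'.")
    return config_list


config = [
    {"column_name": "text", "vectorize": True, "identifier": False},
    {"column_name": "id", "vectorize": False, "identifier": True},
]
print(validate_column_config(config, {"text", "id"}) == config)
```

Each failure mode raises a `ValueError` with a message naming the offending column, so a misconfiguration surfaces before any embedding work starts.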

394-445: Clarify identifier vs content roles, strengthen hashing, and use O(1) duplicate checks

Misleading naming, brittle ID construction, and O(N) duplicate checks in kb_ingest.py — apply the patch below.

File: src/backend/base/langflow/components/data/kb_ingest.py (around lines 394–445)

-        # Get all documents and their metadata
-        all_docs = chroma.get()
-
-        # Extract all _id values from metadata
-        id_list = [metadata.get("_id") for metadata in all_docs["metadatas"] if metadata.get("_id")]
+        # Get all documents and their metadata (opt: use include=['metadatas'] if supported)
+        all_docs = chroma.get()
+        # Extract existing _id values and use a set for O(1) lookups
+        existing_ids = [metadata.get("_id") for metadata in all_docs.get("metadatas", []) if metadata.get("_id")]
+        id_set = set(existing_ids)
@@
-        for _, row in df_source.iterrows():
-            # Build content text from identifier columns using list comprehension
-            identifier_parts = [str(row[col]) for col in content_cols if col in row and pd.notna(row[col])]
-
-            # Join all parts into a single string
-            page_content = " ".join(identifier_parts)
-
-            # Build metadata from NON-vectorized columns only (simple key-value pairs)
-            data_dict = {
-                "text": page_content,  # Main content for vectorization
-            }
-
-            # Add identifier columns if they exist
-            if identifier_cols:
-                identifier_parts = [str(row[col]) for col in identifier_cols if col in row and pd.notna(row[col])]
-                page_content = " ".join(identifier_parts)
-
-            # Add metadata columns as simple key-value pairs
-            for col in df_source.columns:
-                if col not in content_cols and col in row and pd.notna(row[col]):
-                    # Convert to simple types for Chroma metadata
-                    value = row[col]
-                    data_dict[col] = str(value)  # Convert complex types to string
-
-            # Hash the page_content for unique ID
-            page_content_hash = hashlib.sha256(page_content.encode()).hexdigest()
-            data_dict["_id"] = page_content_hash
-
-            # If duplicates are disallowed, and hash exists, prevent adding this row
-            if not self.allow_duplicates and page_content_hash in id_list:
-                self.log(f"Skipping duplicate row with hash {page_content_hash}")
-                continue
+        for _, row in df_source.iterrows():
+            # Build content text from vectorized content columns (this is what we embed)
+            content_parts = [str(row[col]) for col in content_cols if col in row and pd.notna(row[col])]
+            content_text = " ".join(content_parts)
+
+            # Main content for vectorization (identifiers intentionally excluded)
+            data_dict = {"text": content_text}
+
+            # Build identifier string (if any) solely for stable hashing/IDs.
+            # Include column names and a non-printable delimiter to avoid collisions.
+            if identifier_cols:
+                id_parts = [
+                    f"{col}={str(row[col])}" for col in identifier_cols if col in row and pd.notna(row[col])
+                ]
+                id_string = "\x1F".join(id_parts)  # unit separator
+            else:
+                id_string = ""
+
+            # Add metadata columns as simple key-value pairs (exclude vectorized content cols)
+            for col in df_source.columns:
+                if col not in content_cols and col in row and pd.notna(row[col]):
+                    data_dict[col] = str(row[col])
+
+            # Hash identifier string when present; otherwise fallback to content_text
+            id_basis = id_string if id_string else content_text
+            row_hash = hashlib.sha256(id_basis.encode("utf-8")).hexdigest()
+            data_dict["_id"] = row_hash
+
+            # If duplicates are disallowed, and hash exists, prevent adding this row (O(1) check)
+            if not self.allow_duplicates and row_hash in id_set:
+                self.log(f"Skipping duplicate row with hash {row_hash}")
+                continue

Notes:

The change clarifies variable intent (content_text vs id_string), makes ID generation robust (column names + unit separator), and replaces O(N) membership tests with O(1) set lookups.
Optional: if chroma.get supports include=["metadatas"], use it to avoid loading unnecessary data.
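A small sketch of why the labelled, delimiter-joined ID string matters and how the set lookup would be used. The helper names `naive_id` and `labelled_id` and the sample values are hypothetical. The scheme still assumes identifier values do not themselves contain the unit separator character.

```python
import hashlib

UNIT_SEPARATOR = "\x1f"


def naive_id(values: list[str]) -> str:
    """Hash of space-joined values, as in the original construction."""
    return hashlib.sha256(" ".join(values).encode("utf-8")).hexdigest()


def labelled_id(pairs: list[tuple[str, str]]) -> str:
    """Hash of 'col=value' parts joined by the unit separator."""
    basis = UNIT_SEPARATOR.join(f"{col}={value}" for col, value in pairs)
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()


# Two different rows whose values happen to join to the same string "a b c".
collide = naive_id(["a b", "c"]) == naive_id(["a", "b c"])
distinct = labelled_id([("first", "a b"), ("second", "c")]) != labelled_id(
    [("first", "a"), ("second", "b c")]
)
print(collide, distinct)

# Set membership keeps the duplicate check at O(1) on average per row.
existing_ids = {labelled_id([("first", "a b"), ("second", "c")])}
is_duplicate = labelled_id([("first", "a b"), ("second", "c")]) in existing_ids
print(is_duplicate)
```

The space-joined form maps both rows to the same hash, while the labelled form keeps them apart.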

src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (1)

862-869: Starter project default column_config still missing the identifier row

The node’s template "column_config.value" only includes the "text" row. This contradicts the new default in the component and will ship a project without the "id" identifier preconfigured.

Update the value to include the "id" identifier row.

Apply this diff:

-                "value": [
-                  {
-                    "column_name": "text",
-                    "identifier": false,
-                    "vectorize": true
-                  }
-                ]
+                "value": [
+                  {
+                    "column_name": "text",
+                    "vectorize": true,
+                    "identifier": false
+                  },
+                  {
+                    "column_name": "id",
+                    "vectorize": false,
+                    "identifier": true
+                  }
+                ]

🧹 Nitpick comments (1)

src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (1)

798-799: Embedded component code updated — keep parity with kb_ingest.py

The embedded code reflects the new identifier handling. If you accept the refactor in kb_ingest.py (content_text/id_string and robust hashing), mirror the same changes here to avoid divergence between the starter project and the installed component.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 226c71b and ad05802.

📒 Files selected for processing (2)

src/backend/base/langflow/components/data/kb_ingest.py (2 hunks)
src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (2 hunks)

🧰 Additional context used

📓 Path-based instructions (3)

src/backend/base/langflow/components/**/*.py

📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)

src/backend/base/langflow/components/**/*.py: Add new backend components to the appropriate subdirectory under src/backend/base/langflow/components/
Implement async component methods using async def and await for asynchronous operations
Use asyncio.create_task for background work in async components and ensure proper cleanup on cancellation
Use asyncio.Queue for non-blocking queue operations in async components and handle timeouts appropriately

Files:

src/backend/base/langflow/components/data/kb_ingest.py

{src/backend/**/*.py,tests/**/*.py,Makefile}

📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)

{src/backend/**/*.py,tests/**/*.py,Makefile}: Run make format_backend to format Python code before linting or committing changes
Run make lint to perform linting checks on backend Python code

Files:

src/backend/base/langflow/components/data/kb_ingest.py

src/backend/**/components/**/*.py

📄 CodeRabbit Inference Engine (.cursor/rules/icons.mdc)

In your Python component class, set the icon attribute to a string matching the frontend icon mapping exactly (case-sensitive).

Files:

src/backend/base/langflow/components/data/kb_ingest.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: test-starter-projects

🔇 Additional comments (2)

src/backend/base/langflow/components/data/kb_ingest.py (1)

139-149: Good default: identifier column added and excluded from vectorization

Adding the "id" column with identifier=True and vectorize=False aligns with the intent to keep IDs out of embeddings. This will help deduplication and stable IDs.

src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (1)

705-707: Code hash updated — OK

The starter project points to the updated component code. Keep the embedded code string in sync with the actual component file when applying follow-up refactors.

tsuliuchao · 2025-08-15T11:27:35Z

Perfect!

edwinjosechittilappilly

Isn't the identifier supposed to be true for text by default?

Co-authored-by: Edwin Jose <edwin.jose@datastax.com>

sonarqubecloud · 2025-08-19T20:30:16Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

edwinjosechittilappilly

LGTM

* fix: Use the identifier column as hash if available
* Update kb_ingest.py
* [autofix.ci] apply automated fixes
* Update src/backend/base/langflow/components/data/kb_ingest.py

  Co-authored-by: Edwin Jose <edwin.jose@datastax.com>
* [autofix.ci] apply automated fixes
* [autofix.ci] apply automated fixes

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Edwin Jose <edwin.jose@datastax.com>

fix: Use the identifier column as hash if available

ad05802

erichare requested review from carlosrcoelho and edwinjosechittilappilly August 15, 2025 00:49

github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Aug 15, 2025

coderabbitai Bot reviewed Aug 15, 2025

View reviewed changes

edwinjosechittilappilly reviewed Aug 15, 2025

View reviewed changes

Comment thread src/backend/base/langflow/components/data/kb_ingest.py Outdated

Update kb_ingest.py

26ab314

github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Aug 15, 2025

[autofix.ci] apply automated fixes

35f258f

github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Aug 15, 2025

erichare enabled auto-merge August 15, 2025 17:46

erichare requested a review from edwinjosechittilappilly August 15, 2025 17:46

Merge branch 'main' into fix-hash-kb-id

78a1165

github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Aug 18, 2025

edwinjosechittilappilly reviewed Aug 19, 2025

View reviewed changes

Comment thread src/backend/base/langflow/components/data/kb_ingest.py Outdated

edwinjosechittilappilly reviewed Aug 19, 2025

View reviewed changes

Update src/backend/base/langflow/components/data/kb_ingest.py

a36f01c

Co-authored-by: Edwin Jose <edwin.jose@datastax.com>

github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Aug 19, 2025

[autofix.ci] apply automated fixes

e6712f0

github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Aug 19, 2025

erichare requested a review from edwinjosechittilappilly August 19, 2025 15:06

Merge branch 'main' into fix-hash-kb-id

5da936e

github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Aug 19, 2025

[autofix.ci] apply automated fixes

d67931f

github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Aug 19, 2025

edwinjosechittilappilly approved these changes Aug 19, 2025

View reviewed changes

erichare added this pull request to the merge queue Aug 19, 2025

github-actions Bot added the lgtm This PR has been approved by a maintainer label Aug 19, 2025

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Aug 19, 2025

edwinjosechittilappilly added this pull request to the merge queue Aug 19, 2025

Merged via the queue into main with commit e63e879 Aug 19, 2025
21 checks passed

edwinjosechittilappilly deleted the fix-hash-kb-id branch August 19, 2025 21:48

coderabbitai Bot mentioned this pull request Aug 20, 2025

feat: Make knowledge bases user-stored and support global vars #9458

Merged

fix: Use the identifier column as hash if available #9405
edwinjosechittilappilly merged 8 commits into main from fix-hash-kb-id

3 participants
