Skip to content

fix: Use the identifier column as hash if available#9405

Merged
edwinjosechittilappilly merged 8 commits into
mainfrom
fix-hash-kb-id
Aug 19, 2025
Merged

fix: Use the identifier column as hash if available#9405
edwinjosechittilappilly merged 8 commits into
mainfrom
fix-hash-kb-id

Conversation

@erichare
Copy link
Copy Markdown
Collaborator

@erichare erichare commented Aug 15, 2025

This pull request updates the knowledge base ingestion process to improve how identifier columns are handled and refines the data object conversion logic. The most important changes are focused on the configuration of input columns and the row conversion logic, ensuring identifiers are properly separated from content intended for vectorization.

Input configuration improvements:

  • Added a new column configuration for id in the NewKnowledgeBaseInput class, marking it as an identifier and not for vectorization. This makes it easier to distinguish between content and unique identifiers during ingestion.

Row conversion logic updates:

  • Refactored the _convert_df_to_data_objects method to explicitly separate identifier columns from content columns when building the main content text and metadata, ensuring identifiers are handled distinctly and not vectorized.

Project configuration update:

  • Updated the code_hash in the Knowledge Ingestion.json starter project to reflect the latest code changes.

Summary by CodeRabbit

  • New Features

    • Knowledge Ingestion now includes a default identifier column (“id”) in the column configuration.
    • Ingestion prioritizes identifier columns for embeddings and duplicate detection when present, falling back to text content otherwise.
    • UI supports identifier-aware setup out of the box for smoother configuration.
  • Chores

    • Updated the Knowledge Ingestion starter project to include the new default column configuration and logic.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Aug 15, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Updates KB ingestion defaults to include an identifier column and modifies ingestion logic to derive page content (and document IDs) from identifier columns when present, otherwise from vectorized content columns. Starter project JSON is updated to reflect the new defaults and component code hash.

Changes

Cohort / File(s) Summary of changes
KB ingestion logic
src/backend/base/langflow/components/data/kb_ingest.py
Added default column_config entry for "id" as identifier (not vectorized). Updated _convert_df_to_data_objects to build page_content from identifier columns when present; otherwise from vectorized content columns. Adjusts embedded text and ID/hash generation accordingly.
Starter project template
src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json
Updated KBIngestion component block and code_hash. Default column_config now includes "text" (vectorize: true) and "id" (identifier: true, not vectorized). Mirrors ingestion logic change affecting content formation for embeddings and deduplication.

Sequence Diagram(s)

sequenceDiagram
  participant UI as UI/Starter Project
  participant KB as KBIngestion Component
  participant Conv as _convert_df_to_data_objects
  participant Embed as Embedder
  participant Store as Vector Store

  UI->>KB: Provide DataFrame + column_config (text + id)
  KB->>Conv: Convert rows to data objects
  alt Identifier columns present
    Conv->>Conv: Build page_content from identifier cols
  else No identifier columns
    Conv->>Conv: Build page_content from vectorized content cols
  end
  Conv->>Embed: Embed page_content
  Embed->>Store: Upsert vectors (IDs/hash from page_content)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

bug, size:L, lgtm

Suggested reviewers

  • deon-sanchez
  • edwinjosechittilappilly
✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix-hash-kb-id

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@github-actions github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Aug 15, 2025
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🔭 Outside diff range comments (3)
src/backend/base/langflow/components/data/kb_ingest.py (2)

182-201: Validate column config: prevent conflicts and enforce at least one vectorized column

Edge cases to guard:

  • A column marked both vectorize and identifier (conflicting roles).
  • No vectorized columns selected (would embed empty strings).

Add validation here to fail fast with actionable messages.

Apply this diff:

     def _validate_column_config(self, df_source: pd.DataFrame) -> list[dict[str, Any]]:
         """Validate column configuration using Structured Output patterns."""
         if not self.column_config:
             msg = "Column configuration cannot be empty"
             raise ValueError(msg)
@@
-        # Validate column names exist in DataFrame
+        # Validate column names exist in DataFrame and config semantics
         df_columns = set(df_source.columns)
+        seen_cols: set[str] = set()
+        vectorized_cols: list[str] = []
         for config in config_list:
             col_name = config.get("column_name")
             if col_name not in df_columns and not self.silent_errors:
                 msg = f"Column '{col_name}' not found in DataFrame. Available columns: {sorted(df_columns)}"
                 self.log(f"Warning: {msg}")
                 raise ValueError(msg)
+            if col_name in seen_cols:
+                msg = f"Duplicate column configuration for '{col_name}'. Please configure each column only once."
+                raise ValueError(msg)
+            seen_cols.add(col_name)
+
+            # Normalize booleans that might come as strings
+            vectorize = config.get("vectorize") == "True" or config.get("vectorize") is True
+            identifier = config.get("identifier") == "True" or config.get("identifier") is True
+
+            if vectorize and identifier:
+                msg = f"Column '{col_name}' cannot be both vectorized and an identifier."
+                raise ValueError(msg)
+            if vectorize:
+                vectorized_cols.append(col_name)
 
-        return config_list
+        if not vectorized_cols:
+            msg = "At least one column must be set to 'Vectorize' to generate embeddings."
+            raise ValueError(msg)
+
+        return config_list

394-445: Clarify identifier vs content roles, strengthen hashing, and use O(1) duplicate checks

Misleading naming, brittle ID construction, and O(N) duplicate checks in kb_ingest.py — apply the patch below.

  • File: src/backend/base/langflow/components/data/kb_ingest.py (around lines 394–445)
-        # Get all documents and their metadata
-        all_docs = chroma.get()
-
-        # Extract all _id values from metadata
-        id_list = [metadata.get("_id") for metadata in all_docs["metadatas"] if metadata.get("_id")]
+        # Get all documents and their metadata (opt: use include=['metadatas'] if supported)
+        all_docs = chroma.get()
+        # Extract existing _id values and use a set for O(1) lookups
+        existing_ids = [metadata.get("_id") for metadata in all_docs.get("metadatas", []) if metadata.get("_id")]
+        id_set = set(existing_ids)
@@
-        for _, row in df_source.iterrows():
-            # Build content text from identifier columns using list comprehension
-            identifier_parts = [str(row[col]) for col in content_cols if col in row and pd.notna(row[col])]
-
-            # Join all parts into a single string
-            page_content = " ".join(identifier_parts)
-
-            # Build metadata from NON-vectorized columns only (simple key-value pairs)
-            data_dict = {
-                "text": page_content,  # Main content for vectorization
-            }
-
-            # Add identifier columns if they exist
-            if identifier_cols:
-                identifier_parts = [str(row[col]) for col in identifier_cols if col in row and pd.notna(row[col])]
-                page_content = " ".join(identifier_parts)
-
-            # Add metadata columns as simple key-value pairs
-            for col in df_source.columns:
-                if col not in content_cols and col in row and pd.notna(row[col]):
-                    # Convert to simple types for Chroma metadata
-                    value = row[col]
-                    data_dict[col] = str(value)  # Convert complex types to string
-
-            # Hash the page_content for unique ID
-            page_content_hash = hashlib.sha256(page_content.encode()).hexdigest()
-            data_dict["_id"] = page_content_hash
-
-            # If duplicates are disallowed, and hash exists, prevent adding this row
-            if not self.allow_duplicates and page_content_hash in id_list:
-                self.log(f"Skipping duplicate row with hash {page_content_hash}")
-                continue
+        for _, row in df_source.iterrows():
+            # Build content text from vectorized content columns (this is what we embed)
+            content_parts = [str(row[col]) for col in content_cols if col in row and pd.notna(row[col])]
+            content_text = " ".join(content_parts)
+
+            # Main content for vectorization (identifiers intentionally excluded)
+            data_dict = {"text": content_text}
+
+            # Build identifier string (if any) solely for stable hashing/IDs.
+            # Include column names and a non-printable delimiter to avoid collisions.
+            if identifier_cols:
+                id_parts = [
+                    f"{col}={str(row[col])}" for col in identifier_cols if col in row and pd.notna(row[col])
+                ]
+                id_string = "\x1F".join(id_parts)  # unit separator
+            else:
+                id_string = ""
+
+            # Add metadata columns as simple key-value pairs (exclude vectorized content cols)
+            for col in df_source.columns:
+                if col not in content_cols and col in row and pd.notna(row[col]):
+                    data_dict[col] = str(row[col])
+
+            # Hash identifier string when present; otherwise fallback to content_text
+            id_basis = id_string if id_string else content_text
+            row_hash = hashlib.sha256(id_basis.encode("utf-8")).hexdigest()
+            data_dict["_id"] = row_hash
+
+            # If duplicates are disallowed, and hash exists, prevent adding this row (O(1) check)
+            if not self.allow_duplicates and row_hash in id_set:
+                self.log(f"Skipping duplicate row with hash {row_hash}")
+                continue

Notes:

  • The change clarifies variable intent (content_text vs id_string), makes ID generation robust (column names + unit separator), and replaces O(N) membership tests with O(1) set lookups.
  • Optional: if chroma.get supports include=["metadatas"], use it to avoid loading unnecessary data.
src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (1)

862-869: Starter project default column_config still missing the identifier row

The node’s template "column_config.value" only includes the "text" row. This contradicts the new default in the component and will ship a project without the "id" identifier preconfigured.

Update the value to include the "id" identifier row.

Apply this diff:

-                "value": [
-                  {
-                    "column_name": "text",
-                    "identifier": false,
-                    "vectorize": true
-                  }
-                ]
+                "value": [
+                  {
+                    "column_name": "text",
+                    "vectorize": true,
+                    "identifier": false
+                  },
+                  {
+                    "column_name": "id",
+                    "vectorize": false,
+                    "identifier": true
+                  }
+                ]
🧹 Nitpick comments (1)
src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (1)

798-799: Embedded component code updated — keep parity with kb_ingest.py

The embedded code reflects the new identifier handling. If you accept the refactor in kb_ingest.py (content_text/id_string and robust hashing), mirror the same changes here to avoid divergence between the starter project and the installed component.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 226c71b and ad05802.

📒 Files selected for processing (2)
  • src/backend/base/langflow/components/data/kb_ingest.py (2 hunks)
  • src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
src/backend/base/langflow/components/**/*.py

📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)

src/backend/base/langflow/components/**/*.py: Add new backend components to the appropriate subdirectory under src/backend/base/langflow/components/
Implement async component methods using async def and await for asynchronous operations
Use asyncio.create_task for background work in async components and ensure proper cleanup on cancellation
Use asyncio.Queue for non-blocking queue operations in async components and handle timeouts appropriately

Files:

  • src/backend/base/langflow/components/data/kb_ingest.py
{src/backend/**/*.py,tests/**/*.py,Makefile}

📄 CodeRabbit Inference Engine (.cursor/rules/backend_development.mdc)

{src/backend/**/*.py,tests/**/*.py,Makefile}: Run make format_backend to format Python code before linting or committing changes
Run make lint to perform linting checks on backend Python code

Files:

  • src/backend/base/langflow/components/data/kb_ingest.py
src/backend/**/components/**/*.py

📄 CodeRabbit Inference Engine (.cursor/rules/icons.mdc)

In your Python component class, set the icon attribute to a string matching the frontend icon mapping exactly (case-sensitive).

Files:

  • src/backend/base/langflow/components/data/kb_ingest.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test-starter-projects
🔇 Additional comments (2)
src/backend/base/langflow/components/data/kb_ingest.py (1)

139-149: Good default: identifier column added and excluded from vectorization

Adding the "id" column with identifier=True and vectorize=False aligns with the intent to keep IDs out of embeddings. This will help deduplication and stable IDs.

src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (1)

705-707: Code hash updated — OK

The starter project points to the updated component code. Keep the embedded code string in sync with the actual component file when applying follow-up refactors.

@tsuliuchao
Copy link
Copy Markdown
Contributor

perpect~!

Comment thread src/backend/base/langflow/components/data/kb_ingest.py Outdated
@github-actions github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Aug 15, 2025
@github-actions github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Aug 15, 2025
@erichare erichare enabled auto-merge August 15, 2025 17:46
@github-actions github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Aug 18, 2025
Comment thread src/backend/base/langflow/components/data/kb_ingest.py Outdated
Copy link
Copy Markdown
Collaborator

@edwinjosechittilappilly edwinjosechittilappilly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn the indentifier to be true for text by default.

Co-authored-by: Edwin Jose <edwin.jose@datastax.com>
@github-actions github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Aug 19, 2025
@github-actions github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Aug 19, 2025
@github-actions github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Aug 19, 2025
@github-actions github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Aug 19, 2025
@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown
Collaborator

@edwinjosechittilappilly edwinjosechittilappilly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@erichare erichare added this pull request to the merge queue Aug 19, 2025
@github-actions github-actions Bot added the lgtm This PR has been approved by a maintainer label Aug 19, 2025
github-merge-queue Bot pushed a commit that referenced this pull request Aug 19, 2025
* fix: Use the identifier column as hash if available

* Update kb_ingest.py

* [autofix.ci] apply automated fixes

* Update src/backend/base/langflow/components/data/kb_ingest.py

Co-authored-by: Edwin Jose <edwin.jose@datastax.com>

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Edwin Jose <edwin.jose@datastax.com>
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Aug 19, 2025
@edwinjosechittilappilly edwinjosechittilappilly added this pull request to the merge queue Aug 19, 2025
Merged via the queue into main with commit e63e879 Aug 19, 2025
21 checks passed
@edwinjosechittilappilly edwinjosechittilappilly deleted the fix-hash-kb-id branch August 19, 2025 21:48
lucaseduoli pushed a commit that referenced this pull request Aug 22, 2025
* fix: Use the identifier column as hash if available

* Update kb_ingest.py

* [autofix.ci] apply automated fixes

* Update src/backend/base/langflow/components/data/kb_ingest.py

Co-authored-by: Edwin Jose <edwin.jose@datastax.com>

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Edwin Jose <edwin.jose@datastax.com>
lucaseduoli pushed a commit that referenced this pull request Aug 25, 2025
* fix: Use the identifier column as hash if available

* Update kb_ingest.py

* [autofix.ci] apply automated fixes

* Update src/backend/base/langflow/components/data/kb_ingest.py

Co-authored-by: Edwin Jose <edwin.jose@datastax.com>

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Edwin Jose <edwin.jose@datastax.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working lgtm This PR has been approved by a maintainer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants