fix: Split text should only have dataframe output#8362
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughThe changes consolidate the outputs of the Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant SplitTextComponent
User->>SplitTextComponent: Call split_text()
SplitTextComponent->>SplitTextComponent: Process and split documents
SplitTextComponent->>SplitTextComponent: Convert to Data objects
SplitTextComponent->>SplitTextComponent: Construct DataFrame from Data objects
SplitTextComponent-->>User: Return DataFrame
Suggested labels
Suggested reviewers
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/backend/base/langflow/components/processing/split_text.py (1)
134-135: Add missing docstring for the split_text method.The implementation correctly returns a DataFrame by wrapping the result of
_docs_to_data(self.split_text_base()). However, the method is missing a docstring as flagged by the static analysis tool.+ def split_text(self) -> DataFrame: + """Split the input text into chunks and return as a DataFrame. + + Returns: + DataFrame: A DataFrame containing the text chunks with metadata. + """ - def split_text(self) -> DataFrame: return DataFrame(self._docs_to_data(self.split_text_base()))🧰 Tools
🪛 Pylint (3.3.7)
[convention] 134-134: Missing function or method docstring
(C0116)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/backend/base/langflow/components/processing/split_text.py(2 hunks)src/backend/tests/unit/components/processing/test_split_text_component.py(9 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/backend/tests/unit/components/processing/test_split_text_component.py (2)
src/backend/base/langflow/components/processing/split_text.py (2)
split_text(134-135)SplitTextComponent(9-135)src/backend/base/langflow/schema/dataframe.py (1)
DataFrame(11-206)
🪛 Pylint (3.3.7)
src/backend/tests/unit/components/processing/test_split_text_component.py
[convention] 58-58: Line too long (113/100)
(C0301)
[convention] 88-88: Line too long (113/100)
(C0301)
[convention] 118-118: Line too long (113/100)
(C0301)
[convention] 200-200: Line too long (101/100)
(C0301)
[convention] 220-220: Line too long (103/100)
(C0301)
[convention] 221-221: Line too long (105/100)
(C0301)
[convention] 222-222: Line too long (107/100)
(C0301)
[convention] 223-223: Line too long (107/100)
(C0301)
[convention] 244-244: Line too long (103/100)
(C0301)
[convention] 245-245: Line too long (105/100)
(C0301)
[convention] 246-246: Line too long (107/100)
(C0301)
[convention] 247-247: Line too long (107/100)
(C0301)
src/backend/base/langflow/components/processing/split_text.py
[convention] 134-134: Missing function or method docstring
(C0116)
🔇 Additional comments (6)
src/backend/base/langflow/components/processing/split_text.py (1)
66-66: LGTM! Output consolidation aligns with PR objectives.The change to a single DataFrame output named "dataframe" properly consolidates the previous separate outputs as intended by the PR.
src/backend/tests/unit/components/processing/test_split_text_component.py (5)
55-67: Excellent test updates for DataFrame interface validation.The test correctly verifies the new DataFrame output interface:
- Proper type assertion for DataFrame
- Row count validation using
len(data_frame)- Column structure verification
- DataFrame indexing with
.iloc[0]["text"]for accessing chunk contentThese changes properly validate the consolidated output format.
🧰 Tools
🪛 Pylint (3.3.7)
[convention] 58-58: Line too long (113/100)
(C0301)
155-162: Well-implemented metadata preservation test.The test correctly uses
data_frame.iterrows()to iterate over DataFrame rows and validates that metadata is preserved in each chunk. This is the proper way to access row data in a DataFrame.
200-200: Consistent DataFrame column access pattern.The test properly uses
results["text"][0]to access the text column data directly, which is an efficient way to access DataFrame column values.🧰 Tools
🪛 Pylint (3.3.7)
[convention] 200-200: Line too long (101/100)
(C0301)
220-223: Proper DataFrame access for multiple inputs test.The test correctly validates multiple text inputs by accessing the DataFrame columns directly with
results["text"][index]pattern, ensuring all chunks are properly created and accessible.🧰 Tools
🪛 Pylint (3.3.7)
[convention] 220-220: Line too long (103/100)
(C0301)
[convention] 221-221: Line too long (105/100)
(C0301)
[convention] 222-222: Line too long (107/100)
(C0301)
[convention] 223-223: Line too long (107/100)
(C0301)
269-269: Correct DataFrame type assertion for URL loader integration.The test properly verifies that the split_text method returns a DataFrame when working with URL loader output, ensuring the integration works with the new consolidated interface.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/backend/base/langflow/initial_setup/starter_projects/Vector Store RAG.json (1)
879-879: Embedded component code updated correctly.The JSON-embedded Python block for
SplitTextComponentnow reflects the singledataframeoutput viasplit_text, with no lingeringas_dataframeor list-based outputs. Ensure that the JSON string literal uses proper escaping to remain valid.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/backend/base/langflow/initial_setup/starter_projects/Pokédex Agent.json(1 hunks)src/backend/base/langflow/initial_setup/starter_projects/Vector Store RAG.json(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: Run Ruff Check and Format
- GitHub Check: Optimize new Python code in this PR
- GitHub Check: Update Starter Projects
- GitHub Check: Ruff Style Check (3.13)
🔇 Additional comments (1)
src/backend/base/langflow/initial_setup/starter_projects/Pokédex Agent.json (1)
925-925: Formatting-only change detectedThis JSON configuration shows a re-serialization/formatting update with no semantic modifications to the
APIRequestComponent. No action is needed.
| "dataType": "SplitText", | ||
| "id": "SplitText-aHhAi", | ||
| "name": "chunks", | ||
| "output_types": [ | ||
| "Data" | ||
| ] | ||
| "output_types": [] | ||
| }, |
There was a problem hiding this comment.
Align output handle metadata with refactored component.
The "name": "chunks" and empty "output_types": [] must be updated to match the new output alias and type, i.e.:
- "name": "chunks",
- "output_types": []
+ "name": "dataframe",
+ "output_types": ["DataFrame"]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "dataType": "SplitText", | |
| "id": "SplitText-aHhAi", | |
| "name": "chunks", | |
| "output_types": [ | |
| "Data" | |
| ] | |
| "output_types": [] | |
| }, | |
| "dataType": "SplitText", | |
| "id": "SplitText-aHhAi", | |
| "name": "dataframe", | |
| "output_types": ["DataFrame"] | |
| }, |
🤖 Prompt for AI Agents
In src/backend/base/langflow/initial_setup/starter_projects/Vector Store
RAG.json around lines 152 to 156, update the "name" and "output_types" fields in
the output handle metadata to match the refactored component's new output alias
and type. Replace "name": "chunks" with the new output alias and set
"output_types" to the correct array reflecting the new output type.
| "id": "reactflow__edge-SplitText-aHhAi{œdataTypeœ:œSplitTextœ,œidœ:œSplitText-aHhAiœ,œnameœ:œchunksœ,œoutput_typesœ:[œDataœ]}-AstraDB-lXzoG{œfieldNameœ:œingest_dataœ,œidœ:œAstraDB-lXzoGœ,œinputTypesœ:[œDataœ,œDataFrameœ],œtypeœ:œotherœ}", | ||
| "selected": false, | ||
| "source": "SplitText-aHhAi", | ||
| "sourceHandle": "{œdataTypeœ: œSplitTextœ, œidœ: œSplitText-aHhAiœ, œnameœ: œchunksœ, œoutput_typesœ: [œDataœ]}", | ||
| "sourceHandle": "{œdataTypeœ: œSplitTextœ, œidœ: œSplitText-aHhAiœ, œnameœ: œchunksœ, œoutput_typesœ: []}", | ||
| "target": "AstraDB-lXzoG", |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Update edge sourceHandle to new output handle.
The JSON-encoded sourceHandle still references "name": "chunks" and an empty output_types. It should reflect the new handle:
- "sourceHandle": "{œdataTypeœ: œSplitTextœ, œidœ: œSplitText-aHhAiœ, œnameœ: œchunksœ, œoutput_typesœ: []}"
+ "sourceHandle": "{œdataTypeœ: œSplitTextœ, œidœ: œSplitText-aHhAiœ, œnameœ: œdataframeœ, œoutput_typesœ: [œDataFrameœ]}"so that the starter project wires into the DataFrame output correctly.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "id": "reactflow__edge-SplitText-aHhAi{œdataTypeœ:œSplitTextœ,œidœ:œSplitText-aHhAiœ,œnameœ:œchunksœ,œoutput_typesœ:[œDataœ]}-AstraDB-lXzoG{œfieldNameœ:œingest_dataœ,œidœ:œAstraDB-lXzoGœ,œinputTypesœ:[œDataœ,œDataFrameœ],œtypeœ:œotherœ}", | |
| "selected": false, | |
| "source": "SplitText-aHhAi", | |
| "sourceHandle": "{œdataTypeœ: œSplitTextœ, œidœ: œSplitText-aHhAiœ, œnameœ: œchunksœ, œoutput_typesœ: [œDataœ]}", | |
| "sourceHandle": "{œdataTypeœ: œSplitTextœ, œidœ: œSplitText-aHhAiœ, œnameœ: œchunksœ, œoutput_typesœ: []}", | |
| "target": "AstraDB-lXzoG", | |
| "id": "reactflow__edge-SplitText-aHhAi{œdataTypeœ:œSplitTextœ,œidœ:œSplitText-aHhAiœ,œnameœ:œchunksœ,œoutput_typesœ:[œDataœ]}-AstraDB-lXzoG{œfieldNameœ:œingest_dataœ,œidœ:œAstraDB-lXzoGœ,œinputTypesœ:[œDataœ,œDataFrameœ],œtypeœ:œotherœ}", | |
| "selected": false, | |
| "source": "SplitText-aHhAi", | |
| "sourceHandle": "{œdataTypeœ: œSplitTextœ, œidœ: œSplitText-aHhAiœ, œnameœ: œdataframeœ, œoutput_typesœ: [œDataFrameœ]}", | |
| "target": "AstraDB-lXzoG", |
🤖 Prompt for AI Agents
In src/backend/base/langflow/initial_setup/starter_projects/Vector Store
RAG.json around lines 167 to 171, the edge's sourceHandle field still references
the old output handle with "name": "chunks" and an empty output_types array.
Update the sourceHandle to reflect the new output handle by changing the "name"
to the correct new output name and setting the "output_types" to include
"DataFrame" so that the wiring correctly connects to the DataFrame output.
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/frontend/tests/core/features/filterSidebar.spec.ts(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (15)
- GitHub Check: Lint Backend / Run Mypy (3.11)
- GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 5
- GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 3
- GitHub Check: Lint Backend / Run Mypy (3.10)
- GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 4
- GitHub Check: Lint Backend / Run Mypy (3.12)
- GitHub Check: Run Frontend Tests / Determine Test Suites and Shard Distribution
- GitHub Check: Lint Backend / Run Mypy (3.13)
- GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 1
- GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 2
- GitHub Check: Run Backend Tests / Integration Tests - Python 3.10
- GitHub Check: Update Starter Projects
- GitHub Check: Run Ruff Check and Format
- GitHub Check: Optimize new Python code in this PR
- GitHub Check: Ruff Style Check (3.13)
| await expect(page.getByTestId("logicSub Flow [Deprecated]")).toBeVisible(); | ||
|
|
||
| await expect(page.getByTestId("processingSplit Text")).toBeVisible(); | ||
| await expect(page.getByTestId("processingData Operations")).toBeVisible(); |
There was a problem hiding this comment.
💡 Verification agent
🧩 Analysis chain
Verify the component rename is complete and consistent.
The test ID was updated from "processingSplit Text" to "processingData Operations", which aligns with the backend refactoring. However, line 147 still tests that "processingSplit Text" should NOT be visible.
This creates a potential inconsistency - if the component was renamed, both assertions might be testing the same logical component. Please clarify:
- Was "Split Text" completely renamed to "Data Operations"?
- Should line 147 be updated or removed?
- Or do both components exist simultaneously for different purposes?
🏁 Script executed:
#!/bin/bash
# Search for any remaining references to "Split Text" component in the frontend codebase
rg -i "split.text" --type ts --type tsx --type js --type jsx -A 2 -B 2Length of output: 98
I’ll recheck without type filters to catch any lingering references:
🏁 Script executed:
#!/bin/bash
# Search for any references to "Split Text" across the repo
rg -n "Split Text"
# Search specifically for the old test ID "processingSplit"
rg -n "processingSplit"Length of output: 9284
Incomplete rename of the “Split Text” component – please update remaining test IDs
Our search shows that while line 136 in filterSidebar.spec.ts was updated to use processingData Operations, there are still hard-coded references to the old processingSplit Text ID in several frontend tests. These will break or become misleading once the UI no longer emits that test ID.
Please update the following locations to use "processingData Operations" (or remove them if the old component no longer exists):
src/frontend/tests/core/features/filterSidebar.spec.ts(line 147)src/frontend/tests/core/features/stop-building.spec.ts(line 47)src/frontend/tests/core/features/freeze.spec.ts(lines 48 and 53)
Also audit any related documentation or backend tests that should reflect the new “Data Operations” name.
🤖 Prompt for AI Agents
In src/frontend/tests/core/features/filterSidebar.spec.ts around lines 136 to
147, the test ID "processingSplit Text" was renamed to "processingData
Operations" on line 136, but line 147 still references the old ID. Update line
147 to use "processingData Operations" instead of "processingSplit Text" to
maintain consistency. Additionally, update similar references in
src/frontend/tests/core/features/stop-building.spec.ts at line 47 and
src/frontend/tests/core/features/freeze.spec.ts at lines 48 and 53. Also review
any related documentation or backend tests to reflect the new component name.
* fix: Split text should only have dataframe output * [autofix.ci] apply automated fixes * Update templates * test fix * Update Vector Store RAG.json * Update starter projects * fix tests * add shards --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com> Co-authored-by: Mike Fortman <michael.fortman@datastax.com> Co-authored-by: cristhianzl <cristhian.lousa@gmail.com>
This pull request renames the output of the split text component, and removes the Data output in favor of dataframe.
Summary by CodeRabbit
New Features
Bug Fixes
Tests