feat: add text operations component by Empreiteiro · Pull Request #11201 · langflow-ai/langflow

Empreiteiro · 2026-01-05T19:56:09Z

Adds a new component that performs various operations on text.

Operations Available

- **Text to DataFrame**: Parse markdown-style tables into DataFrames
- **Word Count**: Count words, characters, lines in text
- **Case Conversion**: Convert to uppercase, lowercase, title case
- **Text Replace**: Replace text patterns with new values
- **Text Extract**: Extract text matching patterns
- **Text Head**: Extract characters from the beginning of text
- **Text Tail**: Extract characters from the end of text
- **Text Strip**: Remove whitespace or specific characters from edges
- **Text Format**: Format text with padding, alignment, etc.
- **Text Split**: Split text into parts based on delimiters
- **Text Join**: Join text parts with separators
- **Text Clean**: Remove extra whitespace, special characters
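
As a rough illustration of the Text to DataFrame operation, the sketch below parses a pipe-delimited markdown table into headers and row dictionaries using only the standard library. The function name `parse_markdown_table` is hypothetical and is not taken from the PR; the component's actual parsing may differ. The result could be handed to a DataFrame constructor afterwards.

```python
def parse_markdown_table(text: str) -> tuple[list[str], list[dict[str, str]]]:
    """Parse a pipe-delimited markdown table into headers and row dicts.

    Hypothetical helper for illustration; not the PR's implementation.
    """

    def split_row(line: str) -> list[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("expected a header row followed by a separator row")

    headers = split_row(lines[0])
    rows: list[dict[str, str]] = []
    for line in lines[1:]:
        cells = split_row(line)
        # Skip separator rows such as |---|:---:| (cells made only of '-', ':').
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        if len(cells) != len(headers):
            raise ValueError(
                f"row has {len(cells)} cells but header has {len(headers)}"
            )
        rows.append(dict(zip(headers, cells)))
    return headers, rows
```

One simplification to be aware of: a data row whose cells consist only of dashes would be mistaken for a separator row.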

Summary by CodeRabbit

New Features
- Added a text processing toolkit with 10 operations: word/character counting, case conversion, find-and-replace, regex extraction, substring extraction, whitespace trimming, text joining, and text cleaning.
- Convert text to structured table format automatically.
- Dynamic input and output configuration adapts based on the selected operation.

coderabbitai · 2026-01-05T19:56:28Z

Walkthrough

A new TextOperations component is introduced with dynamic operation selection, flexible input/output configuration, and support for ten text operations including word count, case conversion, text replacement, extraction, head/tail, stripping, joining, cleaning, and markdown table to DataFrame conversion.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Text Operations Component `src/lfx/src/lfx/components/processing/text_operations.py` | New TextOperations class with dynamic operation selection via sortable list; configuration methods update_build_config and update_outputs; central process_text dispatcher; ten operation implementations (word_count, case_conversion, text_replace, text_extract, text_head, text_tail, text_strip, text_join, text_clean, text_to_dataframe); result getters (get_dataframe, get_text, get_data, get_message); internal state tracking and logging. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 error, 2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Test Coverage For New Implementations | ❌ Error | The PR introduces a comprehensive TextOperations component with 10+ text processing operations but includes no corresponding unit tests, violating the project's test structure standards. | Create a comprehensive test file at src/backend/tests/unit/components/processing/test_text_operations_component.py with unit tests for each operation method, edge cases, error handling, and result caching mechanism. |
| Test Quality And Coverage | ⚠️ Warning | TextOperations component with 736 lines and 20 public methods has zero test coverage despite project's established pytest patterns and dedicated test directories for similar components. | Create comprehensive pytest test file at src/backend/tests/unit/components/processing/test_text_operations.py covering all 10 text operations, edge cases, error handling, dataframe conversion, and dynamic configuration methods. |
| Test File Naming And Structure | ⚠️ Warning | TextOperations component lacks corresponding unit test file following established pytest pattern. | Create test file at src/lfx/tests/unit/components/processing/test_text_operations.py with comprehensive pytest cases for all 10+ operations, edge cases, and dynamic behavior. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 95.00% which is sufficient. The required threshold is 80.00%. |
| Excessive Mock Usage Warning | ✅ Passed | No test files present in PR, so excessive mock usage check is not applicable. |
| Title check | ✅ Passed | The pull request title 'feat: add text operations component' directly matches the main change: introducing a new TextOperations component with comprehensive text processing capabilities. |

coderabbitai

Actionable comments posted: 8

Fix all issues with AI Agents 🤖

In @src/lfx/src/lfx/components/processing/text_operations.py:
- Line 253: The file src/lfx/src/lfx/components/processing/text_operations.py
contains blank lines with trailing whitespace (notably around the areas
indicated: lines ~253, 302, 308, 325, 331, 348); remove the trailing spaces on
those blank lines so they are truly empty (you can run a whitespace-trimming
command such as sed -i 's/[[:space:]]\+$//'
src/lfx/src/lfx/components/processing/text_operations.py or use your editor’s
trim-trailing-whitespace feature), then verify there are no remaining trailing
spaces on blank lines and commit the cleaned file.
- Around line 594-614: The text_split method is dead/invalid: either remove it
or wire it into the component by adding "Text Split" to OPERATION_CHOICES,
adding inputs for split_delimiter and max_splits to the inputs list, and
updating the dispatcher in process_text to call text_split when operation ==
"Text Split", plus ensure update_build_config and update_outputs handle the new
operation and outputs; alternatively delete text_split if unused. Specifically,
if keeping it, add "Text Split" to OPERATION_CHOICES, declare input fields
split_delimiter (string, default ",") and max_splits (int, default -1) in the
inputs definition, add a branch in process_text to call self.text_split(text)
and set self._result appropriately, and update
update_build_config/update_outputs to include the output schema and persistence
for the Text Split operation so the method is reachable and its inputs are
defined.
- Line 239: The function signature for update_build_config uses a parameter with
default None (field_name: str = None) which must be explicitly typed per PEP
484; update the annotation to use Optional[str] (or str | None for Python 3.10+)
and add the corresponding import from typing if necessary so the signature
becomes field_name: Optional[str] = None (or field_name: str | None = None).
- Around line 707-722: get_data currently calls process_text() again causing
duplicate processing; change it to reuse the cached result in self._result like
get_text/get_dataframe: check operation via get_operation_name(), and for "Word
Count" ensure you only call self.process_text() if self._result is None
(assigning its return to self._result), then build and return the Data object
from self._result (handling dict/list/other cases) instead of calling
process_text() directly.
- Around line 413-416: The try/except block around pd.to_numeric(df[col],
errors='ignore') is using a bare except which can hide system exceptions; change
it to catch only the expected exceptions (ValueError and TypeError) so failures
converting df[col] are handled without swallowing KeyboardInterrupt/SystemExit;
update the except to "except (ValueError, TypeError):" and keep the pass (or
optionally log the conversion failure) while leaving pd.to_numeric and df[col]
intact.
- Around line 671-683: get_dataframe currently re-invokes text_to_dataframe
instead of reusing the already computed result stored on the instance; change
get_dataframe (and specifically the "Text to DataFrame" branch) to check
self._result first and, if it exists and is a DataFrame (or a wrapper DataFrame
type), return that directly; only call self.text_to_dataframe(text) as a
fallback if self._result is absent, and ensure any returned value is
wrapped/converted to the expected DataFrame type to preserve existing behavior.
- Around line 685-705: The get_text method references a non-existent "Text
Format" operation and re-invokes processing causing duplicate work; remove "Text
Format" from the text_operations list and change the logic to use the cached
result (self._result) if present, only calling process_text() when self._result
is None; keep the rest of the list (Case Conversion, Text Replace, Text Extract,
Text Head, Text Tail, Text Strip, Text Join, Text Clean), then format the cached
or newly produced result (handle list vs scalar) into Message(text=...) and
return it, otherwise return an empty Message.
- Around line 1-17: Imports must be alphabetized and grouped (stdlib,
third-party, local) and the deprecated typing aliases removed: delete List and
Dict from the typing import and reorder the import block accordingly; then
replace all occurrences of Dict[str, Any] with dict[str, Any] and all
occurrences of List[str] with list[str] (search for the type hints used inside
the text operation functions and return/parameter annotations such as the
mapping/dataset handlers and any variables that previously used List or Dict) so
the module uses built-in generic types and the imports are sorted.
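
Several of the comments above (get_data, get_dataframe, get_text) ask for the same thing: reuse the result cached in `self._result` instead of re-running the processing for every output. A minimal sketch of that pattern follows. `CachedProcessor`, `_cached_result`, and the `calls` counter are hypothetical names for illustration; this is a toy stand-in, not the Langflow component.

```python
from typing import Any


class CachedProcessor:
    """Toy stand-in showing compute-once, read-many result caching."""

    def __init__(self, text: str) -> None:
        self.text = text
        self._result: Any = None
        self.calls = 0  # counts how often the expensive step actually runs

    def process_text(self) -> dict[str, int]:
        # Stand-in for the real dispatcher; here it only counts words/chars.
        self.calls += 1
        return {
            "word_count": len(self.text.split()),
            "char_count": len(self.text),
        }

    def _cached_result(self) -> dict[str, int]:
        # Only process when nothing has been cached yet.
        if self._result is None:
            self._result = self.process_text()
        return self._result

    def get_data(self) -> dict[str, int]:
        # Return a copy so callers cannot mutate the cached value.
        return dict(self._cached_result())

    def get_text(self) -> str:
        return str(self._cached_result()["word_count"])
```

Using `None` as the "not computed yet" marker assumes that `None` is never a legitimate result; if it can be, a dedicated sentinel object would be a safer choice.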

🧹 Nitpick comments (2)

src/lfx/src/lfx/components/processing/text_operations.py (2)

557-591: Consider simplifying the strip logic.

The else clauses on lines 571-572 and 581-582 duplicate the "both" case logic. This makes the code harder to maintain.

🔎 Proposed refactor

     def text_strip(self, text: str) -> str:
         """Remove whitespace or specific characters from the beginning and/or end of text."""
         try:
             strip_mode = getattr(self, "strip_mode", "both")
             strip_characters = getattr(self, "strip_characters", "")
             
+            # Determine strip function based on mode
+            strip_fn = {
+                "both": text.strip,
+                "left": text.lstrip,
+                "right": text.rstrip,
+            }.get(strip_mode, text.strip)
+            
-            if strip_characters:
-                # Strip specific characters
-                if strip_mode == "both":
-                    result = text.strip(strip_characters)
-                elif strip_mode == "left":
-                    result = text.lstrip(strip_characters)
-                elif strip_mode == "right":
-                    result = text.rstrip(strip_characters)
-                else:
-                    result = text.strip(strip_characters)
-            else:
-                # Strip whitespace (default behavior)
-                if strip_mode == "both":
-                    result = text.strip()
-                elif strip_mode == "left":
-                    result = text.lstrip()
-                elif strip_mode == "right":
-                    result = text.rstrip()
-                else:
-                    result = text.strip()
+            # Apply strip with or without specific characters
+            result = strip_fn(strip_characters) if strip_characters else strip_fn()
             
             self._result = result
             removed_chars = len(text) - len(result)
             self.log(f"Stripped {removed_chars} characters from {strip_mode} side(s)")
             return result
             
         except Exception as e:
             self.log(f"Error stripping text: {str(e)}")
             return text
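
The dictionary-dispatch idea in the proposed refactor can be tried in isolation. The sketch below pulls it out into a free function; the name `strip_text` and its parameters are hypothetical, and the logging and error handling from the component method are left out.

```python
def strip_text(text: str, mode: str = "both", characters: str = "") -> str:
    """Strip whitespace, or the given characters, from one or both ends."""
    # Bound methods of the input string, keyed by mode.
    # Unknown modes fall back to stripping both sides.
    strip_fn = {
        "both": text.strip,
        "left": text.lstrip,
        "right": text.rstrip,
    }.get(mode, text.strip)

    # An empty `characters` string means "strip whitespace".
    return strip_fn(characters) if characters else strip_fn()
```

For example, `strip_text("xxhixx", "left", "x")` leaves the trailing characters in place, and an unrecognized mode behaves like `"both"`.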

234-237: Remove unused _operation_result instance variable.

The _operation_result attribute is initialized but never used anywhere in the code. Only _result is actually utilized.

🔎 Proposed fix

     def __init__(self, **kwargs):
         super().__init__(**kwargs)
         self._result = None
-        self._operation_result = None

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between fdc1b3b and 0866b86.

📒 Files selected for processing (1)

src/lfx/src/lfx/components/processing/text_operations.py

🧰 Additional context used

🧬 Code graph analysis (1)

src/lfx/src/lfx/components/processing/text_operations.py (2)

src/lfx/src/lfx/schema/data.py (1)

Data (26-288)

src/lfx/src/lfx/schema/message.py (1)

Message (34-315)

🪛 GitHub Actions: Ruff Style Check

src/lfx/src/lfx/components/processing/text_operations.py

[error] 1-1: I001 Import block is un-sorted or un-formatted

🪛 GitHub Check: Ruff Style Check (3.13)

src/lfx/src/lfx/components/processing/text_operations.py

[failure] 348-348: Ruff (W293)
src/lfx/src/lfx/components/processing/text_operations.py:348:1: W293 Blank line contains whitespace

[failure] 331-331: Ruff (W293)
src/lfx/src/lfx/components/processing/text_operations.py:331:1: W293 Blank line contains whitespace

[failure] 325-325: Ruff (W293)
src/lfx/src/lfx/components/processing/text_operations.py:325:1: W293 Blank line contains whitespace

[failure] 308-308: Ruff (W293)
src/lfx/src/lfx/components/processing/text_operations.py:308:1: W293 Blank line contains whitespace

[failure] 302-302: Ruff (W293)
src/lfx/src/lfx/components/processing/text_operations.py:302:1: W293 Blank line contains whitespace

[failure] 253-253: Ruff (W293)
src/lfx/src/lfx/components/processing/text_operations.py:253:1: W293 Blank line contains whitespace

[failure] 239-239: Ruff (RUF013)
src/lfx/src/lfx/components/processing/text_operations.py:239:85: RUF013 PEP 484 prohibits implicit Optional

[failure] 3-3: Ruff (UP035)
src/lfx/src/lfx/components/processing/text_operations.py:3:1: UP035 typing.Dict is deprecated, use dict instead

[failure] 3-3: Ruff (UP035)
src/lfx/src/lfx/components/processing/text_operations.py:3:1: UP035 typing.List is deprecated, use list instead

[failure] 1-17: Ruff (I001)
src/lfx/src/lfx/components/processing/text_operations.py:1:1: I001 Import block is un-sorted or un-formatted

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Update Starter Projects
GitHub Check: Update Component Index

Cristhianzl

lgtm

Cristhianzl · 2026-01-09T11:52:10Z

@daniellicnerski1

Regarding Bug 1 (Input Format):
The component is already using MessageTextInput. Could you verify if you're testing the latest version?

  MessageTextInput(
      name="text_input",
      display_name="Text Input",
      info="The input text to process.",
      required=True,
  ),

Regarding Bug 8 (Text Strip not removing tabs):
Python's strip() method removes all whitespace by default, including tabs. The test "\t\thello world\t\t".strip() correctly returns "hello world".

Could you confirm if the input contained actual tab characters (\t) or literal backslash-t strings (\\t)? If the issue persists, it might be related to how the upstream component (e.g., Read File) is passing the text.
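
The difference between the two cases can be checked in plain Python. In the sketch below, `decode_escaped_tabs` is a hypothetical helper showing one way an upstream step could convert the two-character sequence into a real tab before stripping; it is not part of the component.

```python
# Real tab characters: strip() treats them as whitespace.
real_tabs = "\t\thello world\t\t"

# Literal backslash followed by 't': two ordinary characters, not whitespace.
literal_backslash_t = "\\t\\thello world\\t\\t"

stripped_real = real_tabs.strip()
stripped_literal = literal_backslash_t.strip()


def decode_escaped_tabs(text: str) -> str:
    """Turn the two-character sequence backslash-t into a real tab."""
    return text.replace("\\t", "\t")


stripped_decoded = decode_escaped_tabs(literal_backslash_t).strip()
```

Here `stripped_real` and `stripped_decoded` are both `"hello world"`, while `stripped_literal` is unchanged from its input.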

All other reported bugs have been addressed and fixed:

2: Word Count now returns zeros for empty text
3: Text Extract raises clear error for invalid regex
4, 5, 6, 7: Text Head/Tail now validates non-negative values
9: Text Join works correctly with empty first input
10: Text Clean now removes ALL special characters consistently
11: Text to DataFrame validates header/data column count with clear error message

Automated regression tests have been added for all fixes.
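
The validation described for Text Head/Tail and Text Extract might look roughly like the sketch below. The function names, signatures, and error messages are assumptions for illustration; the merged code and its regression tests are not reproduced on this page.

```python
import re


def text_head(text: str, count: int) -> str:
    """Return the first `count` characters; reject negative counts."""
    if count < 0:
        raise ValueError("Head count must be non-negative")
    return text[:count]


def text_tail(text: str, count: int) -> str:
    """Return the last `count` characters; reject negative counts."""
    if count < 0:
        raise ValueError("Tail count must be non-negative")
    # text[-0:] would return the whole string, so handle zero explicitly.
    return text[-count:] if count else ""


def text_extract(text: str, pattern: str) -> list[str]:
    """Return all full matches; surface a clear error for invalid regex."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        raise ValueError(f"Invalid regex pattern {pattern!r}: {exc}") from exc
    # group(0) gives the whole match even when the pattern has groups.
    return [match.group(0) for match in compiled.finditer(text)]
```

The explicit zero check in `text_tail` matters because slicing with `-0` is the same as slicing from index 0.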

Create text_operations.py

0866b86

Empreiteiro requested review from carlosrcoelho and daniellicnerski1 January 5, 2026 19:56

Empreiteiro assigned rodrigosnader and Empreiteiro Jan 5, 2026

[autofix.ci] apply automated fixes

e5fd28d

coderabbitai Bot reviewed Jan 5, 2026

[autofix.ci] apply automated fixes (attempt 2/3)

f4ce371

Empreiteiro removed the request for review from daniellicnerski1 January 5, 2026 21:26

Cristhianzl added 2 commits January 6, 2026 15:08

text operations refactor code

319b596

add missing operation fields

004121b

Cristhianzl changed the title ~~Create text_operations.py~~ feat: add text operations component Jan 6, 2026

merge fix

84b32d5

Cristhianzl approved these changes Jan 6, 2026

Cristhianzl enabled auto-merge January 6, 2026 18:14

github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Jan 6, 2026

[autofix.ci] apply automated fixes

11a07ee

Merge branch 'main' into text-operations

23074ac

[autofix.ci] apply automated fixes

d861584

Cristhianzl added this pull request to the merge queue Jan 6, 2026

Cristhianzl removed this pull request from the merge queue due to a manual request Jan 6, 2026

Merge branch 'main' into text-operations

5ca9206

github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Jan 8, 2026

[autofix.ci] apply automated fixes

9edb614

github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Jan 8, 2026

Cristhianzl added 3 commits January 9, 2026 15:20

change to MultilineInput

219d4dc

pull changes

4ef28a4

merge fix

f7aa7df