feat: add text operations component#11201
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughA new TextOperations component is introduced with dynamic operation selection, flexible input/output configuration, and support for ten text operations including word count, case conversion, text replacement, extraction, head/tail, stripping, joining, cleaning, and markdown table to DataFrame conversion. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Pre-merge checks and finishing touches❌ Failed checks (1 error, 2 warnings)
✅ Passed checks (4 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 8
Fix all issues with AI Agents 🤖
In @src/lfx/src/lfx/components/processing/text_operations.py:
- Line 253: The file src/lfx/src/lfx/components/processing/text_operations.py
contains blank lines with trailing whitespace (notably around the areas
indicated: lines ~253, 302, 308, 325, 331, 348); remove the trailing spaces on
those blank lines so they are truly empty (you can run a whitespace-trimming
command such as sed -i 's/[[:space:]]\+$//'
src/lfx/src/lfx/components/processing/text_operations.py or use your editor’s
trim-trailing-whitespace feature), then verify there are no remaining trailing
spaces on blank lines and commit the cleaned file.
- Around line 594-614: The text_split method is dead/invalid: either remove it
or wire it into the component by adding "Text Split" to OPERATION_CHOICES,
adding inputs for split_delimiter and max_splits to the inputs list, and
updating the dispatcher in process_text to call text_split when operation ==
"Text Split", plus ensure update_build_config and update_outputs handle the new
operation and outputs; alternatively delete text_split if unused. Specifically,
if keeping it, add "Text Split" to OPERATION_CHOICES, declare input fields
split_delimiter (string, default ",") and max_splits (int, default -1) in the
inputs definition, add a branch in process_text to call self.text_split(text)
and set self._result appropriately, and update
update_build_config/update_outputs to include the output schema and persistence
for the Text Split operation so the method is reachable and its inputs are
defined.
- Line 239: The function signature for update_build_config uses a parameter with
default None (field_name: str = None) which must be explicitly typed per PEP
484; update the annotation to use Optional[str] (or str | None for Python 3.10+)
and add the corresponding import from typing if necessary so the signature
becomes field_name: Optional[str] = None (or field_name: str | None = None).
- Around line 707-722: get_data currently calls process_text() again causing
duplicate processing; change it to reuse the cached result in self._result like
get_text/get_dataframe: check operation via get_operation_name(), and for "Word
Count" ensure you only call self.process_text() if self._result is None
(assigning its return to self._result), then build and return the Data object
from self._result (handling dict/list/other cases) instead of calling
process_text() directly.
- Around line 413-416: The try/except block around pd.to_numeric(df[col],
errors='ignore') is using a bare except which can hide system exceptions; change
it to catch only the expected exceptions (ValueError and TypeError) so failures
converting df[col] are handled without swallowing KeyboardInterrupt/SystemExit;
update the except to "except (ValueError, TypeError):" and keep the pass (or
optionally log the conversion failure) while leaving pd.to_numeric and df[col]
intact.
- Around line 671-683: get_dataframe currently re-invokes text_to_dataframe
instead of reusing the already computed result stored on the instance; change
get_dataframe (and specifically the "Text to DataFrame" branch) to check
self._result first and, if it exists and is a DataFrame (or a wrapper DataFrame
type), return that directly; only call self.text_to_dataframe(text) as a
fallback if self._result is absent, and ensure any returned value is
wrapped/converted to the expected DataFrame type to preserve existing behavior.
- Around line 685-705: The get_text method references a non-existent "Text
Format" operation and re-invokes processing causing duplicate work; remove "Text
Format" from the text_operations list and change the logic to use the cached
result (self._result) if present, only calling process_text() when self._result
is None; keep the rest of the list (Case Conversion, Text Replace, Text Extract,
Text Head, Text Tail, Text Strip, Text Join, Text Clean), then format the cached
or newly produced result (handle list vs scalar) into Message(text=...) and
return it, otherwise return an empty Message.
- Around line 1-17: Imports must be alphabetized and grouped (stdlib,
third-party, local) and the deprecated typing aliases removed: delete List and
Dict from the typing import and reorder the import block accordingly; then
replace all occurrences of Dict[str, Any] with dict[str, Any] and all
occurrences of List[str] with list[str] (search for the type hints used inside
the text operation functions and return/parameter annotations such as the
mapping/dataset handlers and any variables that previously used List or Dict) so
the module uses built-in generic types and the imports are sorted.
🧹 Nitpick comments (2)
src/lfx/src/lfx/components/processing/text_operations.py (2)
557-591: Consider simplifying the strip logic.The
elseclauses on lines 571-572 and 581-582 duplicate the "both" case logic. This makes the code harder to maintain.🔎 Proposed refactor
def text_strip(self, text: str) -> str: """Remove whitespace or specific characters from the beginning and/or end of text.""" try: strip_mode = getattr(self, "strip_mode", "both") strip_characters = getattr(self, "strip_characters", "") + # Determine strip function based on mode + strip_fn = { + "both": text.strip, + "left": text.lstrip, + "right": text.rstrip, + }.get(strip_mode, text.strip) + - if strip_characters: - # Strip specific characters - if strip_mode == "both": - result = text.strip(strip_characters) - elif strip_mode == "left": - result = text.lstrip(strip_characters) - elif strip_mode == "right": - result = text.rstrip(strip_characters) - else: - result = text.strip(strip_characters) - else: - # Strip whitespace (default behavior) - if strip_mode == "both": - result = text.strip() - elif strip_mode == "left": - result = text.lstrip() - elif strip_mode == "right": - result = text.rstrip() - else: - result = text.strip() + # Apply strip with or without specific characters + result = strip_fn(strip_characters) if strip_characters else strip_fn() self._result = result removed_chars = len(text) - len(result) self.log(f"Stripped {removed_chars} characters from {strip_mode} side(s)") return result except Exception as e: self.log(f"Error stripping text: {str(e)}") return text
234-237: Remove unused_operation_resultinstance variable.The
_operation_resultattribute is initialized but never used anywhere in the code. Only_resultis actually utilized.🔎 Proposed fix
def __init__(self, **kwargs): super().__init__(**kwargs) self._result = None - self._operation_result = None
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/lfx/src/lfx/components/processing/text_operations.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/lfx/src/lfx/components/processing/text_operations.py (2)
src/lfx/src/lfx/schema/data.py (1)
Data(26-288)src/lfx/src/lfx/schema/message.py (1)
Message(34-315)
🪛 GitHub Actions: Ruff Style Check
src/lfx/src/lfx/components/processing/text_operations.py
[error] 1-1: I001 Import block is un-sorted or un-formatted
🪛 GitHub Check: Ruff Style Check (3.13)
src/lfx/src/lfx/components/processing/text_operations.py
[failure] 348-348: Ruff (W293)
src/lfx/src/lfx/components/processing/text_operations.py:348:1: W293 Blank line contains whitespace
[failure] 331-331: Ruff (W293)
src/lfx/src/lfx/components/processing/text_operations.py:331:1: W293 Blank line contains whitespace
[failure] 325-325: Ruff (W293)
src/lfx/src/lfx/components/processing/text_operations.py:325:1: W293 Blank line contains whitespace
[failure] 308-308: Ruff (W293)
src/lfx/src/lfx/components/processing/text_operations.py:308:1: W293 Blank line contains whitespace
[failure] 302-302: Ruff (W293)
src/lfx/src/lfx/components/processing/text_operations.py:302:1: W293 Blank line contains whitespace
[failure] 253-253: Ruff (W293)
src/lfx/src/lfx/components/processing/text_operations.py:253:1: W293 Blank line contains whitespace
[failure] 239-239: Ruff (RUF013)
src/lfx/src/lfx/components/processing/text_operations.py:239:85: RUF013 PEP 484 prohibits implicit Optional
[failure] 3-3: Ruff (UP035)
src/lfx/src/lfx/components/processing/text_operations.py:3:1: UP035 typing.Dict is deprecated, use dict instead
[failure] 3-3: Ruff (UP035)
src/lfx/src/lfx/components/processing/text_operations.py:3:1: UP035 typing.List is deprecated, use list instead
[failure] 1-17: Ruff (I001)
src/lfx/src/lfx/components/processing/text_operations.py:1:1: I001 Import block is un-sorted or un-formatted
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Update Starter Projects
- GitHub Check: Update Component Index
|
Regarding Bug 1 (Input Format): Regarding Bug 8 (Text Strip not removing tabs): Could you confirm if the input contained actual tab characters (\t) or literal backslash-t strings (\t)? If the issue persists, it might be related to how the upstream component (e.g., Read File) is passing the text. All other reported bugs have been addressed and fixed:
Automated regression tests have been added for all fixes. |
Adding a new component that performs various operations with text.
Operations Available
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.