feat: add drop duplicates operation to DataFrameOperationsComponent #8665
Walkthrough
A new "Drop Duplicates" operation was added to the DataFrameOperationsComponent, including UI configuration and operation handling. The perform_operation method and related logic were updated to support this operation, and a dedicated drop_duplicates method was introduced. Minor refactoring and logging enhancements were also applied.
Sequence Diagram(s)
sequenceDiagram
participant User
participant DataFrameOperationsComponent
participant Logger
User->>DataFrameOperationsComponent: Selects "Drop Duplicates" operation
DataFrameOperationsComponent->>DataFrameOperationsComponent: update_build_config (show column_name input)
User->>DataFrameOperationsComponent: Triggers perform_operation
DataFrameOperationsComponent->>DataFrameOperationsComponent: Checks selected operation
alt Operation is "Drop Duplicates"
DataFrameOperationsComponent->>DataFrameOperationsComponent: drop_duplicates(df)
DataFrameOperationsComponent-->>User: Returns DataFrame with duplicates dropped
else Unsupported operation
DataFrameOperationsComponent->>Logger: Log error
DataFrameOperationsComponent-->>User: Raises ValueError
end
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/backend/base/langflow/components/processing/dataframe_operations.py (1)
224-225: Consider enhancing for multiple column support.
The implementation is correct for single-column deduplication. Consider enhancing it to support multiple columns, as pandas drop_duplicates() can accept a list of column names.
Potential enhancement:
 def drop_duplicates(self, df: DataFrame) -> DataFrame:
-    return DataFrame(df.drop_duplicates(subset=self.column_name))
+    # Handle both single column (string) and multiple columns (list)
+    subset = [self.column_name] if isinstance(self.column_name, str) else list(self.column_name)
+    return DataFrame(df.drop_duplicates(subset=subset))
This would allow future expansion to support multiple column selection in the UI.
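A minimal sketch of the multi-column idea, assuming plain pandas rather than Langflow's DataFrame wrapper. The helper name `drop_duplicates_by` and the sample data are hypothetical and only illustrate how a single name and a list of names can share one code path:

```python
from typing import Sequence, Union

import pandas as pd


def drop_duplicates_by(frame: pd.DataFrame, columns: Union[str, Sequence[str]]) -> pd.DataFrame:
    """Drop duplicate rows, comparing only the given column(s).

    Hypothetical helper for illustration; not part of Langflow.
    """
    # Normalize a single column name to a one-element list.
    subset = [columns] if isinstance(columns, str) else list(columns)
    return frame.drop_duplicates(subset=subset)


records = pd.DataFrame(
    {"name": ["a", "a", "b", "b"], "city": ["x", "y", "y", "y"], "n": [1, 2, 3, 4]}
)
by_name = drop_duplicates_by(records, "name")
by_name_city = drop_duplicates_by(records, ["name", "city"])
```

With the default `keep="first"`, deduplicating on `name` keeps the first row per name, while deduplicating on `name` and `city` only drops the last row, which repeats an earlier pair.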
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/backend/base/langflow/components/processing/dataframe_operations.py(4 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
src/backend/base/langflow/components/processing/dataframe_operations.py
167-167: Avoid using the generic variable name df for DataFrames
(PD901)
🪛 GitHub Check: Ruff Style Check (3.13)
src/backend/base/langflow/components/processing/dataframe_operations.py
[failure] 167-167: Ruff (PD901)
src/backend/base/langflow/components/processing/dataframe_operations.py:167:9: PD901 Avoid using the generic variable name df for DataFrames
🪛 Pylint (3.3.7)
src/backend/base/langflow/components/processing/dataframe_operations.py
[refactor] 166-166: Too many return statements (10/6)
(R0911)
🪛 GitHub Actions: Ruff Style Check
src/backend/base/langflow/components/processing/dataframe_operations.py
[error] 167-167: PD901 Avoid using the generic variable name df for DataFrames.
🔇 Additional comments (5)
src/backend/base/langflow/components/processing/dataframe_operations.py (5)
2-10: LGTM! Clean import formatting. The import statement formatting improves readability and maintainability.
11-11: Good addition for improved error handling. Adding the logger import supports better error reporting in the perform_operation method.
19-19: Explicit component naming improves clarity. Adding the explicit name attribute enhances component identification and debugging.
31-31: New operation choice properly integrated. The "Drop Duplicates" operation is correctly added to the available choices.
161-162: UI configuration correctly implemented. The dynamic field configuration for "Drop Duplicates" follows the established pattern and properly shows the column_name field.
df = self.df.copy()
op = self.operation

if op == "Filter":
    return self.filter_rows_by_value(df)
if op == "Sort":
    return self.sort_by_column(df)
if op == "Drop Column":
    return self.drop_column(df)
if op == "Rename Column":
    return self.rename_column(df)
if op == "Add Column":
    return self.add_column(df)
if op == "Select Columns":
    return self.select_columns(df)
if op == "Head":
    return self.head(df)
if op == "Tail":
    return self.tail(df)
if op == "Replace Value":
    return self.replace_values(df)
if op == "Drop Duplicates":
    return self.drop_duplicates(df)
msg = f"Unsupported operation: {op}"
logger.error(msg)
raise ValueError(msg)
Address static analysis issues in method refactoring.
While the refactoring adds good error logging and the new operation handling, there are two issues to address:
- Variable naming: The generic variable name df violates pandas best practices (PD901). Consider using a more descriptive name like dataframe_copy or data_frame.
- Method complexity: The method now has 10 return statements, exceeding the recommended threshold of 6. Consider refactoring using a dispatch pattern.
Apply this diff to fix the variable naming issue:
- df = self.df.copy()
+ dataframe_copy = self.df.copy()
  op = self.operation
  if op == "Filter":
-     return self.filter_rows_by_value(df)
+     return self.filter_rows_by_value(dataframe_copy)
And similarly update all other method calls to use dataframe_copy instead of df.
For the complexity issue, consider implementing a dispatch pattern:
+ operation_map = {
+     "Filter": self.filter_rows_by_value,
+     "Sort": self.sort_by_column,
+     "Drop Column": self.drop_column,
+     # ... etc
+ }
+
+ if op in operation_map:
+     return operation_map[op](dataframe_copy)
Committable suggestion skipped: line range outside the PR's diff.
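A self-contained sketch of the dispatch pattern, under stated assumptions: `FrameOperations` is a hypothetical stand-in for the component (not Langflow's actual class), it uses plain pandas, and only two operations are wired up to keep the example short. It shows how a dictionary lookup replaces the chain of `if` statements while keeping the unsupported-operation error:

```python
import pandas as pd


class FrameOperations:
    """Hypothetical stand-in for the component, not Langflow's actual class."""

    def __init__(self, frame: pd.DataFrame, operation: str, column_name: str = "", num_rows: int = 5):
        self.frame = frame
        self.operation = operation
        self.column_name = column_name
        self.num_rows = num_rows

    def head(self, frame: pd.DataFrame) -> pd.DataFrame:
        return frame.head(self.num_rows)

    def drop_duplicates(self, frame: pd.DataFrame) -> pd.DataFrame:
        return frame.drop_duplicates(subset=self.column_name)

    def perform_operation(self) -> pd.DataFrame:
        working_frame = self.frame.copy()
        # Map each operation label to the bound method that handles it.
        operation_map = {
            "Head": self.head,
            "Drop Duplicates": self.drop_duplicates,
        }
        handler = operation_map.get(self.operation)
        if handler is None:
            msg = f"Unsupported operation: {self.operation}"
            raise ValueError(msg)
        # Single return for every supported operation.
        return handler(working_frame)


orders = pd.DataFrame({"customer": ["ann", "ann", "bob"], "total": [10, 20, 30]})
unique_customers = FrameOperations(orders, "Drop Duplicates", column_name="customer").perform_operation()
```

Adding an operation then means adding one dictionary entry, and the method keeps a single return regardless of how many operations exist.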
🤖 Prompt for AI Agents
In src/backend/base/langflow/components/processing/dataframe_operations.py
around lines 167 to 192, rename the variable 'df' to a more descriptive name
like 'dataframe_copy' and update all method calls accordingly to follow pandas
best practices. To reduce method complexity caused by multiple return
statements, refactor the code to use a dispatch pattern by creating a dictionary
that maps operation names to their corresponding methods, then call the
appropriate method via this dictionary, ensuring only one return statement is
used.
Cristhianzl
left a comment
@edwinjosechittilappilly
Can we mark the fields the component needs in order to run as required?
Otherwise the component fails when it runs:
Error building Component DataFrame Operations: Index([''], dtype='object')
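The message above looks like what pandas reports when `subset` names a column that does not exist, here an empty string from an unfilled field. One possible guard, sketched with plain pandas and a hypothetical helper `require_column` (not an existing Langflow function), is to validate the column name before calling `drop_duplicates`:

```python
import pandas as pd


def require_column(frame: pd.DataFrame, column_name: str) -> str:
    """Return column_name if usable, else raise a readable error.

    Hypothetical helper for illustration; not part of Langflow.
    """
    if not column_name:
        raise ValueError("A column name is required for this operation.")
    if column_name not in frame.columns:
        available = list(frame.columns)
        raise ValueError(f"Column {column_name!r} not found; available columns: {available}")
    return column_name


sample = pd.DataFrame({"id": [1, 1, 2]})
deduplicated = sample.drop_duplicates(subset=require_column(sample, "id"))
```

Marking the input as required in the UI, as requested above, would prevent the empty value from reaching this point; a check like this is only a second line of defense that produces a clearer message.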
…angflow-ai#8665) * Update dataframe_operations.py * [autofix.ci] apply automated fixes * Update dataframe_operations.py --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>


Summary by CodeRabbit