⚡️ Speed up function calculate_text_metrics by 16% in PR #6732 (better-langflow-base) #11332
codeflash-ai[bot] wants to merge 29 commits
…rganization and clarity
…ctionality and organization
… comprehensive support
…ndencies in pyproject.toml
…on and restructure complete installation groups
…for existing packages in pyproject.toml
…local' to 'complete' installation and remove dev extra
… in pyproject.toml
# Conflicts: # pyproject.toml # src/backend/base/pyproject.toml # uv.lock
# Conflicts: # src/lfx/src/lfx/_assets/component_index.json
The optimized code achieves a **16% speedup** (from 46.5ms to 40.0ms) through two key algorithmic improvements:
## **1. Vectorized Regex Word Counting (Primary Optimization)**
**What changed:**
- **Original:** `text_series.str.split().str.len().sum()` - splits every string into a Python list of words, then counts list lengths
- **Optimized:** `text_series.str.count(_WORD_RE).sum()` with precompiled regex `r'\S+'` - counts non-whitespace sequences directly without materializing lists
**Why it's faster:**
The original approach stores an intermediate Python list of words for every row during `.str.split()` and then makes a second pass with `.str.len()`, which adds memory allocation and garbage collection overhead. The optimized version counts regex matches in a single `.str.count()` pass and returns integers directly, so no Series of per-row word lists is kept around.
**Performance impact from profiler:**
- Original word counting: **73.2ms** (42.2% of total time)
- Optimized word counting: **43.5ms** (30.6% of total time)
- **~41% reduction** in this operation alone
The precompiled regex `_WORD_RE` is defined once at module load, eliminating repeated pattern compilation on every call.
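The two word-counting approaches can be compared side by side. This is a minimal sketch on made-up sample data, since the real column contents are not shown in this PR. It passes the pattern as a plain string instead of the precompiled `_WORD_RE` the PR describes.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
text_series = pd.Series(["hello world", "  leading and trailing  ", "", "one"])

# Original approach: build a list of words per row, then take each list's length.
words_via_split = int(text_series.str.split().str.len().sum())

# Optimized approach: count runs of non-whitespace characters per row.
# The PR uses a precompiled pattern; a plain pattern string is assumed here.
WORD_PATTERN = r"\S+"
words_via_count = int(text_series.str.count(WORD_PATTERN).sum())

# Both approaches should agree on the total word count.
assert words_via_split == words_via_count
print(words_via_split, words_via_count)
```

Leading and trailing whitespace and empty strings contribute no words under either approach, which is why the totals match.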
## **2. Set-Based Column Membership Check**
**What changed:**
- **Original:** `if col not in df.columns` - checks membership against pandas Index
- **Optimized:** `columns_set = set(df.columns)` followed by `if col not in columns_set`
**Why it's faster:**
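Membership tests against a Python set are O(1) with very low per-lookup overhead. `col in df.columns` goes through the pandas Index lookup machinery, which is also hash-based but costs more per call, so building the set once and reusing it pays off when several columns are checked.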
**Performance impact from profiler:**
- Original column checks: **2.25ms** (1.3% of total time)
- Optimized column checks: **0.08ms** (0.1% of total time)
- **~96% reduction** in this operation
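The membership pattern described above can be sketched as follows. The DataFrame and the requested column names are hypothetical and only illustrate building the set once and reusing it for every check.

```python
import pandas as pd

# Hypothetical frame and column names for illustration only.
df = pd.DataFrame({"title": ["a"], "body": ["b c"]})
requested = ["title", "body", "summary"]

# Build the set once, then reuse it for every membership check.
columns_set = set(df.columns)

present = [col for col in requested if col in columns_set]
missing = [col for col in requested if col not in columns_set]

print(present, missing)
```

The one-time cost of building the set only pays off when several lookups follow, so a single check would likely see no benefit.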
## **Test Case Performance**
The optimization excels across all test categories:
- **Large-scale tests** (500+ rows): Maximum benefit from vectorized operations avoiding per-row list creation
- **Multiple column tests**: Set-based membership check overhead pays off when checking multiple columns
- **Unicode/emoji tests**: Regex approach handles these correctly while maintaining performance
- **Edge cases** (empty strings, None values): Behavior preserved via `.fillna("")` and regex semantics
The optimization maintains correctness because `\S+` (non-whitespace sequences) matches the same word boundaries as `.split()` for all practical text inputs, while being significantly more efficient at the pandas/numpy vectorization level.
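The equivalence claim can be spot-checked on a few edge cases. This is a sketch with hypothetical sample strings. It covers empty strings, missing values (filled with `""` as the description mentions), mixed whitespace, emoji, and accented text, but it is not an exhaustive proof.

```python
import pandas as pd

# Hypothetical edge-case inputs; None is replaced with "" before counting.
samples = pd.Series(
    ["", None, "  ", "tabs\tand\nnewlines", "emoji 🚀 here", "naïve café"]
)
text_series = samples.fillna("")

# Per-row word counts from the split-based and regex-based approaches.
split_counts = text_series.str.split().str.len().tolist()
regex_counts = text_series.str.count(r"\S+").tolist()

# The two approaches should agree row by row on these inputs.
assert split_counts == regex_counts
print(regex_counts)
```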
Codecov Report
❌ Your patch status has failed because the patch coverage (25.00%) is below the target coverage (40.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## main #11332 +/- ##
==========================================
- Coverage 34.24% 34.24% -0.01%
==========================================
Files 1409 1409
Lines 66929 66936 +7
Branches 9877 9877
==========================================
+ Hits 22918 22919 +1
- Misses 42810 42816 +6
Partials 1201 1201
Closing automated codeflash PR.
⚡️ This pull request contains optimizations for PR #6732
If you approve this dependent PR, these changes will be merged into the original PR branch `better-langflow-base`.
📄 16% (0.16x) speedup for `calculate_text_metrics` in `src/backend/base/langflow/api/v1/knowledge_bases.py`
⏱️ Runtime: 46.5 milliseconds → 40.0 milliseconds (best of 96 runs)
To edit these changes, run `git checkout codeflash/optimize-pr6732-2026-01-16T20.00.34` and push.