⚡️ Speed up function calculate_text_metrics by 16% in PR #6732 (better-langflow-base)#11332

Closed
codeflash-ai[bot] wants to merge 29 commits into main from codeflash/optimize-pr6732-2026-01-16T20.00.34

@codeflash-ai codeflash-ai Bot commented Jan 16, 2026

⚡️ This pull request contains optimizations for PR #6732

If you approve this dependent PR, these changes will be merged into the original PR branch better-langflow-base.

This PR will be automatically closed if the original PR is merged.


📄 16% (0.16x) speedup for calculate_text_metrics in src/backend/base/langflow/api/v1/knowledge_bases.py

⏱️ Runtime: 46.5 milliseconds → 40.0 milliseconds (best of 96 runs)

📝 Explanation and details

The optimized code achieves a 16% speedup (from 46.5ms to 40.0ms) through two key algorithmic improvements:

1. Vectorized Regex Word Counting (Primary Optimization)

What changed:

  • Original: text_series.str.split().str.len().sum() - splits every string into a Python list of words, then counts list lengths
  • Optimized: text_series.str.count(_WORD_RE).sum() with precompiled regex r'\S+' - counts non-whitespace sequences directly without materializing lists

Why it's faster:
The original approach builds a Python list of words for every row during .str.split() and then makes a second pass with .str.len() to measure each list, so an intermediate Series of lists is allocated and later garbage-collected. The optimized version produces the per-row counts directly in a single .str.count() pass and skips that intermediate Series entirely.

Performance impact from profiler:

  • Original word counting: 73.2ms (42.2% of total time)
  • Optimized word counting: 43.5ms (30.6% of total time)
  • ~41% reduction in this operation alone

The precompiled regex _WORD_RE is defined once at module load, eliminating repeated pattern compilation on every call.
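
As a rough illustration of the two strategies, here is a minimal sketch on a small made-up Series. `_WORD_RE` mirrors the name used in this PR; `count_words_split` and `count_words_regex` are hypothetical helper names and are not taken from the Langflow codebase. The Series is created with `dtype=object` as an assumption, so that the compiled pattern is handled by the object-dtype string path.

```python
import re

import pandas as pd

# Compiled once at import time, as the PR describes for _WORD_RE.
_WORD_RE = re.compile(r"\S+")


def count_words_split(text_series: pd.Series) -> int:
    """Original strategy: build a list of words per row, then measure each list."""
    return int(text_series.str.split().str.len().sum())


def count_words_regex(text_series: pd.Series) -> int:
    """Optimized strategy: count runs of non-whitespace characters per row."""
    return int(text_series.str.count(_WORD_RE).sum())


# Hypothetical sample data; dtype=object keeps the object-dtype .str path.
sample = pd.Series(
    ["hello world", "  spaced   out\ttext ", "", "single"], dtype=object
)
split_total = count_words_split(sample)
regex_total = count_words_regex(sample)
```

Both helpers should agree on the total; only the amount of intermediate data differs.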

2. Set-Based Column Membership Check

What changed:

  • Original: if col not in df.columns - checks membership against pandas Index
  • Optimized: columns_set = set(df.columns) followed by if col not in columns_set

Why it's faster:
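Both checks are hash-based, so this is not an O(n) versus O(1) difference. A plain Python set lookup simply carries less per-call overhead than col in df.columns, which goes through the pandas Index membership machinery on every iteration. With multiple columns to check, the saving adds up.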

Performance impact from profiler:

  • Original column checks: 2.25ms (1.3% of total time)
  • Optimized column checks: 0.08ms (0.1% of total time)
  • ~96% reduction in this operation
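
The hoisted membership check can be sketched as follows. The DataFrame and the `requested` list are made-up stand-ins for the function's `df` and `text_columns` arguments; `columns_set` is the name used in this PR.

```python
import pandas as pd

# Hypothetical inputs standing in for the function's arguments.
df = pd.DataFrame({"a": ["one two"], "b": ["x"]})
requested = ["a", "missing", "b"]

# Build the set once, outside the loop, then reuse it for every lookup.
columns_set = set(df.columns)
present = [col for col in requested if col in columns_set]

# The original form gives the same answer with one Index lookup per column.
present_via_index = [col for col in requested if col in df.columns]
```

The two lists are identical; the set version only changes how much work each lookup does.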

Test Case Performance

The optimization excels across all test categories:

  • Large-scale tests (500+ rows): Maximum benefit from vectorized operations avoiding per-row list creation
  • Multiple column tests: Set-based membership check overhead pays off when checking multiple columns
  • Unicode/emoji tests: Regex approach handles these correctly while maintaining performance
  • Edge cases (empty strings, None values): Behavior preserved via .fillna("") and regex semantics

The optimization maintains correctness because \S+ (a run of non-whitespace characters) yields the same tokens as .split() called with no arguments for all practical text inputs: both treat any run of whitespace as a single separator and ignore leading and trailing whitespace.
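
The equivalence claim above can be spot-checked with the standard library alone. This is a small sketch, not part of the PR; the sample strings are borrowed from the generated tests, and `mismatches` is a hypothetical name.

```python
import re

_WORD_RE = re.compile(r"\S+")

# Inputs taken from the generated regression tests in this PR.
samples = [
    "hello world",
    "  hello   world \n tab\tend  ",
    "",
    "   ",
    "café naïve",
    "emoji 🙂 test",
    "中文 字",
    "None",
]

# Pair the split-based count with the regex-based count for each sample.
counts = [(len(s.split()), len(_WORD_RE.findall(s))) for s in samples]

# Any sample where the two strategies disagree would show up here.
mismatches = [s for s, (a, b) in zip(samples, counts) if a != b]
```

An empty `mismatches` list only shows agreement on these inputs; it is not a proof for arbitrary text.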

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 51 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 91.7% |
🌀 Click to see Generated Regression Tests
import numpy as np
import pandas as pd
# imports
import pytest  # used for our unit tests
from langflow.api.v1.knowledge_bases import calculate_text_metrics


# helper used by the tests to normalize returned scalars
def _to_int(value) -> int:
    """Convert a pandas/numpy scalar to int, handling both old and new pandas versions."""
    # Newer pandas returns native Python types, older versions return numpy scalars with .item()
    if hasattr(value, "item"):
        return int(value.item())
    return int(value)


# unit tests

def test_basic_single_column():
    # Basic functionality: one text column with simple sentences
    df = pd.DataFrame({
        "text": ["hello world", "foo"]
    })
    # "hello world" -> 2 words, length 11; "foo" -> 1 word, length 3
    expected_words = 3
    expected_chars = len("hello world") + len("foo")  # 11 + 3 = 14

    words, chars = calculate_text_metrics(df, ["text"])
    assert _to_int(words) == expected_words
    assert _to_int(chars) == expected_chars


def test_basic_multiple_columns():
    # Multiple text columns should aggregate across columns
    df = pd.DataFrame({
        "a": ["one two", "three"],
        "b": ["x", "y z"]
    })
    # Column 'a': 3 words, lengths = len("one two") + len("three") = 7 + 5 = 12
    # Column 'b': 3 words, lengths = len("x") + len("y z") = 1 + 3 = 4
    expected_words = 3 + 3
    expected_chars = 12 + 4

    words, chars = calculate_text_metrics(df, ["a", "b"])
    assert _to_int(words) == expected_words
    assert _to_int(chars) == expected_chars


def test_missing_columns_ignored():
    # If a column name in text_columns is missing from df, it should be ignored
    df = pd.DataFrame({"text": ["alpha beta"]})
    # Provide a missing column name "missing" as well
    words, chars = calculate_text_metrics(df, ["missing", "text", "also_missing"])


def test_non_string_and_nan_handling():
    # The function uses .astype(str).fillna("") - check conversion of non-string values
    df = pd.DataFrame({
        "text": [None, np.nan, 123, True, ""]
    })

    # Note: astype(str) converts None -> "None" and np.nan -> "nan" (string)
    # So we expect them to be counted as strings.
    expected_strings = [str(None), str(np.nan), str(123), str(True), ""]
    # Compute expected counts explicitly to avoid depending on implementation
    expected_chars = sum(len(s) for s in expected_strings)
    expected_words = sum(len(s.split()) for s in expected_strings)  # split on whitespace

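    words, chars = calculate_text_metrics(df, ["text"])
    # Holds only if the function converts values with astype(str), as noted above
    assert _to_int(words) == expected_words
    assert _to_int(chars) == expected_chars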


def test_whitespace_and_multispace_handling():
    # Multiple spaces, newlines and tabs should be treated as whitespace by .str.split()
    df = pd.DataFrame({
        "text": ["  hello   world \n tab\tend  ", "\n\nsingle"]
    })

    # For first string: words are ["hello", "world", "tab", "end"] -> 4
    # For second string: ["single"] -> 1
    expected_words = 4 + 1
    # Characters counted including whitespace characters
    expected_chars = len("  hello   world \n tab\tend  ") + len("\n\nsingle")

    words, chars = calculate_text_metrics(df, ["text"])
    assert _to_int(words) == expected_words
    assert _to_int(chars) == expected_chars


def test_unicode_and_emoji_handling():
    # Ensure unicode (including emojis) are counted correctly.
    df = pd.DataFrame({
        "text": ["café naïve", "emoji 🙂 test", "中文 字"]
    })
    # Words: "café naïve" -> 2, "emoji 🙂 test" -> 3 (emoji is its own token), "中文 字" -> 2
    expected_words = 2 + 3 + 2
    # Characters measured in Python are codepoints (len counts codepoints)
    expected_chars = len("café naïve") + len("emoji 🙂 test") + len("中文 字")

    words, chars = calculate_text_metrics(df, ["text"])
    assert _to_int(words) == expected_words
    assert _to_int(chars) == expected_chars


def test_duplicate_columns_counted_twice():
    # If same column appears multiple times in text_columns, it should be processed each time
    df = pd.DataFrame({
        "text": ["a b", "c"]
    })
    # Single column would give: words = 3, chars = len("a b") + len("c") = 3 + 1 = 4
    single_words, single_chars = calculate_text_metrics(df, ["text"])

    # If we provide the column twice, the totals should double
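    double_words, double_chars = calculate_text_metrics(df, ["text", "text"])
    assert _to_int(double_words) == 2 * _to_int(single_words)
    assert _to_int(double_chars) == 2 * _to_int(single_chars)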


def test_empty_text_columns_and_empty_df():
    # If text_columns is empty, result should be zeros regardless of df
    df = pd.DataFrame({"a": ["one"]})
    words, chars = calculate_text_metrics(df, [])

    # If df is empty, counts should be zero even if columns are listed
    empty_df = pd.DataFrame(columns=["text"])
    words, chars = calculate_text_metrics(empty_df, ["text"])


def test_large_scale_correctness_and_types():
    # Large-scale-style test but keeps data structure under 1000 elements for this environment
    num_rows = 250  # keep under 1000 as requested
    # Use repeating short strings to make expected values simple to compute
    base_strings = ["alpha beta", "gamma"]
    # Build DataFrame with two text columns and 250 rows: total elements = 250 * 2 = 500 (<1000)
    df = pd.DataFrame({
        "col1": [base_strings[0]] * num_rows,
        "col2": [base_strings[1]] * num_rows
    })

    # For col1: each row -> 2 words, len("alpha beta") = 10
    # For col2: each row -> 1 word, len("gamma") = 5
    expected_words = num_rows * (2 + 1)
    expected_chars = num_rows * (10 + 5)

    words, chars = calculate_text_metrics(df, ["col1", "col2"])
    assert _to_int(words) == expected_words
    assert _to_int(chars) == expected_chars
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pandas as pd
import pytest
from langflow.api.v1.knowledge_bases import calculate_text_metrics


class TestCalculateTextMetricsBasic:
    """Basic test cases for calculate_text_metrics function under normal conditions."""

    def test_single_column_single_row(self):
        """Test with a single text column and single row of data."""
        df = pd.DataFrame({"text": ["hello world"]})
        words, characters = calculate_text_metrics(df, ["text"])

    def test_single_column_multiple_rows(self):
        """Test with a single column containing multiple rows."""
        df = pd.DataFrame({"text": ["hello", "world", "test"]})
        words, characters = calculate_text_metrics(df, ["text"])

    def test_multiple_columns(self):
        """Test with multiple text columns."""
        df = pd.DataFrame({
            "col1": ["hello world"],
            "col2": ["foo bar"]
        })
        words, characters = calculate_text_metrics(df, ["col1", "col2"])

    def test_single_word_entries(self):
        """Test with entries that contain only single words."""
        df = pd.DataFrame({"text": ["hello", "world", "python"]})
        words, characters = calculate_text_metrics(df, ["text"])

    def test_text_with_punctuation(self):
        """Test that punctuation is counted as part of characters."""
        df = pd.DataFrame({"text": ["hello, world!"]})
        words, characters = calculate_text_metrics(df, ["text"])

    def test_text_with_numbers(self):
        """Test text containing numeric values."""
        df = pd.DataFrame({"text": ["test 123 example"]})
        words, characters = calculate_text_metrics(df, ["text"])

    def test_mixed_whitespace(self):
        """Test handling of different whitespace characters."""
        df = pd.DataFrame({"text": ["hello  world"]})  # double space
        words, characters = calculate_text_metrics(df, ["text"])


class TestCalculateTextMetricsEdgeCases:
    """Edge case test cases for unusual or extreme conditions."""

    def test_empty_dataframe(self):
        """Test with empty dataframe."""
        df = pd.DataFrame({"text": []})
        words, characters = calculate_text_metrics(df, ["text"])

    def test_empty_string_entries(self):
        """Test with empty string values."""
        df = pd.DataFrame({"text": ["", "", ""]})
        words, characters = calculate_text_metrics(df, ["text"])

    def test_none_values_in_column(self):
        """Test with None/NaN values in the column."""
        df = pd.DataFrame({"text": ["hello", None, "world"]})
        words, characters = calculate_text_metrics(df, ["text"])

    def test_column_not_in_dataframe(self):
        """Test when specified column doesn't exist in dataframe."""
        df = pd.DataFrame({"other_col": ["hello"]})
        words, characters = calculate_text_metrics(df, ["missing_col"])

    def test_mix_of_existing_and_missing_columns(self):
        """Test when some columns exist and some don't."""
        df = pd.DataFrame({"col1": ["hello world"], "col2": ["foo"]})
        words, characters = calculate_text_metrics(df, ["col1", "missing", "col2"])

    def test_whitespace_only_strings(self):
        """Test with strings that contain only whitespace."""
        df = pd.DataFrame({"text": ["   ", "\t", "\n"]})
        words, characters = calculate_text_metrics(df, ["text"])

    def test_single_character_entries(self):
        """Test with single character entries."""
        df = pd.DataFrame({"text": ["a", "b", "c"]})
        words, characters = calculate_text_metrics(df, ["text"])

    def test_numeric_datatype_column(self):
        """Test conversion of numeric columns to string."""
        df = pd.DataFrame({"numbers": [123, 456, 789]})
        words, characters = calculate_text_metrics(df, ["numbers"])

    def test_float_datatype_column(self):
        """Test conversion of float columns to string."""
        df = pd.DataFrame({"floats": [1.5, 2.7, 3.14]})
        words, characters = calculate_text_metrics(df, ["floats"])

    def test_boolean_datatype_column(self):
        """Test conversion of boolean columns to string."""
        df = pd.DataFrame({"bools": [True, False, True]})
        words, characters = calculate_text_metrics(df, ["bools"])

    def test_very_long_single_word(self):
        """Test with extremely long single word."""
        long_word = "a" * 1000
        df = pd.DataFrame({"text": [long_word]})
        words, characters = calculate_text_metrics(df, ["text"])

    def test_empty_column_list(self):
        """Test with empty column list."""
        df = pd.DataFrame({"text": ["hello world"]})
        words, characters = calculate_text_metrics(df, [])

    def test_duplicate_column_names_in_list(self):
        """Test when same column is specified multiple times."""
        df = pd.DataFrame({"text": ["hello"]})
        words, characters = calculate_text_metrics(df, ["text", "text"])

    def test_case_sensitivity_in_column_names(self):
        """Test that column name lookup is case-sensitive."""
        df = pd.DataFrame({"Text": ["hello"]})
        words, characters = calculate_text_metrics(df, ["text"])  # lowercase

    def test_special_characters_and_unicode(self):
        """Test with special characters and unicode characters."""
        df = pd.DataFrame({"text": ["héllo wørld 你好"]})
        words, characters = calculate_text_metrics(df, ["text"])

    def test_tabs_as_word_separators(self):
        """Test that tabs are treated as word separators."""
        df = pd.DataFrame({"text": ["hello\tworld\ttest"]})
        words, characters = calculate_text_metrics(df, ["text"])

    def test_newlines_as_word_separators(self):
        """Test that newlines are treated as word separators."""
        df = pd.DataFrame({"text": ["hello\nworld\ntest"]})
        words, characters = calculate_text_metrics(df, ["text"])


class TestCalculateTextMetricsLargeScale:
    """Large scale test cases for performance and scalability assessment."""

    def test_large_number_of_rows(self):
        """Test with large number of rows in dataframe."""
        # Create 500 rows with varying text content
        texts = ["hello world"] * 500
        df = pd.DataFrame({"text": texts})
        words, characters = calculate_text_metrics(df, ["text"])

    def test_large_text_content(self):
        """Test with large text content in single entry."""
        # Create a single entry with many words
        large_text = " ".join(["word"] * 500)
        df = pd.DataFrame({"text": [large_text]})
        words, characters = calculate_text_metrics(df, ["text"])

    def test_many_columns(self):
        """Test with many text columns."""
        # Create 50 columns with text
        data = {f"col_{i}": ["hello world"] for i in range(50)}
        df = pd.DataFrame(data)
        column_names = [f"col_{i}" for i in range(50)]
        words, characters = calculate_text_metrics(df, column_names)

    def test_mixed_content_large_dataframe(self):
        """Test with mixed content types in large dataframe."""
        # Create 300 rows with mixed content
        texts = [
            "hello world",
            "foo bar baz",
            "single",
            "",
            None,
            "a b c d e f g h i j",
        ] * 50
        df = pd.DataFrame({"text": texts})
        words, characters = calculate_text_metrics(df, ["text"])

    def test_multiple_columns_large_dataframe(self):
        """Test with multiple columns and large dataframe."""
        # Create 200 rows with 5 columns
        df = pd.DataFrame({
            "col1": ["hello"] * 200,
            "col2": ["world test"] * 200,
            "col3": ["foo bar baz"] * 200,
            "col4": ["single"] * 200,
            "col5": ["a b c"] * 200,
        })
        words, characters = calculate_text_metrics(df, ["col1", "col2", "col3", "col4", "col5"])

    def test_very_long_words_in_large_data(self):
        """Test performance with very long words in large dataset."""
        # Create 100 rows with long words
        long_word = "a" * 100
        texts = [f"{long_word} {long_word}"] * 100
        df = pd.DataFrame({"text": texts})
        words, characters = calculate_text_metrics(df, ["text"])

    def test_sparse_data_large_dataframe(self):
        """Test with sparse data (many empty/None values) in large dataframe."""
        # Create 400 rows where 75% are empty or None
        texts = ["hello world"] * 100 + [""] * 150 + [None] * 150
        df = pd.DataFrame({"text": texts})
        words, characters = calculate_text_metrics(df, ["text"])

    def test_categorical_data_large_scale(self):
        """Test with categorical-like data in large dataset."""
        # Create 300 rows with repeated categories
        categories = ["product review", "customer feedback", "user comment"] * 100
        df = pd.DataFrame({"text": categories})
        words, characters = calculate_text_metrics(df, ["text"])

    def test_incremental_data_accumulation(self):
        """Test that metrics accumulate correctly across rows."""
        # Create incrementally larger dataset and verify accumulation
        for size in [10, 50, 100, 200]:
            df = pd.DataFrame({"text": ["hello world"] * size})
            words, characters = calculate_text_metrics(df, ["text"])

    def test_memory_efficient_processing(self):
        """Test that function processes data efficiently without excessive memory use."""
        # Create a moderately large dataframe
        df = pd.DataFrame({
            "text1": ["hello world test"] * 300,
            "text2": ["foo bar baz"] * 300,
            "text3": ["single word"] * 300,
        })
        # This should complete without issues
        words, characters = calculate_text_metrics(df, ["text1", "text2", "text3"])

    def test_consistency_across_multiple_calls(self):
        """Test that multiple calls return consistent results."""
        df = pd.DataFrame({"text": ["hello world"] * 250})
        # Call function multiple times
        codeflash_output = calculate_text_metrics(df, ["text"]); result1 = codeflash_output
        codeflash_output = calculate_text_metrics(df, ["text"]); result2 = codeflash_output
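        codeflash_output = calculate_text_metrics(df, ["text"]); result3 = codeflash_output
        # Repeated calls on the same input should return identical results
        assert result1 == result2 == result3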
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-pr6732-2026-01-16T20.00.34 and push.

ogabrielluiz and others added 29 commits June 12, 2025 13:49
…on and restructure complete installation groups
…local' to 'complete' installation and remove dev extra
# Conflicts:
#	pyproject.toml
#	src/backend/base/pyproject.toml
#	uv.lock
# Conflicts:
#	src/lfx/src/lfx/_assets/component_index.json
@codeflash-ai codeflash-ai Bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Jan 16, 2026
coderabbitai Bot commented Jan 16, 2026

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added the community Pull Request from an external contributor label Jan 16, 2026
codecov Bot commented Jan 16, 2026

Codecov Report

❌ Patch coverage is 25.00000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 34.24%. Comparing base (6b4f946) to head (12c5144).
⚠️ Report is 162 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...rc/backend/base/langflow/api/v1/knowledge_bases.py | 30.00% | 7 Missing ⚠️ |
| src/backend/base/langflow/api/build.py | 0.00% | 1 Missing ⚠️ |
| src/lfx/src/lfx/base/models/unified_models.py | 0.00% | 1 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (25.00%) is below the target coverage (40.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (40.80%) is below the target coverage (60.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #11332      +/-   ##
==========================================
- Coverage   34.24%   34.24%   -0.01%     
==========================================
  Files        1409     1409              
  Lines       66929    66936       +7     
  Branches     9877     9877              
==========================================
+ Hits        22918    22919       +1     
- Misses      42810    42816       +6     
  Partials     1201     1201              
| Flag | Coverage Δ | |
|---|---|---|
| backend | 53.52% <27.27%> (-0.02%) | ⬇️ |
| lfx | 40.80% <0.00%> (ø) | |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/backend/base/langflow/api/build.py | 71.01% <0.00%> (-0.73%) | ⬇️ |
| src/lfx/src/lfx/base/models/unified_models.py | 23.74% <0.00%> (ø) | |
| ...rc/backend/base/langflow/api/v1/knowledge_bases.py | 17.03% <30.00%> (+0.68%) | ⬆️ |

... and 4 files with indirect coverage changes

Base automatically changed from better-langflow-base to main January 21, 2026 23:20
@ogabrielluiz
Contributor

Closing automated codeflash PR.

@codeflash-ai codeflash-ai Bot deleted the codeflash/optimize-pr6732-2026-01-16T20.00.34 branch March 3, 2026 18:12