⚡️ Speed up function calculate_text_metrics by 302% in PR #9088 (feat-knowledge-bases) #9293

Closed
codeflash-ai[bot] wants to merge 167 commits into main from codeflash/optimize-pr9088-2025-08-01T19.42.14

@codeflash-ai codeflash-ai Bot commented Aug 1, 2025

⚡️ This pull request contains optimizations for PR #9088

If you approve this dependent PR, these changes will be merged into the original PR branch feat-knowledge-bases.

This PR will be automatically closed if the original PR is merged.


📄 302% (3.02x) speedup for calculate_text_metrics in src/backend/base/langflow/api/v1/knowledge_bases.py

⏱️ Runtime: 52.2 milliseconds → 13.0 milliseconds (best of 126 runs)
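
The headline percentage can be reproduced from the two runtimes reported here. This is a minimal sketch, assuming the figure is computed as time saved relative to the optimized runtime; `speedup_percent` is a hypothetical helper for illustration, not part of Codeflash.

```python
def speedup_percent(original_ms: float, optimized_ms: float) -> float:
    """Relative speedup in percent: time saved divided by the optimized runtime."""
    if optimized_ms <= 0:
        raise ValueError("optimized runtime must be positive")
    return (original_ms - optimized_ms) / optimized_ms * 100


# Runtimes reported for this PR, in milliseconds.
print(round(speedup_percent(52.2, 13.0)))  # 302
```

Under this assumed definition, a 302% speedup means the optimized function takes roughly a quarter of the original time.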

📝 Explanation and details

Here’s an optimized rewrite that preserves the function name, parameters, and documented behavior. The biggest bottleneck is repeatedly converting columns to strings and splitting them with str.split(), both of which are slow in pandas for large DataFrames.
You can avoid the overhead of astype(str) and str.split by using NumPy vectorization directly on the underlying array, with a fallback for object-dtype columns.
I’ll also check column existence in a single batch for a small performance gain, and limit each column to a single astype(str) and .fillna("") call.
Here’s the optimized code.

Key Optimizations.

  • Uses np.char.count for word boundary counting (count spaces + 1 for non-empty).
  • Operates on columns only once (avoids repeated astype(str) or fillna) per column.
  • Handles all dtypes: vectorized calculation for string types, fast fallback for object dtype.
  • Reduces per-row Python overhead to the unavoidable minimum.
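
The first two points can be illustrated with a small sketch. It is not the code from this PR: `column_metrics_numpy` and `column_metrics_split` are hypothetical helpers, and the sketch assumes missing values are treated as empty strings, as the generated tests describe.

```python
import numpy as np
import pandas as pd


def column_metrics_numpy(series: pd.Series) -> tuple[int, int]:
    """Word and character totals using NumPy string routines on one converted array."""
    # Convert once and reuse the same array for both metrics.
    arr = series.fillna("").astype(str).to_numpy(dtype=str)
    lengths = np.char.str_len(arr)
    spaces = np.char.count(arr, " ")
    # "Count spaces + 1" for non-empty cells only.
    words = int((spaces + 1)[lengths > 0].sum())
    chars = int(lengths.sum())
    return words, chars


def column_metrics_split(series: pd.Series) -> tuple[int, int]:
    """Word and character totals using pandas whitespace splitting."""
    text = series.fillna("").astype(str)
    return int(text.str.split().str.len().sum()), int(text.str.len().sum())


clean = pd.Series(["hi there", "bye", None, ""])
print(column_metrics_numpy(clean), column_metrics_split(clean))
```

Counting spaces only agrees with whitespace splitting when words are separated by single spaces; cells with repeated spaces, tabs, or newlines would need a fallback path to keep the totals identical.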

Performance

On wide and/or long DataFrames, this should substantially outperform chained pandas .str.split() calls and repeated type conversions; the benchmark for this PR measured 52.2 ms before and 13.0 ms after.
The generated regression tests report the same results as before.
All comments and docstrings for the original public APIs are unchanged; new ones are added only to clarify helpers.
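
A rough way to check a claim like this locally is to time a function on a wide DataFrame and keep the fastest run. The sketch below uses only `timeit` and pandas; `best_of`, `make_wide_frame`, and `split_word_count` are hypothetical names, and the workload is a stand-in for the real function.

```python
import timeit

import pandas as pd


def best_of(func, repeats: int = 5) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` single calls."""
    return min(timeit.repeat(func, number=1, repeat=repeats))


def make_wide_frame(n_rows: int, n_cols: int) -> pd.DataFrame:
    """Build a DataFrame of identical three-word cells."""
    return pd.DataFrame({f"col{i}": ["word1 word2 word3"] * n_rows for i in range(n_cols)})


def split_word_count(df: pd.DataFrame) -> int:
    """Stand-in workload: count words per column with pandas string splitting."""
    return sum(int(df[col].str.split().str.len().sum()) for col in df.columns)


df = make_wide_frame(n_rows=1000, n_cols=20)
elapsed = best_of(lambda: split_word_count(df))
print(f"best of 5: {elapsed * 1e3:.2f} ms")
```

Taking the minimum rather than the mean reduces the influence of scheduling noise, which matches the "best of N runs" style of the runtime figure.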

Let me know if you want a pure Pandas version or more numpy tricks!

Correctness verification report:

| Test | Status |
| --- | --- |
| ⏪ Replay Tests | 🔘 None Found |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 47 Passed |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
import pandas as pd
# imports
import pytest  # used for our unit tests
from langflow.api.v1.knowledge_bases import calculate_text_metrics

# -------------------- UNIT TESTS -------------------- #

# 1. BASIC TEST CASES

def test_single_column_single_row():
    # Single column, single row, simple text
    df = pd.DataFrame({'text': ['Hello world!']})
    words, chars = calculate_text_metrics(df, ['text'])

def test_single_column_multiple_rows():
    # Single column, multiple rows, simple texts
    df = pd.DataFrame({'text': ['The quick', 'brown fox', 'jumps']})
    words, chars = calculate_text_metrics(df, ['text'])

def test_multiple_columns():
    # Multiple columns, all included in text_columns
    df = pd.DataFrame({
        'a': ['hi there', 'bye'],
        'b': ['foo bar', 'baz']
    })
    words, chars = calculate_text_metrics(df, ['a', 'b'])
    expected_words = 2 + 1 + 2 + 1  # "hi" "there" | "bye" | "foo" "bar" | "baz"
    expected_chars = sum(len(s) for s in ['hi there', 'bye', 'foo bar', 'baz'])
    assert (words, chars) == (expected_words, expected_chars)

def test_non_text_column_ignored():
    # Non-text columns should not affect result
    df = pd.DataFrame({
        'text': ['abc def'],
        'value': [123]
    })
    words, chars = calculate_text_metrics(df, ['text'])

def test_missing_text_column():
    # Specified text column not present in DataFrame
    df = pd.DataFrame({'a': ['one two']})
    words, chars = calculate_text_metrics(df, ['b'])

def test_empty_dataframe():
    # Empty DataFrame
    df = pd.DataFrame({'text': []})
    words, chars = calculate_text_metrics(df, ['text'])

def test_empty_text_columns():
    # No text columns specified
    df = pd.DataFrame({'text': ['abc def']})
    words, chars = calculate_text_metrics(df, [])

def test_multiple_columns_some_missing():
    # Some columns present, some missing
    df = pd.DataFrame({'a': ['foo bar'], 'c': ['baz']})
    words, chars = calculate_text_metrics(df, ['a', 'b', 'c'])
    expected_words = 2 + 1
    expected_chars = len('foo bar') + len('baz')
    assert (words, chars) == (expected_words, expected_chars)

# 2. EDGE TEST CASES

def test_column_with_nan_values():
    # Column contains NaN values
    df = pd.DataFrame({'text': ['hello', None, 'world', float('nan')]})
    words, chars = calculate_text_metrics(df, ['text'])

def test_column_with_empty_strings():
    # Column contains empty strings
    df = pd.DataFrame({'text': ['', ' ', '   ', 'word']})
    words, chars = calculate_text_metrics(df, ['text'])

def test_column_with_only_whitespace():
    # Only whitespace in all rows
    df = pd.DataFrame({'text': [' ', '   ', '\t', '\n']})
    words, chars = calculate_text_metrics(df, ['text'])

def test_column_with_punctuation():
    # Text with punctuation and special characters
    df = pd.DataFrame({'text': ['Hello, world!', 'Goodbye...']})
    words, chars = calculate_text_metrics(df, ['text'])

def test_column_with_numbers_and_symbols():
    # Text with numbers and symbols
    df = pd.DataFrame({'text': ['abc123 !@#', '456 789']})
    words, chars = calculate_text_metrics(df, ['text'])

def test_non_string_types_in_column():
    # Column contains non-string types (ints, floats, bools)
    df = pd.DataFrame({'text': [123, 4.56, True, None]})
    words, chars = calculate_text_metrics(df, ['text'])

def test_column_with_multiline_strings():
    # Multiline strings
    df = pd.DataFrame({'text': ['hello\nworld', 'foo\nbar baz']})
    words, chars = calculate_text_metrics(df, ['text'])

def test_column_with_unicode_characters():
    # Unicode and emoji
    df = pd.DataFrame({'text': ['こんにちは 世界', '😊👍']})
    words, chars = calculate_text_metrics(df, ['text'])

def test_column_with_long_word():
    # A single very long word
    long_word = 'a' * 100
    df = pd.DataFrame({'text': [long_word]})
    words, chars = calculate_text_metrics(df, ['text'])

def test_column_with_leading_trailing_spaces():
    # Text with leading/trailing/multiple spaces
    df = pd.DataFrame({'text': ['  hello   world  ', '   foo   bar']})
    words, chars = calculate_text_metrics(df, ['text'])

# 3. LARGE SCALE TEST CASES

def test_large_number_of_rows():
    # 1000 rows, each with a simple sentence
    n = 1000
    df = pd.DataFrame({'text': ['word1 word2 word3'] * n})
    words, chars = calculate_text_metrics(df, ['text'])

def test_large_number_of_columns():
    # 50 columns, 20 rows, each cell with one word
    n_cols = 50
    n_rows = 20
    data = {f'col{i}': ['word'] * n_rows for i in range(n_cols)}
    df = pd.DataFrame(data)
    words, chars = calculate_text_metrics(df, list(df.columns))

def test_large_mixed_content():
    # 500 rows, 3 columns: text, numbers, empty
    n = 500
    df = pd.DataFrame({
        'text': ['foo bar baz'] * n,
        'numbers': [str(i) for i in range(n)],
        'empty': [''] * n
    })
    words, chars = calculate_text_metrics(df, ['text', 'numbers', 'empty'])
    # 'foo bar baz' -> 3 words, len=11
    # numbers -> 1 word per row, len varies
    # empty -> 0 word, 0 char per row
    expected_words = 3 * n + n
    expected_chars = 11 * n + sum(len(str(i)) for i in range(n))
    assert (words, chars) == (expected_words, expected_chars)

def test_large_with_nans_and_empty():
    # 100 rows, some NaN, some empty, some text
    n = 100
    df = pd.DataFrame({
        'a': ['foo bar'] * (n // 2) + [None] * (n // 4) + [''] * (n - (n // 2) - (n // 4)),
        'b': [None] * n
    })
    words, chars = calculate_text_metrics(df, ['a', 'b'])
    # 'foo bar' -> 2 words, len=7
    # None -> 0 word, 0 char
    # '' -> 0 word, 0 char
    expected_words = 2 * (n // 2)
    expected_chars = 7 * (n // 2)
    assert (words, chars) == (expected_words, expected_chars)

def test_large_column_with_long_strings():
    # 100 rows, each with a string of length 100
    n = 100
    long_str = 'x' * 100
    df = pd.DataFrame({'text': [long_str] * n})
    words, chars = calculate_text_metrics(df, ['text'])
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
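
The expectations written in these tests (missing columns are ignored, None and NaN count as empty, words are whitespace-separated) are consistent with a reference implementation along the following lines. This is a hypothetical sketch for reading the tests, named `reference_text_metrics` to make clear it is not the function in knowledge_bases.py.

```python
import pandas as pd


def reference_text_metrics(df: pd.DataFrame, text_columns: list[str]) -> tuple[int, int]:
    """Return (total_words, total_chars) across the requested text columns."""
    total_words = 0
    total_chars = 0
    for column in text_columns:
        if column not in df.columns:
            continue  # columns absent from the DataFrame contribute nothing
        # Missing values become empty strings; other values are stringified.
        text = df[column].fillna("").astype(str)
        total_words += int(text.str.split().str.len().sum())
        total_chars += int(text.str.len().sum())
    return total_words, total_chars


df = pd.DataFrame({"text": ["Hello world!", "This is a test.", ""]})
print(reference_text_metrics(df, ["text", "not_a_column"]))  # (6, 27)
```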

import pandas as pd
# imports
import pytest  # used for our unit tests
from langflow.api.v1.knowledge_bases import calculate_text_metrics

# unit tests

# ---------------- BASIC TEST CASES ----------------

def test_single_column_single_row():
    # Test with a single column and single row
    df = pd.DataFrame({'text': ['Hello world!']})
    # "Hello world!" -> 2 words, 12 characters
    codeflash_output = calculate_text_metrics(df, ['text'])

def test_single_column_multiple_rows():
    # Test with a single column and multiple rows
    df = pd.DataFrame({'text': ['Hello world!', 'This is a test.', '']})
    # Row 1: 2 words, 12 chars
    # Row 2: 4 words, 15 chars
    # Row 3: 0 words, 0 chars
    # Total: 6 words, 27 chars
    codeflash_output = calculate_text_metrics(df, ['text'])

def test_multiple_columns():
    # Test with multiple text columns
    df = pd.DataFrame({
        'title': ['Hello', 'Bye'],
        'body': ['World!', 'Everyone.']
    })
    # title: 1+1 words, 5+3 chars
    # body: 1+1 words, 6+9 chars
    # Total: 4 words, 5+3+6+9 = 23 chars
    codeflash_output = calculate_text_metrics(df, ['title', 'body'])

def test_non_string_types():
    # Test with numbers and None in the text column
    df = pd.DataFrame({'text': [123, None, 'abc def']})
    # 123 -> '123' (1 word, 3 chars)
    # None -> '' (0 word, 0 chars)
    # 'abc def' -> 2 words, 7 chars
    codeflash_output = calculate_text_metrics(df, ['text'])

def test_ignore_nonexistent_column():
    # Test with a column name that doesn't exist
    df = pd.DataFrame({'text': ['abc']})
    # Should ignore 'not_a_column'
    codeflash_output = calculate_text_metrics(df, ['text', 'not_a_column'])

def test_empty_dataframe():
    # Test with an empty DataFrame
    df = pd.DataFrame({'text': []})
    codeflash_output = calculate_text_metrics(df, ['text'])

def test_empty_text_columns():
    # Test with empty text_columns list
    df = pd.DataFrame({'text': ['abc def']})
    codeflash_output = calculate_text_metrics(df, [])

def test_no_text_columns_in_df():
    # Test with text_columns that are not in DataFrame
    df = pd.DataFrame({'foo': ['bar']})
    codeflash_output = calculate_text_metrics(df, ['baz'])

# ---------------- EDGE TEST CASES ----------------

def test_all_nan_values():
    # All values are NaN
    df = pd.DataFrame({'text': [None, None, None]})
    codeflash_output = calculate_text_metrics(df, ['text'])

def test_mixed_types_and_nans():
    # Mixture of str, int, float, None
    df = pd.DataFrame({'text': ['abc', 123, None, 4.56, 'def ghi']})
    # 'abc' -> 1 word, 3 chars
    # 123 -> '123' -> 1 word, 3 chars
    # None -> '' -> 0 word, 0 chars
    # 4.56 -> '4.56' -> 1 word, 4 chars
    # 'def ghi' -> 2 words, 7 chars
    codeflash_output = calculate_text_metrics(df, ['text'])

def test_column_with_only_spaces():
    # String with only whitespace
    df = pd.DataFrame({'text': ['   ', 'a b', '', '   ']})
    # '   ' -> 0 words, 3 chars
    # 'a b' -> 2 words, 3 chars
    # '' -> 0 words, 0 chars
    # '   ' -> 0 words, 3 chars
    codeflash_output = calculate_text_metrics(df, ['text'])

def test_column_with_tabs_and_newlines():
    # Test with tabs and newlines
    df = pd.DataFrame({'text': ['a\tb\nc', 'd\te', '\n\n', '\t']})
    # 'a\tb\nc' -> 'a', 'b', 'c' (3 words), 5 chars
    # 'd\te' -> 'd', 'e' (2 words), 3 chars
    # '\n\n' -> 0 words, 2 chars
    # '\t' -> 0 words, 1 char
    codeflash_output = calculate_text_metrics(df, ['text'])

def test_column_with_punctuation():
    # Test with punctuation
    df = pd.DataFrame({'text': ['Hello, world!', 'Well-done.', "It's fine."]})
    # 'Hello, world!' -> 2 words, 13 chars
    # 'Well-done.' -> 1 word, 10 chars
    # "It's fine." -> 2 words, 10 chars
    codeflash_output = calculate_text_metrics(df, ['text'])

def test_unicode_and_emojis():
    # Test with unicode and emojis
    df = pd.DataFrame({'text': ['こんにちは', '😀😃😄', 'word 😀']})
    # 'こんにちは' -> 1 word, 5 chars
    # '😀😃😄' -> 1 word, 3 chars
    # 'word 😀' -> 2 words, 6 chars
    codeflash_output = calculate_text_metrics(df, ['text'])

def test_column_with_leading_trailing_spaces():
    # Test with leading/trailing/multiple spaces
    df = pd.DataFrame({'text': ['  a  b  c  ', '   ', 'd e']})
    # '  a  b  c  ' -> 3 words, 11 chars
    # '   ' -> 0 words, 3 chars
    # 'd e' -> 2 words, 3 chars
    codeflash_output = calculate_text_metrics(df, ['text'])

def test_multiple_columns_some_missing():
    # Some columns exist, some don't
    df = pd.DataFrame({'col1': ['a b', 'c'], 'col2': ['d', 'e f g']})
    # col1: 'a b' (2w,3c), 'c' (1w,1c)
    # col2: 'd' (1w,1c), 'e f g' (3w,5c)
    # col3 missing
    # total: 2+1+1+3=7 words, 3+1+1+5=10 chars
    codeflash_output = calculate_text_metrics(df, ['col1', 'col2', 'col3'])

def test_column_with_empty_strings_and_none():
    # Mix of empty strings and None
    df = pd.DataFrame({'text': ['', None, ' ', 'a']})
    # '' -> 0w,0c; None->0w,0c; ' '->0w,1c; 'a'->1w,1c
    codeflash_output = calculate_text_metrics(df, ['text'])

def test_column_with_long_string_and_no_spaces():
    # Long string, no spaces
    long_str = 'a'*500
    df = pd.DataFrame({'text': [long_str]})
    # 1 word, 500 chars
    codeflash_output = calculate_text_metrics(df, ['text'])

# ---------------- LARGE SCALE TEST CASES ----------------

def test_large_dataframe_single_column():
    # 1000 rows, each with 'word word'
    df = pd.DataFrame({'text': ['word word']*1000})
    # Each row: 2 words, 9 chars; total: 2000 words, 9000 chars
    codeflash_output = calculate_text_metrics(df, ['text'])

def test_large_dataframe_multiple_columns():
    # 500 rows, 2 columns
    df = pd.DataFrame({
        'col1': ['a b c']*500,
        'col2': ['d e']*500
    })
    # col1: 3w,5c per row; col2: 2w,3c per row
    # total words: (3+2)*500=2500; chars: (5+3)*500=4000
    codeflash_output = calculate_text_metrics(df, ['col1', 'col2'])

def test_large_dataframe_with_missing_and_nonexistent_columns():
    # 1000 rows, 2 columns, one missing
    df = pd.DataFrame({'col1': ['x y']*1000})
    # col1: 2w,3c per row; col2 missing
    # total: 2000 words, 3000 chars
    codeflash_output = calculate_text_metrics(df, ['col1', 'col2'])

def test_large_dataframe_with_varied_content():
    # 1000 rows, alternating between 'a b', '', None, 'c'
    data = ['a b', '', None, 'c'] * 250  # 1000 rows
    df = pd.DataFrame({'text': data})
    # 'a b' -> 2w,3c; ''->0w,0c; None->0w,0c; 'c'->1w,1c
    # 250 of each: (2*250)+(0*250)+(0*250)+(1*250)=750 words
    # (3*250)+(0*250)+(0*250)+(1*250)=1000 chars
    codeflash_output = calculate_text_metrics(df, ['text'])

def test_large_dataframe_all_empty():
    # 1000 rows, all empty strings
    df = pd.DataFrame({'text': ['']*1000})
    codeflash_output = calculate_text_metrics(df, ['text'])

def test_large_dataframe_all_none():
    # 1000 rows, all None
    df = pd.DataFrame({'text': [None]*1000})
    codeflash_output = calculate_text_metrics(df, ['text'])
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
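
The comparison described in that comment is a differential check: run the original and the candidate on the same inputs and report any disagreement. A minimal pure-Python sketch follows, where `find_mismatches` and both word counters are hypothetical, and the space-counting candidate is a deliberately simplified stand-in.

```python
from typing import Any, Callable, Iterable


def find_mismatches(
    original: Callable[[Any], Any],
    candidate: Callable[[Any], Any],
    cases: Iterable[Any],
) -> list[tuple[Any, Any, Any]]:
    """Return (case, expected, actual) for every input where the outputs differ."""
    mismatches = []
    for case in cases:
        expected = original(case)
        actual = candidate(case)
        if expected != actual:
            mismatches.append((case, expected, actual))
    return mismatches


def words_by_split(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def words_by_spaces(text: str) -> int:
    """Count words as spaces + 1 for non-empty text."""
    return text.count(" ") + 1 if text else 0


cases = ["Hello world!", "", "word", "  a  b  c  ", "a\tb\nc"]
for case, expected, actual in find_mismatches(words_by_split, words_by_spaces, cases):
    print(repr(case), expected, actual)
```

Edge cases such as repeated spaces and tab or newline separators are the inputs most likely to expose a difference, which is why the generated suites include them.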

To edit these changes, run git checkout codeflash/optimize-pr9088-2025-08-01T19.42.14, commit your edits, and push.

deon-sanchez and others added 30 commits July 16, 2025 14:15
…across components

- Updated import statements to use consistent single quotes.
- Refactored various components to enhance readability and maintainability.
- Adjusted folder and file handling logic in the sidebar and file manager components.
- Introduced a new tabbed interface for the files page to separate files and knowledge bases, improving user experience.
- Added a new FilesPage component to manage file uploads and organization.
- Implemented a tabbed interface to separate Files and Knowledge Bases for improved user experience.
- Created FilesTab and KnowledgeBasesTab components for handling respective functionalities.
- Refactored routing to accommodate the new structure and updated import statements for consistency.
- Removed the old filesPage component to streamline the codebase.
…mponents. Adjust tab handling in the assets page to reflect URL changes and improve user navigation experience.
…BaseSelectionOverlay components. Refactor KnowledgeBasesTab to utilize new components and improve UI for knowledge base management. Introduce utility functions for formatting numbers and average chunk sizes.
deon-sanchez and others added 21 commits July 29, 2025 15:06
- Renamed functions and variables to improve clarity regarding single-toggle columns (Vectorize and Identifier).
- Updated logic to ensure proper editability checks for single-toggle columns.
- Adjusted related components to reflect changes in column handling and rendering.
Replaces the hardcoded knowledge base directory path with a value from the settings service. This improves configurability and centralizes directory management.
- Changed expected title text from "My Files" to "Files" for accuracy.
- Removed unnecessary parentheses in arrow functions for cleaner syntax.
- Updated test assertions to ensure visibility checks are clear and consistent.
- Improved readability by standardizing the formatting of test cases.
- Changed expected title text from "My Files" to "Files" to reflect the correct page title.
@codeflash-ai codeflash-ai Bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Aug 1, 2025

coderabbitai Bot commented Aug 1, 2025

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


sonarqubecloud Bot commented Aug 1, 2025

Base automatically changed from feat-knowledge-bases to main August 13, 2025 20:39
@codeflash-ai codeflash-ai Bot closed this Aug 14, 2025

codeflash-ai Bot commented Aug 14, 2025

This PR has been automatically closed because the original PR #9388 by zhangsichu was closed.

@codeflash-ai codeflash-ai Bot deleted the codeflash/optimize-pr9088-2025-08-01T19.42.14 branch August 14, 2025 07:42
Labels

⚡️ codeflash Optimization PR opened by Codeflash AI

3 participants