⚡️ Speed up function sanitize_filename by 11% in PR #10819 (s3-file-size-and-associations-to-flows) #10823

Closed
codeflash-ai[bot] wants to merge 2 commits into s3-file-size-and-associations-to-flows from codeflash/optimize-pr10819-2025-12-01T20.19.10

Conversation

@codeflash-ai codeflash-ai Bot commented Dec 1, 2025

⚡️ This pull request contains optimizations for PR #10819

If you approve this dependent PR, these changes will be merged into the original PR branch s3-file-size-and-associations-to-flows.

This PR will be automatically closed if the original PR is merged.


📄 11% (0.11x) speedup for sanitize_filename in src/backend/base/langflow/api/v2/files.py

⏱️ Runtime : 1.69 milliseconds → 1.52 milliseconds (best of 102 runs)
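
The headline figure can be reproduced from the two runtimes, assuming the speedup is defined as the time saved divided by the optimized runtime (that definition is an assumption here, chosen because it matches the reported 0.11x). A minimal sketch:

```python
# Runtimes reported above, in milliseconds.
original_ms = 1.69
optimized_ms = 1.52

# Assumed definition: time saved relative to the optimized runtime.
speedup = (original_ms - optimized_ms) / optimized_ms

print(f"{speedup:.2f}x ({speedup:.0%})")
```

Under that definition the two runtimes give roughly 0.11x, which is consistent with the 11% in the title.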

📝 Explanation and details

The optimized code achieves an 11% speedup through two optimizations that target the most expensive operations in the function:

Primary Optimization - Precompiled Regex Pattern:
The regex pattern r"[^\w.\- ()]" is now precompiled as a module-level constant _DANGEROUS_CHARS_RE. The line profiler shows the regex substitution time dropping from 1.87ms (23.8% of runtime) to 0.54ms (8.7% of runtime), a 71% reduction in this operation's cost. Python's re module caches compiled patterns, so re.sub with a pattern string does not recompile on every call, but it still pays for a cache lookup each time. Since this function processes filenames repeatedly, calling the precompiled pattern's sub method directly removes that per-call overhead.
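
A minimal sketch of the two call styles, for illustration only. The character class and the constant name come from the description above; the two helper functions are hypothetical and are not langflow's code.

```python
import re

# Pattern and constant name as given in the PR description.
_DANGEROUS_CHARS_RE = re.compile(r"[^\w.\- ()]")


def replace_dangerous_inline(name: str) -> str:
    # Hypothetical helper: passes the pattern string on every call,
    # so `re` has to look the compiled pattern up each time.
    return re.sub(r"[^\w.\- ()]", "_", name)


def replace_dangerous_precompiled(name: str) -> str:
    # Hypothetical helper: reuses the module-level compiled pattern.
    return _DANGEROUS_CHARS_RE.sub("_", name)


sample = "my*file@name#2024!.txt"
cleaned = replace_dangerous_precompiled(sample)
print(cleaned)  # my_file_name_2024_.txt
```

Both helpers return the same string for any input; only the per-call overhead differs.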

Secondary Optimization - PurePath vs Path:
Replacing Path(filename).name with PurePath(filename).name provides a modest improvement. Extracting the .name component is a purely lexical operation in both classes, and neither touches the filesystem. PurePath supports only path-string manipulation and carries none of Path's I/O methods, which is all this step needs, and the PR reports it as slightly faster here.
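
The sketch below illustrates the point with the standard pathlib classes. The sample paths are made up for illustration. It also shows that backslash-separated input is only split when Windows path rules apply, which may matter for the Windows-path cases in the generated tests.

```python
from pathlib import Path, PurePath, PurePosixPath, PureWindowsPath

candidate = "uploads/2024/report (final).pdf"

# Both expressions are purely lexical and yield the same final component.
name_from_path = Path(candidate).name
name_from_pure = PurePath(candidate).name

# PurePath exposes no I/O methods such as exists().
has_io = hasattr(PurePath(candidate), "exists")

# Backslashes are separators only under Windows path rules.
windows_style = r"C:\Users\user\secret.txt"
as_windows = PureWindowsPath(windows_style).name  # "secret.txt"
as_posix = PurePosixPath(windows_style).name      # the whole string, unchanged
```

PurePath resolves to PurePosixPath or PureWindowsPath depending on the host platform, so the result for backslash-separated input depends on where the code runs.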

Performance Impact Analysis:
Based on the test cases, these optimizations are particularly effective for:

  • High-frequency filename processing - The precompiled regex saves the pattern lookup overhead on every call
  • Large-scale operations - Tests show consistent benefits across various filename lengths and character types
  • Mixed content scenarios - Both simple filenames and complex cases with dangerous characters benefit from the regex optimization

The optimizations maintain identical security and functionality while reducing computational overhead, making this especially valuable if the function is called frequently in file upload or processing workflows.
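
To check the claim on a given machine, a micro-benchmark along these lines could be used. This is a rough sketch with an arbitrary sample string and iteration count, not the profiler setup behind the numbers above, and timings will vary by interpreter and hardware.

```python
import re
import timeit

PATTERN = r"[^\w.\- ()]"
COMPILED = re.compile(PATTERN)
SAMPLE = "bad<file>|name?.txt"  # arbitrary illustrative input


def inline() -> str:
    # Module-level function: looks the pattern up in re's cache on each call.
    return re.sub(PATTERN, "_", SAMPLE)


def precompiled() -> str:
    # Compiled pattern object: no per-call lookup.
    return COMPILED.sub("_", SAMPLE)


# Take the best of several repeats to reduce scheduling noise.
inline_s = min(timeit.repeat(inline, number=5_000, repeat=5))
precompiled_s = min(timeit.repeat(precompiled, number=5_000, repeat=5))
print(f"inline: {inline_s:.4f}s  precompiled: {precompiled_s:.4f}s")
```

Taking the minimum over repeats follows the usual timeit guidance, since higher values are typically caused by other processes interfering rather than by the code under test.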

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 178 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import re
import string  # used for generating large filenames
from pathlib import Path

# imports
import pytest  # used for our unit tests
from langflow.api.v2.files import sanitize_filename

MAX_FILENAME_LENGTH = 255
# Maximum reasonable extension length
MAX_EXTENSION_LENGTH = 20

# unit tests

# -------------------------
# Basic Test Cases
# -------------------------

def test_basic_alphanumeric_filename():
    # Should remain unchanged
    codeflash_output = sanitize_filename("file123.txt")

def test_basic_spaces_and_hyphens():
    # Spaces and hyphens are allowed
    codeflash_output = sanitize_filename("my file-name.txt")

def test_basic_underscores_and_parentheses():
    # Underscores and parentheses are allowed
    codeflash_output = sanitize_filename("data_set (v2).csv")

def test_basic_multiple_dots():
    # Multiple dots are allowed
    codeflash_output = sanitize_filename("archive.tar.gz")

def test_basic_extension_only():
    # Leading dot is stripped; the rest of the name should be kept
    codeflash_output = sanitize_filename(".gitignore")

# -------------------------
# Edge Test Cases
# -------------------------

def test_empty_string():
    # Should return "unnamed" for empty input
    codeflash_output = sanitize_filename("")

def test_none_string():
    # Should return "unnamed" for None input
    codeflash_output = sanitize_filename(None)

def test_path_traversal_attempt():
    # Should strip path traversal and sanitize dangerous chars
    codeflash_output = sanitize_filename("../../etc/passwd")

def test_windows_path_separator():
    # Should strip Windows path and sanitize
    codeflash_output = sanitize_filename("C:\\Users\\user\\secret.txt")

def test_leading_trailing_spaces_and_dots():
    # Should strip leading/trailing spaces and dots
    codeflash_output = sanitize_filename("  .hiddenfile. ")

def test_only_dangerous_chars():
    # Should sanitize all dangerous chars to underscore, then fallback to "unnamed"
    codeflash_output = sanitize_filename("////")
    codeflash_output = sanitize_filename("<<>>")

def test_control_characters():
    # Control chars should be replaced by underscores
    codeflash_output = sanitize_filename("bad\x00file.txt")

def test_quotes_and_semicolons():
    # Quotes and semicolons become underscores
    codeflash_output = sanitize_filename("weird'file;name.txt")

def test_unicode_characters():
    # Intended: non-ASCII chars become underscores. Note that \w matches
    # Unicode letters unless re.ASCII is set, so they may be preserved instead.
    codeflash_output = sanitize_filename("файл.txt")
    codeflash_output = sanitize_filename("résumé.pdf")

def test_filename_with_multiple_extensions():
    # Should preserve all dots, sanitize dangerous chars
    codeflash_output = sanitize_filename("my.file.backup.tar.gz")

def test_filename_with_only_spaces():
    # Should strip spaces and fallback to "unnamed"
    codeflash_output = sanitize_filename("    ")

def test_filename_with_leading_trailing_underscores_and_dots():
    # Should only strip dots, not underscores
    codeflash_output = sanitize_filename(".__myfile__.txt.")

def test_filename_with_reserved_windows_names():
    # Should sanitize reserved names but not change them unless dangerous
    codeflash_output = sanitize_filename("CON.txt")
    codeflash_output = sanitize_filename("AUX")

def test_filename_with_no_extension():
    # Should sanitize and preserve filename
    codeflash_output = sanitize_filename("just_a_file")

def test_filename_with_long_extension():
    # Extension longer than MAX_EXTENSION_LENGTH should be truncated
    ext = "a" * (MAX_EXTENSION_LENGTH + 5)
    name = "file"
    codeflash_output = sanitize_filename(f"{name}.{ext}"); result = codeflash_output

def test_filename_with_hidden_file_dot():
    # Should strip leading dot
    codeflash_output = sanitize_filename(".hidden")

def test_filename_with_dot_only():
    # Should fallback to "unnamed"
    codeflash_output = sanitize_filename(".")

def test_filename_with_multiple_path_separators():
    # Should collapse all separators and sanitize
    codeflash_output = sanitize_filename("folder////file.txt")

def test_filename_with_non_ascii_and_spaces():
    # Should replace non-ascii, preserve spaces
    codeflash_output = sanitize_filename("résumé 2024.pdf")

# -------------------------
# Large Scale Test Cases
# -------------------------

def test_large_filename_max_length():
    # Create a filename exactly at MAX_FILENAME_LENGTH
    base = "a" * (MAX_FILENAME_LENGTH - 4)  # leave room for ".txt"
    filename = f"{base}.txt"
    codeflash_output = sanitize_filename(filename); sanitized = codeflash_output

def test_large_filename_over_max_length_with_short_extension():
    # Filename over max length, extension short enough to preserve
    base = "b" * (MAX_FILENAME_LENGTH + 50 - 4)
    filename = f"{base}.txt"
    codeflash_output = sanitize_filename(filename); sanitized = codeflash_output

def test_large_filename_over_max_length_with_long_extension():
    # Extension too long, should truncate whole filename
    ext = "c" * (MAX_EXTENSION_LENGTH + 10)
    base = "d" * (MAX_FILENAME_LENGTH + 50 - len(ext) - 1)
    filename = f"{base}.{ext}"
    codeflash_output = sanitize_filename(filename); sanitized = codeflash_output

def test_large_filename_with_dangerous_chars():
    # Large filename with many dangerous chars
    base = "e" * (MAX_FILENAME_LENGTH - 10)
    dangerous = "<>:\"/\\|?*" * 10
    filename = f"{base}{dangerous}.log"
    codeflash_output = sanitize_filename(filename); sanitized = codeflash_output

def test_large_filename_all_dangerous_chars():
    # Filename with only dangerous chars, long length
    filename = "<>:\"/\\|?*" * 40
    codeflash_output = sanitize_filename(filename); sanitized = codeflash_output

def test_large_filename_with_unicode():
    # Large filename with unicode characters
    base = "файл" * 50
    filename = f"{base}.txt"
    codeflash_output = sanitize_filename(filename); sanitized = codeflash_output

def test_large_filename_with_spaces_and_parentheses():
    # Large filename with spaces and parentheses
    base = ("(abc) " * 40).strip()
    filename = f"{base}.csv"
    codeflash_output = sanitize_filename(filename); sanitized = codeflash_output

def test_many_unique_large_filenames():
    # Test sanitizing many unique large filenames for determinism and performance
    for i in range(100):
        base = f"file_{i}_" + "x" * (MAX_FILENAME_LENGTH - 10)
        filename = f"{base}.dat"
        codeflash_output = sanitize_filename(filename); sanitized = codeflash_output

def test_filename_with_mixed_whitespace_and_dangerous_chars():
    # Mixed whitespace and dangerous chars, should sanitize correctly
    filename = "   bad<file>|name?.txt   "
    codeflash_output = sanitize_filename(filename); sanitized = codeflash_output

def test_filename_with_newlines_and_tabs():
    # Newlines and tabs should be replaced with underscores
    filename = "file\nname\t2024.doc"
    codeflash_output = sanitize_filename(filename); sanitized = codeflash_output
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import re
import string  # used for generating test strings
from pathlib import Path

# imports
import pytest  # used for our unit tests
from langflow.api.v2.files import sanitize_filename

MAX_FILENAME_LENGTH = 255
MAX_EXTENSION_LENGTH = 20

# unit tests

# --------------------------
# Basic Test Cases
# --------------------------

def test_basic_alphanumeric_filename():
    # Should remain unchanged
    codeflash_output = sanitize_filename("myfile.txt")

def test_basic_filename_with_spaces():
    # Spaces are allowed
    codeflash_output = sanitize_filename("my file.txt")

def test_basic_filename_with_underscore_and_dash():
    # Underscores and dashes are allowed
    codeflash_output = sanitize_filename("my_file-name.txt")

def test_basic_filename_with_parentheses():
    # Parentheses are allowed
    codeflash_output = sanitize_filename("report (final).pdf")

def test_basic_filename_with_multiple_dots():
    # Multiple dots are allowed
    codeflash_output = sanitize_filename("archive.tar.gz")

def test_basic_filename_with_uppercase():
    # Case should be preserved
    codeflash_output = sanitize_filename("Photo.JPG")

# --------------------------
# Edge Test Cases
# --------------------------

def test_empty_string_returns_unnamed():
    # Empty string should return "unnamed"
    codeflash_output = sanitize_filename("")

def test_none_string_returns_unnamed():
    # None should return "unnamed"
    codeflash_output = sanitize_filename(None)

def test_filename_with_only_invalid_characters():
    # All invalid chars should be replaced, then stripped, then fallback to "unnamed"
    codeflash_output = sanitize_filename("$$")

def test_filename_with_leading_and_trailing_spaces_and_dots():
    # Leading/trailing spaces and dots are stripped
    codeflash_output = sanitize_filename("   .hiddenfile.   ")

def test_filename_with_path_traversal():
    # Path traversal should be stripped
    codeflash_output = sanitize_filename("../../etc/passwd")

def test_filename_with_windows_path():
    # Windows path separators should be stripped
    codeflash_output = sanitize_filename("C:\\Users\\user\\Desktop\\file.txt")

def test_filename_with_mixed_separators():
    # Mixed separators should be stripped
    codeflash_output = sanitize_filename("folder/subfolder\\file.doc")

def test_filename_with_control_characters():
    # Control characters should be replaced with underscores
    codeflash_output = sanitize_filename("my\x00file.txt")

def test_filename_with_quotes_and_semicolon():
    # Quotes and semicolons replaced with underscores
    codeflash_output = sanitize_filename('my"file;name.txt')

def test_filename_with_unicode_characters():
    # Intended: non-ASCII chars become underscores. Note that \w matches
    # Unicode letters unless re.ASCII is set, so they may be preserved instead.
    codeflash_output = sanitize_filename("résumé.pdf")

def test_filename_with_only_dots():
    # Only dots should fallback to "unnamed"
    codeflash_output = sanitize_filename("...")

def test_filename_with_only_spaces():
    # Only spaces should fallback to "unnamed"
    codeflash_output = sanitize_filename("     ")

def test_filename_with_long_extension():
    # Extension longer than MAX_EXTENSION_LENGTH is truncated
    ext = "a" * (MAX_EXTENSION_LENGTH + 5)
    fname = f"file.{ext}"
    codeflash_output = sanitize_filename(fname); result = codeflash_output

def test_filename_with_short_extension_and_long_name():
    # Extension preserved, name truncated
    ext = "txt"
    name = "a" * (MAX_FILENAME_LENGTH + 10)
    fname = f"{name}.{ext}"
    codeflash_output = sanitize_filename(fname); result = codeflash_output

def test_filename_with_no_extension_and_long_name():
    # No extension, just truncate to max length
    name = "b" * (MAX_FILENAME_LENGTH + 50)
    codeflash_output = sanitize_filename(name); result = codeflash_output

def test_filename_with_dot_and_extension_length_exact_max():
    # Extension exactly MAX_EXTENSION_LENGTH, should be preserved
    ext = "e" * MAX_EXTENSION_LENGTH
    name = "x" * (MAX_FILENAME_LENGTH - MAX_EXTENSION_LENGTH - 1)
    fname = f"{name}.{ext}"
    codeflash_output = sanitize_filename(fname); result = codeflash_output

def test_filename_with_dot_and_extension_length_above_max():
    # Extension longer than MAX_EXTENSION_LENGTH, truncate whole filename
    ext = "e" * (MAX_EXTENSION_LENGTH + 1)
    name = "y" * (MAX_FILENAME_LENGTH - MAX_EXTENSION_LENGTH - 1)
    fname = f"{name}.{ext}"
    codeflash_output = sanitize_filename(fname); result = codeflash_output

def test_filename_with_multiple_dots_and_long_name():
    # Only last extension is considered
    ext = "ext"
    name = "x" * (MAX_FILENAME_LENGTH + 40)
    fname = f"{name}.tar.{ext}"
    codeflash_output = sanitize_filename(fname); result = codeflash_output

def test_filename_with_leading_dot_hidden_file():
    # Leading dot is stripped
    codeflash_output = sanitize_filename(".hidden")

def test_filename_with_trailing_dot():
    # Trailing dot is stripped
    codeflash_output = sanitize_filename("file.")

def test_filename_with_trailing_whitespace():
    # Trailing whitespace is stripped
    codeflash_output = sanitize_filename("file.txt   ")

def test_filename_with_leading_whitespace():
    # Leading whitespace is stripped
    codeflash_output = sanitize_filename("   file.txt")

def test_filename_with_multiple_consecutive_invalid_chars():
    # Multiple invalid chars replaced with underscores
    codeflash_output = sanitize_filename("file<>|*?.txt")

def test_filename_with_mixed_valid_and_invalid_chars():
    # Invalid chars replaced, valid preserved
    codeflash_output = sanitize_filename("my*file@name#2024!.txt")

def test_filename_with_newlines():
    # Newlines replaced with underscores
    codeflash_output = sanitize_filename("my\nfile.txt")

def test_filename_with_tab_character():
    # Tabs replaced with underscores
    codeflash_output = sanitize_filename("my\tfile.txt")

# --------------------------
# Large Scale Test Cases
# --------------------------

def test_large_filename_with_valid_chars():
    # Should be truncated to MAX_FILENAME_LENGTH
    fname = "a" * (MAX_FILENAME_LENGTH * 2)
    codeflash_output = sanitize_filename(fname); result = codeflash_output

def test_large_filename_with_invalid_chars():
    # Should be replaced and truncated
    fname = "$" * (MAX_FILENAME_LENGTH * 2)
    codeflash_output = sanitize_filename(fname); result = codeflash_output

def test_large_filename_with_long_extension():
    # Extension longer than MAX_EXTENSION_LENGTH, whole filename truncated
    ext = "z" * (MAX_EXTENSION_LENGTH + 50)
    name = "y" * (MAX_FILENAME_LENGTH * 2)
    fname = f"{name}.{ext}"
    codeflash_output = sanitize_filename(fname); result = codeflash_output

def test_large_filename_with_short_extension():
    # Name truncated, extension preserved
    ext = "abc"
    name = "n" * (MAX_FILENAME_LENGTH * 2)
    fname = f"{name}.{ext}"
    codeflash_output = sanitize_filename(fname); result = codeflash_output

def test_large_filename_with_mixed_valid_invalid_chars():
    # Mix of valid/invalid, replaced and truncated
    valid = "abcDEF123"
    invalid = "<>|*?/"
    fname = (valid + invalid) * 100
    codeflash_output = sanitize_filename(fname); result = codeflash_output
    # Should only contain allowed chars and underscores
    allowed = set(string.ascii_letters + string.digits + " .-_()")

def test_large_filename_with_path_components():
    # Path components stripped, only last part sanitized
    fname = "/".join([f"folder{i}" for i in range(50)]) + "/final_file.txt"
    codeflash_output = sanitize_filename(fname); result = codeflash_output

def test_large_filename_with_spaces_and_dots():
    # Spaces and dots allowed, but string truncated
    fname = (" . " * 1000) + "end.txt"
    codeflash_output = sanitize_filename(fname); result = codeflash_output

def test_large_filename_with_unicode_characters():
    # Unicode replaced with underscores, then truncated
    fname = "файл" * 300
    codeflash_output = sanitize_filename(fname); result = codeflash_output

def test_large_filename_with_multiple_dots_and_extensions():
    # Only last extension considered
    fname = "data." * 500 + "csv"
    codeflash_output = sanitize_filename(fname); result = codeflash_output

def test_large_filename_with_parentheses_and_valid_chars():
    # Parentheses preserved, string truncated
    fname = ("(abc)" * 200) + ".pdf"
    codeflash_output = sanitize_filename(fname); result = codeflash_output
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
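
The comparison described in that last comment can be sketched as follows. Both functions are simplified, hypothetical stand-ins written for this illustration: they cover only basename extraction, character replacement, stripping and the "unnamed" fallback suggested by the test comments, and they are not langflow's implementation.

```python
import re
from pathlib import Path, PurePath

_DANGEROUS_CHARS_RE = re.compile(r"[^\w.\- ()]")


def sanitize_original(filename):
    # Hypothetical "before" version: Path plus an inline pattern string.
    if not filename:
        return "unnamed"
    name = Path(filename).name
    name = re.sub(r"[^\w.\- ()]", "_", name)
    return name.strip(". ") or "unnamed"


def sanitize_optimized(filename):
    # Hypothetical "after" version: PurePath plus the precompiled pattern.
    if not filename:
        return "unnamed"
    name = PurePath(filename).name
    name = _DANGEROUS_CHARS_RE.sub("_", name)
    return name.strip(". ") or "unnamed"


# A few inputs borrowed from the generated tests above.
INPUTS = ["file123.txt", "", None, "../../etc/passwd", "  .hiddenfile. ",
          "my\x00file.txt", "...", "file<>|*?.txt"]

# Collect every input on which the two versions disagree.
mismatches = [value for value in INPUTS
              if sanitize_original(value) != sanitize_optimized(value)]
print(mismatches)
```

An empty mismatch list on a broad input set is the kind of evidence behind the "identical functionality" claim, though it only covers the inputs actually exercised.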

To edit these changes, run git checkout codeflash/optimize-pr10819-2025-12-01T20.19.10 and push.

@codeflash-ai codeflash-ai Bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Dec 1, 2025
coderabbitai Bot commented Dec 1, 2025

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added the community Pull Request from an external contributor label Dec 1, 2025
codecov Bot commented Dec 1, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 32.38%. Comparing base (ef63f8d) to head (a5111e6).

❌ Your project status has failed because the head coverage (40.04%) is below the target coverage (60.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@                            Coverage Diff                             @@
##           s3-file-size-and-associations-to-flows   #10823      +/-   ##
==========================================================================
- Coverage                                   32.39%   32.38%   -0.01%     
==========================================================================
  Files                                        1367     1367              
  Lines                                       63235    63225      -10     
  Branches                                     9358     9357       -1     
==========================================================================
- Hits                                        20482    20478       -4     
+ Misses                                      41720    41714       -6     
  Partials                                     1033     1033              
| Flag | Coverage Δ |
| --- | --- |
| frontend | 14.13% <ø> (ø) |
| lfx | 40.04% <ø> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/backend/base/langflow/api/v2/files.py | 59.10% <ø> (-0.15%) ⬇️ |

... and 4 files with indirect coverage changes

github-actions Bot commented Dec 1, 2025

Frontend Unit Test Coverage Report

Coverage Summary

| Lines | Statements | Branches | Functions |
| --- | --- | --- | --- |
| Coverage: 15% | 15.29% (4188/27381) | 8.49% (1778/20935) | 9.6% (579/6031) |

Unit Test Results

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 1638 | 0 💤 | 0 ❌ | 0 🔥 | 20.069s ⏱️ |

@ogabrielluiz

Closing automated codeflash PR.

@codeflash-ai codeflash-ai Bot deleted the codeflash/optimize-pr10819-2025-12-01T20.19.10 branch March 3, 2026 18:14