⚡️ Speed up function sanitize_data by 239% in PR #10820 (cz/add-logs-feature) #11190
codeflash-ai[bot] wants to merge 24 commits.
The optimized code achieves a **239% speedup** (from 5.73ms to 1.69ms) by introducing **memoization**: a wrapper decorated with `@cache` now fronts the `_is_sensitive_key` function.
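As a quick check of the headline figure, the arithmetic below reproduces it from the two reported runtimes. It assumes the percentage is computed as time saved relative to the optimized runtime, which is consistent with the numbers quoted here but is not stated explicitly in this description.

```python
# Reported runtimes from this PR description, in milliseconds.
original_ms = 5.73
optimized_ms = 1.69

# Assumed definition: time saved, relative to the optimized runtime.
speedup_pct = (original_ms - optimized_ms) / optimized_ms * 100

# Plain ratio of old runtime to new runtime, for comparison.
runtime_ratio = original_ms / optimized_ms

print(f"speedup: {speedup_pct:.0f}%")          # speedup: 239%
print(f"runtime ratio: {runtime_ratio:.2f}x")  # runtime ratio: 3.39x
```

Under that definition, a 239% speedup corresponds to the optimized code running in roughly 29% of the original time.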
## Key Optimization
The core change wraps `_is_sensitive_key` in a cached function `_is_sensitive_key_cached`:
```python
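from functools import cache

@cache
def _is_sensitive_key_cached(key: str) -> bool:
    # Memoize the result per distinct key string; delegates to the original check.
    return _is_sensitive_key(key)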
```
## Why This Works
According to the line profiler results, the original code spent **47.4% of total time** (12.22ms out of 25.76ms) calling `_is_sensitive_key(key)` on the line that checks sensitivity. In the optimized version, this drops to **34.8%** (7.52ms out of 21.57ms), a reduction of ~4.7ms even though the line is hit the same number of times (4,400 hits).
The reason: `_is_sensitive_key` performs expensive operations on every call:
1. **String lowercasing**: `key.lower()`
2. **Set membership check**: `key_lower in SENSITIVE_KEY_NAMES`
3. **Regex matching**: `SENSITIVE_KEYS_PATTERN.match(key_lower)`
When sanitizing nested data structures, the **same keys appear repeatedly** across different dictionaries (e.g., "api_key", "password", "username"). Without caching, each occurrence re-executes all three operations. With `@cache`, after the first check, subsequent lookups for the same key are O(1) dictionary lookups returning the cached boolean result.
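The effect described above can be sketched as follows. The contents of `SENSITIVE_KEY_NAMES` and `SENSITIVE_KEYS_PATTERN` are hypothetical stand-ins, since the real definitions are not shown in this description; only the three-step shape of the check and the `@cache` wrapper come from the text.

```python
import re
from functools import cache

# Hypothetical stand-ins for the module's real constants (not shown in this PR).
SENSITIVE_KEY_NAMES = frozenset({"api_key", "password", "token", "secret"})
SENSITIVE_KEYS_PATTERN = re.compile(r".*(api[_-]?key|password|secret|token).*")


def _is_sensitive_key(key: str) -> bool:
    """Uncached check: lowercase, set membership, then regex match."""
    key_lower = key.lower()
    if key_lower in SENSITIVE_KEY_NAMES:
        return True
    return bool(SENSITIVE_KEYS_PATTERN.match(key_lower))


@cache
def _is_sensitive_key_cached(key: str) -> bool:
    """Memoized wrapper: each distinct key string is evaluated only once."""
    return _is_sensitive_key(key)


# 300 dicts sharing the same two keys: 600 lookups, but only 2 real evaluations.
records = [{"api_key": "abc", "username": "u"} for _ in range(300)]
flags = [_is_sensitive_key_cached(k) for record in records for k in record]

info = _is_sensitive_key_cached.cache_info()
print(info.hits, info.misses)  # 598 2
```

`cache_info()` is provided by `functools` on any cached function, so the hit and miss counts can be inspected the same way when profiling the real code.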
## Test Case Performance
The optimization particularly excels in test cases with:
- **Repeated key names**: `test_large_flat_dict_many_sensitive_keys` (500 dicts with the same key pattern)
- **Nested structures with common keys**: `test_large_nested_structure` (100 users each with "username" and "token")
- **Batch processing**: `test_large_list_of_dicts_with_sensitive_keys` (300 dicts with identical "api_key")
For workloads where every key is unique, the cache provides little benefit, but the added cost is small: one failed hash-table lookup plus an insertion per key.
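To illustrate why nested structures with shared keys benefit, here is a minimal sketch of a recursive sanitizer built on a cached key check. The names `sanitize_sketch`, `_is_sensitive`, and `MASK`, and the simplified substring rule, are all hypothetical; this is not the actual `sanitize_data` implementation.

```python
from functools import cache
from typing import Any

MASK = "***REDACTED***"  # hypothetical mask value


@cache
def _is_sensitive(key: str) -> bool:
    # Simplified, hypothetical rule; the real check uses a name set and a regex.
    lowered = key.lower()
    return any(marker in lowered for marker in ("key", "password", "token", "secret"))


def sanitize_sketch(data: Any) -> Any:
    """Return a copy of `data` with values under sensitive keys masked."""
    if isinstance(data, dict):
        return {
            k: MASK if isinstance(k, str) and _is_sensitive(k) else sanitize_sketch(v)
            for k, v in data.items()
        }
    if isinstance(data, list):
        return [sanitize_sketch(item) for item in data]
    return data


# 100 users, each with the same two keys, mirroring the nested test case above.
users = {"users": [{"username": f"user{i}", "token": f"t{i}"} for i in range(100)]}
clean = sanitize_sketch(users)

info = _is_sensitive.cache_info()
print(clean["users"][0])       # {'username': 'user0', 'token': '***REDACTED***'}
print(info.hits, info.misses)  # 198 3
```

Only three distinct keys (`users`, `username`, `token`) are ever evaluated, so 198 of the 201 key checks are served from the cache.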
## Impact Considerations
Since `function_references` is unavailable, we can't definitively assess hot path placement. However, given that this is a `sanitize_data` function in a database transactions model, it's likely called:
- Before logging/auditing database operations
- In API response sanitization
- During error reporting
If called in loops or high-frequency endpoints, the per-call savings add up across every invocation. The optimization is **safe** because:
1. `_is_sensitive_key` is a pure function (deterministic based on input)
2. Cache memory grows only with the number of distinct key names seen at runtime (typically dozens, not millions), although `@cache` itself imposes no size limit
3. No behavioral changes - same output guaranteed
## Codecov Report

❌ Your project status has failed because the head coverage (39.50%) is below the target coverage (60.00%). You can increase the head coverage or adjust the target coverage.

Coverage diff:

| | main | #11190 | +/- |
|---|---|---|---|
| Coverage | 33.23% | 33.36% | +0.12% |
| Files | 1394 | 1399 | +5 |
| Lines | 66070 | 66232 | +162 |
| Branches | 9778 | 9785 | +7 |
| Hits | 21960 | 22098 | +138 |
| Misses | 42983 | 43009 | +26 |
| Partials | 1127 | 1125 | -2 |
⚡️ This pull request contains optimizations for PR #10820

If you approve this dependent PR, these changes will be merged into the original PR branch `cz/add-logs-feature`.

📄 239% (2.39x) speedup for `sanitize_data` in `src/backend/base/langflow/services/database/models/transactions/model.py`

⏱️ Runtime: 5.73 milliseconds → 1.69 milliseconds (best of 66 runs)
To edit these changes, run `git checkout codeflash/optimize-pr10820-2026-01-05T13.41.01` and push.