⚡️ Speed up function _get_distribution_version by 358% in PR #9192 (add-deps-metadata) #9531

Closed
codeflash-ai[bot] wants to merge 20 commits into main from codeflash/optimize-pr9192-2025-08-25T23.06.57

codeflash-ai[bot] commented Aug 25, 2025

⚡️ This pull request contains optimizations for PR #9192

If you approve this dependent PR, these changes will be merged into the original PR branch add-deps-metadata.

This PR will be automatically closed if the original PR is merged.


📄 358% (3.58x) speedup for _get_distribution_version in langflow/custom/dependency_analyzer.py

⏱️ Runtime: 2.60 milliseconds → 567 microseconds (best of 5 runs)
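
As a quick check on the headline figure, the sketch below recomputes the speedup from the two reported runtimes. It assumes the percentage is defined as (original − optimized) / optimized, which is inferred from the numbers and not stated in the report. The displayed runtimes are rounded, so the result only approximates the reported 358%.

```python
# Reported runtimes, converted to microseconds (both values are rounded in the report).
original_us = 2.60 * 1000   # 2.60 milliseconds
optimized_us = 567.0        # 567 microseconds

# Assumed definition: time saved relative to the optimized runtime.
gain = (original_us - optimized_us) / optimized_us

# Plain ratio of the two runtimes, for comparison.
ratio = original_us / optimized_us

print(f"relative gain: {gain:.1%}")
print(f"runtime ratio: {ratio:.2f}x")
```

Under this definition the relative gain and the runtime ratio differ by exactly one, which is worth keeping in mind when comparing a percentage speedup with an "x times faster" figure.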

📝 Explanation and details

The optimized code introduces a two-tier caching strategy that significantly reduces expensive metadata lookups. The key optimization is adding a new cached function _get_distribution_version_by_distname() that caches version lookups by distribution name.

What changed:

  • Added _get_distribution_version_by_distname() with its own @lru_cache(maxsize=128) to cache md.distribution(dist_name).version calls
  • Modified _get_distribution_version() to call this new cached function instead of directly accessing md.distribution()

Why this is faster:
In the original code, every import name that resolved to the same distribution triggered its own md.distribution(dist_name).version call, which reads package metadata from the filesystem. For example, if both "PIL" and "pillow" map to the "Pillow" distribution, the original code would fetch the version twice.

The optimization removes this redundancy by caching at the distribution level. Once a distribution's version is cached, any other import name that maps to the same distribution is served from the cache instead of rereading metadata from the filesystem.
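
The redundancy can be made visible with a call counter. The sketch below uses a hypothetical in-memory mapping and a fake lookup in place of `md.distribution()`. Every name in it is illustrative and the version string is made up.

```python
from functools import lru_cache

# Hypothetical mapping: two import names provided by one distribution.
IMPORT_TO_DIST = {"PIL": "Pillow", "pillow": "Pillow"}

# Made-up version data standing in for installed package metadata.
FAKE_METADATA = {"Pillow": "1.2.3"}

lookup_calls = {"count": 0}


def expensive_version_lookup(dist_name):
    # Stand-in for the metadata read; counts how often it runs.
    lookup_calls["count"] += 1
    return FAKE_METADATA.get(dist_name)


@lru_cache(maxsize=128)
def version_by_distname(dist_name):
    # Cached per distribution name.
    return expensive_version_lookup(dist_name)


@lru_cache(maxsize=128)
def version_for_import(import_name):
    # Cached per import name; delegates to the per-distribution cache.
    dist_name = IMPORT_TO_DIST.get(import_name)
    return version_by_distname(dist_name) if dist_name else None


first = version_for_import("PIL")
second = version_for_import("pillow")
print(first, second, lookup_calls["count"])
```

With only the import-name cache, the two calls would each reach the expensive lookup. With the per-distribution cache in between, the second call is a miss on the outer cache but a hit on the inner one.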

Performance characteristics:

  • 358% speedup (2.60 ms → 567 μs, best of 5 runs), attributed to eliminating redundant metadata calls
  • Most effective for workloads with multiple import names mapping to the same distributions
  • The test cases show this optimization particularly benefits scenarios with repeated lookups and large-scale operations where the same distributions are queried multiple times through different import names
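
A "best of N runs" measurement of this kind can be approximated with the standard library. The sketch below is not Codeflash's benchmarking method. It is a generic `timeit` harness, and `slow_lookup` is a hypothetical stand-in for an uncached metadata read.

```python
import timeit
from functools import lru_cache


def slow_lookup(name):
    # Hypothetical stand-in for an expensive, uncached lookup.
    return sum(ord(ch) for ch in name * 200)


# Same function wrapped in a cache, as in the optimized code path.
cached_lookup = lru_cache(maxsize=128)(slow_lookup)


def best_of(func, runs=5, calls=1000):
    # Repeat the measurement and keep the fastest run to reduce noise.
    timings = timeit.repeat(lambda: func("Pillow"), repeat=runs, number=calls)
    return min(timings)


uncached_seconds = best_of(slow_lookup)
cached_seconds = best_of(cached_lookup)

print(f"uncached: {uncached_seconds:.6f} s per 1000 calls")
print(f"cached:   {cached_seconds:.6f} s per 1000 calls")
```

Taking the minimum over several runs reports the least disturbed measurement, which is why the fastest run is used here.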

The caching hierarchy now works as: import_name → distribution_name → cached_version, making repeated distribution version lookups nearly instantaneous.
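
One consequence of a layered cache is that both layers need clearing if package metadata changes while the process is running. The sketch below illustrates this with hypothetical names and an in-memory dictionary in place of real metadata. Whether the PR exposes a comparable helper is not stated.

```python
from functools import lru_cache

# Hypothetical stand-in for installed metadata; mutated below to mimic an upgrade.
installed = {"foodist": "1.0.0"}


@lru_cache(maxsize=128)
def version_by_distname(dist_name):
    return installed.get(dist_name)


@lru_cache(maxsize=128)
def version_for_import(import_name):
    dist_name = {"foo": "foodist"}.get(import_name)
    return version_by_distname(dist_name) if dist_name else None


def clear_version_caches():
    # Clear both tiers; clearing only one can leave a stale value in the other.
    version_for_import.cache_clear()
    version_by_distname.cache_clear()


before = version_for_import("foo")
installed["foodist"] = "2.0.0"      # simulate an upgrade during the process lifetime
stale = version_for_import("foo")   # still served from the cache
clear_version_caches()
fresh = version_for_import("foo")

print(before, stale, fresh)
```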

Correctness verification report:

| Test | Status |
|------|--------|
| ⏪ Replay Tests | 🔘 None Found |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 1264 Passed |
| 📊 Tests Coverage | 69.2% |

🌀 Generated Regression Tests and Runtime
from __future__ import annotations

import importlib.metadata as md
import sys
from functools import lru_cache

# imports
import pytest  # used for our unit tests
from langflow.custom.dependency_analyzer import _get_distribution_version

# unit tests

# Helper to get a known installed package and its version
def _get_known_package_and_version():
    # We'll use 'pytest' if available, else fallback to 'pip', else 'setuptools'
    for pkg in ('pytest', 'pip', 'setuptools'):
        try:
            version = md.version(pkg)
            # Find at least one import name for this distribution
            for import_name, dists in md.packages_distributions().items():
                if pkg in dists:
                    return import_name, version, pkg
        except md.PackageNotFoundError:
            continue
    pytest.skip("No known package found for testing")

# 1. Basic Test Cases

def test_known_package_returns_correct_version():
    """Test that a known installed package returns the correct version."""
    import_name, expected_version, dist_name = _get_known_package_and_version()
    codeflash_output = _get_distribution_version(import_name); result = codeflash_output

def test_known_package_returns_string_version():
    """Test that the version returned is a non-empty string."""
    import_name, expected_version, dist_name = _get_known_package_and_version()
    codeflash_output = _get_distribution_version(import_name); result = codeflash_output

def test_nonexistent_import_name_returns_none():
    """Test that a completely unknown import name returns None."""
    codeflash_output = _get_distribution_version("this_import_name_should_not_exist_12345"); result = codeflash_output


def test_empty_string_import_name():
    """Test that an empty string import name returns None."""
    codeflash_output = _get_distribution_version(""); result = codeflash_output

def test_import_name_with_special_characters():
    """Test that a weird import name with special characters returns None."""
    codeflash_output = _get_distribution_version("!@#$%^&*()"); result = codeflash_output



def test_distribution_removed_after_cache():
    """Test behavior if a distribution is uninstalled after being cached."""
    # This is hard to simulate without modifying the environment.
    # Instead, we simulate by forcibly clearing the cache and checking for a known missing import.
    _get_distribution_version.cache_clear()
    codeflash_output = _get_distribution_version("definitely_not_installed_987654321"); result = codeflash_output

def test_case_sensitivity_of_import_name():
    """Test that import names are case sensitive (Python import system is)."""
    import_name, expected_version, dist_name = _get_known_package_and_version()
    # Change case
    altered_name = import_name.upper() if import_name.lower() != import_name else import_name.capitalize()
    if altered_name != import_name:
        codeflash_output = _get_distribution_version(altered_name); result = codeflash_output
    else:
        pytest.skip("Import name is already uppercase or capitalized, cannot test case sensitivity.")


def test_import_name_with_leading_trailing_spaces():
    """Test import names with leading/trailing whitespace."""
    import_name, expected_version, dist_name = _get_known_package_and_version()
    codeflash_output = _get_distribution_version(" " + import_name + " "); result = codeflash_output

# 3. Large Scale Test Cases


def test_cache_behavior_large_scale():
    """Test that repeated calls with the same import name are cached and consistent."""
    import_name, expected_version, dist_name = _get_known_package_and_version()
    for _ in range(100):
        codeflash_output = _get_distribution_version(import_name); result = codeflash_output

def test_large_number_of_nonexistent_import_names():
    """Test that many nonexistent import names all return None efficiently."""
    for i in range(1000):
        name = f"nonexistent_import_{i}_abcdef"
        codeflash_output = _get_distribution_version(name); result = codeflash_output


#------------------------------------------------
from __future__ import annotations

import importlib
import importlib.metadata as md
import sys
from functools import lru_cache

# imports
import pytest  # used for our unit tests
from langflow.custom.dependency_analyzer import _get_distribution_version

# unit tests

# ----------- BASIC TEST CASES -----------

def test_known_standard_library_module_returns_none():
    # Standard library modules (not installed via pip) should return None
    codeflash_output = _get_distribution_version("sys")
    codeflash_output = _get_distribution_version("os")

def test_known_installed_package_returns_version():
    # Test with a package that is almost always installed (pytest itself)
    codeflash_output = _get_distribution_version("pytest"); version = codeflash_output
    # If pytest is not installed as a distribution, skip the test
    if version is None:
        pytest.skip("pytest is not installed as a distribution")

def test_known_installed_package_alias():
    # Some packages have different import and distribution names, e.g. 'PIL' for 'Pillow'
    # Try to detect if Pillow is installed
    codeflash_output = _get_distribution_version("PIL"); version = codeflash_output
    if version is None:
        pytest.skip("Pillow (PIL) is not installed")

def test_nonexistent_import_name_returns_none():
    # Should return None for a nonsense import name
    codeflash_output = _get_distribution_version("this_package_does_not_exist_12345")

# ----------- EDGE TEST CASES -----------

def test_empty_string_import_name_returns_none():
    # Should return None for empty string
    codeflash_output = _get_distribution_version("")

def test_import_name_with_special_characters_returns_none():
    # Import names can't have spaces or special chars
    codeflash_output = _get_distribution_version("foo bar")
    codeflash_output = _get_distribution_version("foo-bar")
    codeflash_output = _get_distribution_version("foo/bar")
    codeflash_output = _get_distribution_version("foo@bar")

def test_case_sensitivity():
    # Import names are case-sensitive
    # Try with a common package in wrong case
    codeflash_output = _get_distribution_version("pytest"); version_lower = codeflash_output
    codeflash_output = _get_distribution_version("PyTest"); version_upper = codeflash_output
    # If lower-case version exists, upper-case should not
    if version_lower is not None:
        pass

def test_builtin_module_returns_none():
    # Built-in modules like 'builtins' should return None
    codeflash_output = _get_distribution_version("builtins")

def test_submodule_name_returns_correct_version():
    # Some packages have submodules, e.g. 'pytest' has 'pytest.mark'
    # Should return the version of the parent distribution if submodule is mapped
    # This is not always present, but let's check for common ones
    codeflash_output = _get_distribution_version("pytest.mark"); version = codeflash_output
    if version is not None:
        pass




def test_large_number_of_fake_distributions(monkeypatch):
    # Simulate a large number of distributions and test lookup performance
    # We'll monkeypatch _get_packages_distributions to return a large dict
    fake_reverse_map = {f"mod{i}": [f"dist{i}"] for i in range(1000)}
    def fake_packages_distributions():
        return fake_reverse_map
    monkeypatch.setattr(
        "__main__._get_packages_distributions",
        lru_cache(maxsize=1)(fake_packages_distributions),
        raising=False
    )
    class DummyDist:
        def __init__(self, version):
            self.version = version
    # monkeypatch restores md.distribution at teardown, even if the test fails
    monkeypatch.setattr(md, "distribution", lambda name: DummyDist(f"v{name[-3:]}"))
    # Test random lookups
    for i in range(0, 1000, 100):
        mod = f"mod{i}"
        # Mirror the fake: "v" plus the last three characters of the distribution name
        expected_version = "v" + f"dist{i}"[-3:]
        codeflash_output = _get_distribution_version(mod)
    # Test a missing module
    codeflash_output = _get_distribution_version("not_a_real_mod")

def test_caching_behavior(monkeypatch):
    # Test that repeated calls are fast and return the same result
    # We'll monkeypatch to count calls
    call_count = {"count": 0}
    def fake_packages_distributions():
        call_count["count"] += 1
        return {"foo": ["foodist"]}
    monkeypatch.setattr(
        "__main__._get_packages_distributions",
        lru_cache(maxsize=1)(fake_packages_distributions),
        raising=False
    )
    class DummyDist:
        version = "9.9.9"
    # monkeypatch restores md.distribution at teardown, even if the test fails
    monkeypatch.setattr(md, "distribution", lambda name: DummyDist())
    # First call
    codeflash_output = _get_distribution_version("foo")
    # Second call should use cache
    codeflash_output = _get_distribution_version("foo")

def test_many_unique_import_names(monkeypatch):
    # Test that the lru_cache does not break when many unique names are queried
    fake_reverse_map = {f"modx{i}": [f"distx{i}"] for i in range(128)}
    def fake_packages_distributions():
        return fake_reverse_map
    monkeypatch.setattr(
        "__main__._get_packages_distributions",
        lru_cache(maxsize=1)(fake_packages_distributions),
        raising=False
    )
    class DummyDist:
        def __init__(self, version):
            self.version = version
    # monkeypatch restores md.distribution at teardown, even if the test fails
    monkeypatch.setattr(md, "distribution", lambda name: DummyDist(f"v{name[-3:]}"))
    # Query 128 unique names (equal to cache size)
    for i in range(128):
        mod = f"modx{i}"
        # Mirror the fake: "v" plus the last three characters of the distribution name
        expected_version = "v" + f"distx{i}"[-3:]
        codeflash_output = _get_distribution_version(mod)
    # Query one more to test cache eviction policy (should still work)
    codeflash_output = _get_distribution_version("not_in_map")
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-pr9192-2025-08-25T23.06.57 and push.

ogabrielluiz and others added 19 commits July 25, 2025 12:38
- Introduced `dependency_analyzer.py` to analyze and classify dependencies in Python code.
- Implemented functions to extract import information and categorize dependencies as standard library, local, or external.
- Enhanced `build_component_metadata` to include dependency analysis results in component metadata.
- Added unit tests to validate the functionality of the dependency analysis features.
…local imports

- Updated `dependency_analyzer.py` to focus on external dependencies only, removing standard library and local imports from analysis results.
- Simplified the `DependencyInfo` class by eliminating unnecessary attributes and adjusting the deduplication logic.
- Modified `build_component_metadata` to reflect changes in dependency structure, removing counts for stdlib and local dependencies.
- Enhanced unit tests to validate the new filtering behavior and ensure no duplicates in external dependencies.
- Added dependency sections to multiple starter project JSON files, specifying required packages and their versions.
- Included `langflow` version `1.5.0.post1` and other relevant dependencies such as `orjson`, `fastapi`, and `pydantic` across various projects.
- Enhanced project metadata to improve clarity on external dependencies for better maintainability and user guidance.
…dd-deps-metadata`) (#9193)

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
…ibution_version`

- Updated `_get_distribution_version` function to return the distribution version after successfully retrieving it, addressing a potential issue where `None` could be returned prematurely.
* fix: add security warning for overriding code field in tweaks

* test: add tests for preventing code field overrides in tweaks
* Refactor vectorstore components structure

Moved vectorstore components for Chroma, ClickHouse, Couchbase, DataStax, Elastic, Milvus, MongoDB, Pinecone, Qdrant, Supabase, Upstash, Vectara, and Weaviate into dedicated subfolders with __init__.py files for each. Updated Redis vectorstore implementation to reside in redis.py and removed the old vectorstores/redis.py. Adjusted starter project JSONs and frontend constants to reflect new module paths and sidebar entries for these vectorstores.

* Refactor vectorstore components and add lazy imports

Moved Datastax-related files from vectorstores to a dedicated datastax directory. Added lazy import logic to __init__.py files for chroma, clickhouse, couchbase, elastic, milvus, mongodb, pinecone, qdrant, supabase, upstash, vectara, and weaviate components. Cleaned up vectorstores/__init__.py to only include local and faiss components, improving modularity and import efficiency.

* [autofix.ci] apply automated fixes

* Refactor vectorstore components structure

Moved FAISS, Cassandra, and pgvector components to dedicated subdirectories with lazy-loading __init__.py files. Updated imports and references throughout the backend and frontend to reflect new locations. Removed obsolete datastax Cassandra component. Added new sidebar bundle entries for FAISS, Cassandra, and pgvector in frontend constants and style utilities.

* Add lazy imports and Redis chat memory component

Refactored the Redis module to support lazy imports for RedisIndexChatMemory and RedisVectorStoreComponent, improving import efficiency. Added a new redis_chat.py file implementing RedisIndexChatMemory for chat message storage and retrieval using Redis.

* Fix vector store astra imports

* Revert package lock changes

* More test fixes

* Update test_vector_store_rag.py

* Update test_dynamic_imports.py

* Update vector_store_rag.py

* Update test_dynamic_imports.py

* Refactor the cassandra chat component

* Fix frontend tests for bundle

* Mark Local DB as legacy

* Update inputComponent.spec.ts

* [autofix.ci] apply automated fixes

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Hare <ericrhare@gmail.com>
Co-authored-by: Carlos Coelho <80289056+carlosrcoelho@users.noreply.github.com>
…(`add-deps-metadata`)

codeflash-ai[bot] added the "⚡️ codeflash" label (Optimization PR opened by Codeflash AI) Aug 25, 2025

coderabbitai Bot commented Aug 25, 2025

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Base automatically changed from add-deps-metadata to main August 25, 2025 23:51
@codeflash-ai codeflash-ai Bot deleted the codeflash/optimize-pr9192-2025-08-25T23.06.57 branch August 25, 2025 23:52