Skip to content

perf: optimize technology detection with content caching and pre-discovery#421

Draft
yacosta738 wants to merge 1 commit into
mainfrom
perf-tech-detection-caching-9856788744421460774
Draft

perf: optimize technology detection with content caching and pre-discovery#421
yacosta738 wants to merge 1 commit into
mainfrom
perf-tech-detection-caching-9856788744421460774

Conversation

@yacosta738
Copy link
Copy Markdown
Contributor

Optimized the technology detection system by implementing content caching for shared configuration files and pre-discovering layout-specific markers (like Gradle files). This eliminates redundant filesystem I/O and heap allocations during the evaluation of detection rules, making the agentsync skill suggest command measurably faster.


PR created automatically by Jules for task 9856788744421460774 started by @yacosta738

…overy

- Implement per-project content cache in `CatalogDrivenDetector` to
  ensure shared config files (e.g. setup.py, package.json) are read from
  disk at most once.
- Pre-discover Gradle layout files during the primary `RepoMetadata`
  walk, eliminating redundant O(Rules * Files) filesystem checks and
  dynamic path constructions in the detection loop.
- Pre-compile `PathBuf`s for detection rules to avoid repeated heap
  allocations and string parsing during rule evaluation.
- Document performance learning in `.agents/journal/bolt.md`.

These optimizations move technology detection from disk-bound to
memory-bound, significantly improving the performance of the
`skill suggest` command in large or nested projects.
@google-labs-jules
Copy link
Copy Markdown
Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 17, 2026

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • Refactor
    • Improved technology detection performance through per-project content caching, eliminating repeated file reads when evaluating multiple technology rules.
    • Optimized project scanning efficiency by pre-discovering layout-specific markers during the initial scan, reducing redundant disk I/O and dynamic path construction during rule evaluation.

Walkthrough

This PR optimizes technology detection by introducing per-project content caching and pre-discovering Gradle layout markers during the initial filesystem walk. RepoMetadata now collects Gradle files, the detection pipeline maintains a shared content cache cleared between nested projects, and rule evaluation leverages cached file content to reduce repeated disk reads.

Changes

Technology Detection Caching and Gradle Pre-Discovery

Layer / File(s) Summary
Gradle file pre-discovery in metadata collection
src/skills/detect.rs
RepoMetadata adds a gradle_files field to capture Gradle-related files (build.gradle, settings.gradle) discovered at root level and selected depth-2 paths during the bounded WalkDir walk, replacing legacy root-dir tracking.
Content cache threading through detection pipeline
src/skills/detect.rs
CatalogDrivenDetector::detect initializes a per-project content_cache and passes it to both root and nested evaluate_rules() invocations; cache is cleared between nested projects to isolate config content per project.
Rule compilation and cache-aware evaluation
src/skills/detect.rs
CompiledConfigFileContentRules converts files from Option<Vec<String>> to Option<Vec<PathBuf>> at compile time. evaluate_rules() gains a content_cache parameter and implements cache-before-read: checks the cache, reads and inserts on miss, then applies configured regex patterns to cached content.
File gathering using pre-discovered gradle_files
src/skills/detect.rs
gather_content_scan_files() is updated to extend files from metadata.gradle_files directly when scan_gradle_layout is enabled, with deduplication and existence checks applied to explicit file inclusion.
Documentation
.agents/journal/bolt.md
Journal entry for 2026-05-24 documents the optimization strategy: per-project content caching and pre-discovering layout markers during the primary walk to reduce disk I/O and allocation overhead.

🎯 2 (Simple) | ⏱️ ~12 minutes

🐰 A rabbit hops through the code,
Pre-caching files on the road,
Less disk reads, more speed,
Gradle files we'll indeed,
Discover once, reuse with mode! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'perf: optimize technology detection with content caching and pre-discovery' accurately reflects the main changes: implementing content caching for shared config files and pre-discovering layout-specific markers to optimize technology detection performance.
Description check ✅ Passed The description is directly related to the changeset, explaining the implementation of content caching and pre-discovery mechanisms that correspond to the modifications in detect.rs and the journal entry.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf-tech-detection-caching-9856788744421460774

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread src/skills/detect.rs
Some(cached)
} else {
let absolute = project_root.join(file_path);
if let Ok(content) = fs::read_to_string(absolute) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semgrep identified a blocking 🔴 issue in your code:

File path constructed from untrusted configuration rules allows path traversal attack, enabling attackers to read arbitrary files outside the intended directory.

More details about this

The code reads file content from a path constructed by joining project_root with file_path, where file_path comes from untrusted content_rules that may be loaded from external configuration. An attacker can craft a malicious content_rules configuration containing path traversal sequences (like ../../../etc/passwd) in the file paths. When project_root.join(file_path) is called, these sequences allow escaping the intended directory boundary and reading arbitrary files on the system. For example, if project_root is /home/app/config and file_path is ../../../etc/passwd, the code will read /etc/passwd instead of staying within the config directory. The application then searches patterns against this file content, potentially exposing sensitive data like credentials or system information to an attacker who controls the rule configuration.

Dataflow graph
flowchart LR
    classDef invis fill:white, stroke: none
    classDef default fill:#e7f5ff, color:#1c7fd6, stroke: none

    subgraph File0["<b>src/skills/detect.rs</b>"]
        direction LR
        %% Source

        subgraph Source
            direction LR

            v0["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L335 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 335] project_root</a>"]
        end
        %% Intermediate

        subgraph Traces0[Traces]
            direction TB

            v2["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L335 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 335] metadata</a>"]

            v3["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L358 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 358] &</a>"]

            v4["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L358 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 358] in</a>"]

            v5["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L358 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 358] rel_nested_dir</a>"]

            v6["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L359 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 359] nested_dir</a>"]

            v7["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L375 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 375] &</a>"]

            v8["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L374 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 374] evaluate_rules</a>"]

            v9["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L421 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 421] project_root</a>"]

            v10["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L482 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 482] absolute</a>"]
        end
            v2 --> v3
            v3 --> v4
            v4 --> v5
            v5 --> v6
            v6 --> v7
            v7 --> v8
            v8 --> v9
            v9 --> v10
        %% Sink

        subgraph Sink
            direction LR

            v1["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L483 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 483] absolute</a>"]
        end
    end
    %% Class Assignment
    Source:::invis
    Sink:::invis

    Traces0:::invis
    File0:::invis

    %% Connections

    Source --> Traces0
    Traces0 --> Sink


Loading

To resolve this comment:

✨ Commit fix suggestion
  1. Normalize and validate the file path to prevent path traversal before using it with fs::read_to_string or similar file system functions.
  2. Use a helper function like canonical_existing_path to canonicalize and check the path. You can do this by adding: let file_path = canonical_existing_path(file_path).ok()?; before any code that reads the file, such as fs::read_to_string(file_path).
  3. Only proceed with file operations if the call to canonical_existing_path returns Some(path), meaning the path is valid and canonical.
  4. Make sure that all dynamic path construction using user input (such as joins with user-derived values) passes through this validation step before being used by the application.

Canonicalizing the path with canonical_existing_path ensures that attacker-controlled input cannot break out of the intended directory, which helps to prevent path traversal vulnerabilities.

💬 Ignore this finding

Reply with Semgrep commands to ignore this finding.

  • /fp <comment> for false positive
  • /ar <comment> for acceptable risk
  • /other <comment> for all other reasons

Alternatively, triage in Semgrep AppSec Platform to ignore the finding created by tainted-path.

You can view more details about this finding in the Semgrep AppSec Platform.

@sentry
Copy link
Copy Markdown

sentry Bot commented May 17, 2026

Codecov Report

❌ Patch coverage is 94.82759% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/skills/detect.rs 94.82% 3 Missing ⚠️

📢 Thoughts on this report? Let us know!

@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.agents/journal/bolt.md:
- Around line 156-157: The Markdown heading "## 2026-05-24 - Content Caching and
Gradle Discovery for Tech Detection" is missing a blank line beneath it; update
the .agents/journal/bolt.md entry by inserting a single blank line immediately
after that heading to satisfy MD022 linting (ensure the heading and the
following paragraph are separated by one empty line).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 04ba79d7-eda9-431c-8b91-e3f51e173b18

📥 Commits

Reviewing files that changed from the base of the PR and between da781df and 564017c.

📒 Files selected for processing (2)
  • .agents/journal/bolt.md
  • src/skills/detect.rs

Comment thread .agents/journal/bolt.md
Comment on lines +156 to +157
## 2026-05-24 - Content Caching and Gradle Discovery for Tech Detection
**Learning:** Even with single-pass metadata, technology detection was performing redundant I/O and allocations by re-reading shared configuration files (e.g., `setup.py`) and re-calculating Gradle layout paths for every matching rule. This created an \(O(T \times F)\) overhead in I/O and heap allocations.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a blank line after the heading to satisfy Markdown linting.

Line 156 is missing the required blank line below the heading (MD022), which can keep doc-lint warnings noisy in CI.

Suggested fix
 ## 2026-05-24 - Content Caching and Gradle Discovery for Tech Detection
+
 **Learning:** Even with single-pass metadata, technology detection was performing redundant I/O and allocations by re-reading shared configuration files (e.g., `setup.py`) and re-calculating Gradle layout paths for every matching rule. This created an \(O(T \times F)\) overhead in I/O and heap allocations.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
## 2026-05-24 - Content Caching and Gradle Discovery for Tech Detection
**Learning:** Even with single-pass metadata, technology detection was performing redundant I/O and allocations by re-reading shared configuration files (e.g., `setup.py`) and re-calculating Gradle layout paths for every matching rule. This created an \(O(T \times F)\) overhead in I/O and heap allocations.
## 2026-05-24 - Content Caching and Gradle Discovery for Tech Detection
**Learning:** Even with single-pass metadata, technology detection was performing redundant I/O and allocations by re-reading shared configuration files (e.g., `setup.py`) and re-calculating Gradle layout paths for every matching rule. This created an \(O(T \times F)\) overhead in I/O and heap allocations.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 156-156: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.agents/journal/bolt.md around lines 156 - 157, The Markdown heading "##
2026-05-24 - Content Caching and Gradle Discovery for Tech Detection" is missing
a blank line beneath it; update the .agents/journal/bolt.md entry by inserting a
single blank line immediately after that heading to satisfy MD022 linting
(ensure the heading and the following paragraph are separated by one empty
line).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant