perf: optimize technology detection with content caching and pre-discovery#421
perf: optimize technology detection with content caching and pre-discovery#421yacosta738 wants to merge 1 commit into
Conversation
…overy - Implement per-project content cache in `CatalogDrivenDetector` to ensure shared config files (e.g. setup.py, package.json) are read from disk at most once. - Pre-discover Gradle layout files during the primary `RepoMetadata` walk, eliminating redundant O(Rules * Files) filesystem checks and dynamic path constructions in the detection loop. - Pre-compile `PathBuf`s for detection rules to avoid repeated heap allocations and string parsing during rule evaluation. - Document performance learning in `.agents/journal/bolt.md`. These optimizations move technology detection from disk-bound to memory-bound, significantly improving the performance of the `skill suggest` command in large or nested projects.
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
📝 WalkthroughSummary by CodeRabbit
WalkthroughThis PR optimizes technology detection by introducing per-project content caching and pre-discovering Gradle layout markers during the initial filesystem walk. RepoMetadata now collects Gradle files, the detection pipeline maintains a shared content cache cleared between nested projects, and rule evaluation leverages cached file content to reduce repeated disk reads. ChangesTechnology Detection Caching and Gradle Pre-Discovery
🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| Some(cached) | ||
| } else { | ||
| let absolute = project_root.join(file_path); | ||
| if let Ok(content) = fs::read_to_string(absolute) { |
There was a problem hiding this comment.
Semgrep identified a blocking 🔴 issue in your code:
File path constructed from untrusted configuration rules allows path traversal attack, enabling attackers to read arbitrary files outside the intended directory.
More details about this
The code reads file content from a path constructed by joining project_root with file_path, where file_path comes from untrusted content_rules that may be loaded from external configuration. An attacker can craft a malicious content_rules configuration containing path traversal sequences (like ../../../etc/passwd) in the file paths. When project_root.join(file_path) is called, these sequences allow escaping the intended directory boundary and reading arbitrary files on the system. For example, if project_root is /home/app/config and file_path is ../../../etc/passwd, the code will read /etc/passwd instead of staying within the config directory. The application then searches patterns against this file content, potentially exposing sensitive data like credentials or system information to an attacker who controls the rule configuration.
Dataflow graph
flowchart LR
classDef invis fill:white, stroke: none
classDef default fill:#e7f5ff, color:#1c7fd6, stroke: none
subgraph File0["<b>src/skills/detect.rs</b>"]
direction LR
%% Source
subgraph Source
direction LR
v0["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L335 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 335] project_root</a>"]
end
%% Intermediate
subgraph Traces0[Traces]
direction TB
v2["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L335 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 335] metadata</a>"]
v3["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L358 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 358] &</a>"]
v4["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L358 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 358] in</a>"]
v5["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L358 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 358] rel_nested_dir</a>"]
v6["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L359 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 359] nested_dir</a>"]
v7["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L375 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 375] &</a>"]
v8["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L374 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 374] evaluate_rules</a>"]
v9["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L421 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 421] project_root</a>"]
v10["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L482 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 482] absolute</a>"]
end
v2 --> v3
v3 --> v4
v4 --> v5
v5 --> v6
v6 --> v7
v7 --> v8
v8 --> v9
v9 --> v10
%% Sink
subgraph Sink
direction LR
v1["<a href=https://github.com/dallay/agentsync/blob/564017c1055a4a806d0d9bf8e92d60869fe8220a/src/skills/detect.rs#L483 target=_blank style='text-decoration:none; color:#1c7fd6'>[Line: 483] absolute</a>"]
end
end
%% Class Assignment
Source:::invis
Sink:::invis
Traces0:::invis
File0:::invis
%% Connections
Source --> Traces0
Traces0 --> Sink
To resolve this comment:
✨ Commit fix suggestion
- Normalize and validate the file path to prevent path traversal before using it with
fs::read_to_stringor similar file system functions. - Use a helper function like
canonical_existing_pathto canonicalize and check the path. You can do this by adding:let file_path = canonical_existing_path(file_path).ok()?;before any code that reads the file, such asfs::read_to_string(file_path). - Only proceed with file operations if the call to
canonical_existing_pathreturnsSome(path), meaning the path is valid and canonical. - Make sure that all dynamic path construction using user input (such as joins with user-derived values) passes through this validation step before being used by the application.
Canonicalizing the path with canonical_existing_path ensures that attacker-controlled input cannot break out of the intended directory, which helps to prevent path traversal vulnerabilities.
💬 Ignore this finding
Reply with Semgrep commands to ignore this finding.
/fp <comment>for false positive/ar <comment>for acceptable risk/other <comment>for all other reasons
Alternatively, triage in Semgrep AppSec Platform to ignore the finding created by tainted-path.
You can view more details about this finding in the Semgrep AppSec Platform.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.agents/journal/bolt.md:
- Around line 156-157: The Markdown heading "## 2026-05-24 - Content Caching and
Gradle Discovery for Tech Detection" is missing a blank line beneath it; update
the .agents/journal/bolt.md entry by inserting a single blank line immediately
after that heading to satisfy MD022 linting (ensure the heading and the
following paragraph are separated by one empty line).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 04ba79d7-eda9-431c-8b91-e3f51e173b18
📒 Files selected for processing (2)
.agents/journal/bolt.mdsrc/skills/detect.rs
| ## 2026-05-24 - Content Caching and Gradle Discovery for Tech Detection | ||
| **Learning:** Even with single-pass metadata, technology detection was performing redundant I/O and allocations by re-reading shared configuration files (e.g., `setup.py`) and re-calculating Gradle layout paths for every matching rule. This created an \(O(T \times F)\) overhead in I/O and heap allocations. |
There was a problem hiding this comment.
Add a blank line after the heading to satisfy Markdown linting.
Line 156 is missing the required blank line below the heading (MD022), which can keep doc-lint warnings noisy in CI.
Suggested fix
## 2026-05-24 - Content Caching and Gradle Discovery for Tech Detection
+
**Learning:** Even with single-pass metadata, technology detection was performing redundant I/O and allocations by re-reading shared configuration files (e.g., `setup.py`) and re-calculating Gradle layout paths for every matching rule. This created an \(O(T \times F)\) overhead in I/O and heap allocations.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ## 2026-05-24 - Content Caching and Gradle Discovery for Tech Detection | |
| **Learning:** Even with single-pass metadata, technology detection was performing redundant I/O and allocations by re-reading shared configuration files (e.g., `setup.py`) and re-calculating Gradle layout paths for every matching rule. This created an \(O(T \times F)\) overhead in I/O and heap allocations. | |
| ## 2026-05-24 - Content Caching and Gradle Discovery for Tech Detection | |
| **Learning:** Even with single-pass metadata, technology detection was performing redundant I/O and allocations by re-reading shared configuration files (e.g., `setup.py`) and re-calculating Gradle layout paths for every matching rule. This created an \(O(T \times F)\) overhead in I/O and heap allocations. |
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 156-156: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.agents/journal/bolt.md around lines 156 - 157, The Markdown heading "##
2026-05-24 - Content Caching and Gradle Discovery for Tech Detection" is missing
a blank line beneath it; update the .agents/journal/bolt.md entry by inserting a
single blank line immediately after that heading to satisfy MD022 linting
(ensure the heading and the following paragraph are separated by one empty
line).



Optimized the technology detection system by implementing content caching for shared configuration files and pre-discovering layout-specific markers (like Gradle files). This eliminates redundant filesystem I/O and heap allocations during the evaluation of detection rules, making the
agentsync skill suggestcommand measurably faster.PR created automatically by Jules for task 9856788744421460774 started by @yacosta738