Skip to content

feat(classification): implement file path classification for POSIX, Windows, and registry paths#121

Merged
unclesp1d3r merged 19 commits into
mainfrom
17-implement-file-path-classification-for-posix-windows-and-registry-paths
Jan 18, 2026
Merged

feat(classification): implement file path classification for POSIX, Windows, and registry paths#121
unclesp1d3r merged 19 commits into
mainfrom
17-implement-file-path-classification-for-posix-windows-and-registry-paths

Conversation

@unclesp1d3r
Copy link
Copy Markdown
Member

@unclesp1d3r unclesp1d3r commented Jan 17, 2026

Summary

This PR implements comprehensive file path and registry path classification for the semantic classifier, completing task 9.3 from the implementation plan. The changes add detection for POSIX paths, Windows paths, UNC network paths, and Windows registry paths with validation and suspicious path identification.

Impact: 10 files changed (+1,228 additions, -182 deletions)
Risk Level: Low - adds new functionality without modifying existing behavior
Review Time: ~15-20 minutes


What Changed

Source Changes

  • src/classification/semantic.rs (+655 lines): Core implementation of path classification
  • src/classification/mod.rs (+10 lines): Module documentation updates

Test Changes

  • tests/classification_integration.rs (+152 lines): Integration tests for all path types
  • tests/snapshots/classification_integration__classification_snapshots.snap (+43 lines): Snapshot tests

Documentation Changes

  • docs/src/classification.md (+86/-168 lines): Updated classification documentation
  • .github/copilot-instructions.md (+208 lines): New Copilot instructions file
  • AGENTS.md (+69 lines): AI agent guidelines
  • CLAUDE.md (+1 line): Claude configuration

Configuration Changes

  • .claude/settings.json (+5 lines): Claude Code plugin settings
  • .kiro/specs/stringy-binary-analyzer/tasks.md (+9 lines): Task tracking update

Implementation Details

Path Classification Features

Path Type Pattern Example
POSIX ^/[^\0\n\r]* /usr/bin/bash
Windows ^[A-Za-z]:\\[^\0\n\r]* C:\Windows\System32\cmd.exe
UNC ^\\\\[a-zA-Z0-9.-]+\\[^\0\n\r]* \\server\share\file.txt
Registry (full) ^HKEY_[A-Z_]+\\[^\0\n\r]* HKEY_LOCAL_MACHINE\SOFTWARE\...
Registry (abbrev) `^HK(LM CU

Validation Rules

  • Rejects null bytes, newlines, carriage returns
  • Rejects double path separators (// for POSIX, \\ for Windows after initial UNC prefix)
  • Enforces reasonable length limits (4096 max)
  • POSIX paths must be absolute (start with /)
  • Windows paths must use backslashes with valid drive letter
  • Registry paths validated against known root keys

Suspicious Path Detection

The classifier identifies known persistence/execution locations:

POSIX: /etc/cron.d/, /etc/init.d/, /tmp/, /var/tmp/, /Library/LaunchDaemons/

Windows: C:\Windows\System32\, C:\Windows\Temp\, ...\AppData\Roaming\...\Startup\

Registry: \SOFTWARE\Microsoft\Windows\CurrentVersion\Run, \System\CurrentControlSet\Services


Architecture

classify(&FoundString)
    |
    +-- classify_posix_path() --> is_valid_posix_path() --> Tag::FilePath
    |
    +-- classify_windows_path() --> is_valid_windows_path() --> Tag::FilePath
    |
    +-- classify_unc_path() --> UNC validation --> Tag::FilePath
    |
    +-- classify_registry_path() --> is_valid_registry_path() --> Tag::RegistryPath

All regex patterns are compiled once via lazy_static! and cached across classifier instances.


How Has This Been Tested?

Unit Tests (in semantic.rs)

  • POSIX absolute paths
  • POSIX home directories
  • POSIX paths with spaces
  • POSIX system directories
  • POSIX suspicious path detection
  • POSIX invalid paths (relative, null bytes)
  • Windows absolute paths
  • Windows Program Files paths
  • Windows paths with spaces
  • Windows different drive letters
  • Windows suspicious path detection
  • Windows case insensitivity
  • Windows invalid paths (forward slashes, invalid drives)
  • UNC paths with server/share
  • UNC paths with domain names
  • UNC invalid paths
  • Registry Run key paths
  • Registry Current User paths
  • Registry abbreviated forms (HKLM, HKCU, etc.)
  • Registry invalid root keys
  • Registry forward slash rejection

Integration Tests (in classification_integration.rs)

  • Mixed indicator classification
  • All path types classification
  • Performance test (1050 samples < 500ms)
  • Regex caching verification
  • False positive rejection
  • Format string exclusion
  • Snapshot tests for regression detection

Review Checklist

General

  • Code follows project style guidelines (ASCII only, no emojis)
  • Self-review completed
  • Comments added for complex logic
  • No debugging code left
  • No sensitive data exposed

Code Quality

  • No code duplication (shared validation helpers)
  • Functions are focused and small
  • Variable names are descriptive
  • Error handling returns Option<Tag> consistently
  • No performance bottlenecks (lazy_static caching)

Testing

  • All new code is covered by tests
  • Tests are meaningful and test edge cases
  • Snapshot tests for regression detection
  • Performance test ensures CI stability

Security

  • Input validation implemented (null bytes, length limits)
  • No SQL injection vulnerabilities (N/A)
  • Suspicious path detection aids security analysis

Breaking Changes

None. This PR adds new functionality without modifying existing behavior.


Dependencies

No new dependencies added. Uses existing:

  • lazy_static for regex caching
  • regex for pattern matching
  • insta for snapshot testing

Related Issues

Closes task 9.3: Implement file path classification

  • POSIX file path pattern matching
  • Windows file path pattern matching
  • Registry path detection
  • Unit tests for path classification

Generated with Claude Code

unclesp1d3r and others added 6 commits January 17, 2026 15:19
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
…d Windows

- Added POSIX and Windows file path pattern matching
- Included registry path detection
- Comprehensive unit and integration tests added

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Implement POSIX, Windows, and UNC path classification
- Add support for Windows registry path detection
- Update documentation and tests for new capabilities

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Fix MSRV reference from 1.91+ to 1.85+ in copilot instructions
- Reorganize classification docs for clarity
- Increase classification test timeout for CI reliability

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings January 17, 2026 21:29
@unclesp1d3r unclesp1d3r linked an issue Jan 17, 2026 that may be closed by this pull request
17 tasks
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jan 17, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough
  • New Features

    • Adds comprehensive file and Windows registry path classification to the semantic classifier: POSIX absolute paths, Windows drive-letter paths, UNC network paths, and Windows registry paths (full HKEY_* and abbreviated HK* forms).
    • Validation rules: reject null bytes/newlines/carriage returns, disallow consecutive path separators (with UNC exceptions), enforce a 4096-character maximum, require POSIX absolute roots, require backslashes and valid drive letters for Windows paths, and verify registry root keys. Placeholder/format-string heuristics and separator checks reduce false positives.
    • Flags suspicious persistence/execution locations:
      • POSIX: /etc/cron.d/, /etc/init.d/, /usr/local/bin/, /tmp/, /var/tmp/, /Library/LaunchDaemons/, etc.
      • Windows: C:\Windows\System32, C:\Windows\Temp, Start Menu Startup/AppData Roaming Startup locations, SysWOW64
      • Registry: Run/RunOnce keys, CurrentControlSet Services, Winlogon, Explorer Shell Folders
    • Regex patterns compiled once and cached via lazy_static; RegexCacheAddresses exposes addresses for introspection and cache-stability tests.
  • Architectural Impact

    • Large, concentrated expansion of semantic classification logic in src/classification/semantic.rs (~+753 lines). Path detection, validation, suspiciousness heuristics, and many helper methods are centralized on SemanticClassifier.
    • Documentation updated to reflect a FoundString-based classify(&FoundString) entry point and new per-path helper methods; many planning/docs/config files added.
    • No new runtime dependencies introduced; project uses lazy_static for regex caching and retains current dependency set.
  • API Changes

    • Additive public surface on SemanticClassifier (implemented):
      • classify(&self, string: &FoundString) -> Vec (classify now accepts FoundString).
      • classify_posix_path / classify_windows_path / classify_unc_path / classify_registry_path -> Option.
      • is_valid_posix_path / is_valid_windows_path / is_valid_registry_path -> bool.
      • is_suspicious_posix_path / is_suspicious_windows_path / is_suspicious_registry_path -> bool.
      • regex_cache_addresses() -> RegexCacheAddresses (pub struct exposing cache addresses).
    • Changes are additive. Reviewers should confirm intended visibility/stability of new helpers and update any external call sites/examples to use the FoundString-based classify signature.
  • Binary / Format Handling

    • Validators explicitly reject embedded nulls and control characters to avoid misclassifying binary-contaminated strings.
    • Heuristics (separator normalization checks, placeholder/format-string detection, 4096 length cap) are applied to reduce false positives from formatted strings and binary blobs while aiming to retain true positives.
    • IP/URL detection remains two-stage (regex prefilter + std::net parsing) and coexists with path detection — path rules are designed to avoid accidental overlap with other token types (e.g., version-like numbers, IPv4).
  • Semantic Classification Accuracy & Tests

    • Extensive tests added (tests/classification_integration.rs and many unit tests in semantic.rs) covering:
      • Mixed indicators (URL, Domain, IPv4/IPv6, FilePath, RegistryPath).
      • POSIX, Windows, UNC, and Registry path detection, validation, and suspiciousness heuristics.
      • False-positive/negative scenarios (random data, format strings, version-like numbers).
      • Regex cache stability and a performance micro-benchmark (batch target < 500ms in test).
      • Insta snapshot coverage of text→tags mapping.
    • Review focus: validate regex correctness (UNC semantics, registry abbrev handling), consecutive-separator rules, drive-letter validation, thread-safety of cached regexes, and that the 4096-char/placeholder heuristics do not create unacceptable false negatives for legitimate long paths.
  • Docs, CI & Tooling

    • docs/src/classification.md updated to the new classify(&FoundString) signature and documents helper APIs.
    • Cargo.toml sets rust-version = "1.91" and lib/bin entries were added; rust-toolchain.toml pins toolchain to 1.91.0. CI includes an MSRV job pinned to 1.91.0.
    • Numerous documentation, planning, and configuration files added; no runtime dependency additions.
  • Review Notes / Risks

    • High review effort on src/classification/semantic.rs due to regex complexity and validation logic. Key checks for reviewers:
      • Regex correctness and intended anchors/character classes (POSIX/Windows/UNC/registry).
      • Edge cases: UNC initial backslashes, consecutive-separator handling, allowed vs disallowed control characters, and drive-letter/forward-slash rejection.
      • Performance at scale beyond micro-benchmark and thread-safety of lazy_static regex usage.
      • Confirm that the classify(&FoundString) signature change is propagated where needed and documented examples updated.
    • Confirm CI (just ci-check) and tests pass as requested by the PR author before merging.

Walkthrough

Adds robust POSIX/Windows/UNC file-path and Windows registry detection to the SemanticClassifier (regex cache, validation and suspiciousness heuristics, new public classify_/is_* helpers). Integrates these into classify(), adds extensive integration tests, updates docs/examples, and introduces numerous onboarding, planning, and CI/configuration files.

Changes

Cohort / File(s) Summary
Core classification logic
src/classification/semantic.rs
Large functional addition: RegexCacheAddresses, POSIX/Windows/UNC/registry regexes, validation and suspiciousness heuristics, public APIs (classify_posix_path, classify_windows_path, classify_unc_path, classify_registry_path, is_valid_*, is_suspicious_*), and integration into classify(). High-risk, logic-dense changes.
Integration tests
tests/classification_integration.rs
New comprehensive integration tests (mixed indicators, POSIX/Windows/UNC/Registry coverage, false-positive checks, performance benchmark, regex cache stability, insta snapshot). Review for test assumptions and flakiness.
Docs: classification & examples
docs/src/classification.md, src/classification/mod.rs
Documentation updated: new helper signatures documented, pipeline simplified, examples updated to use file-path examples. Verify docs match implemented APIs.
Tooling & CI
rust-toolchain.toml, .github/workflows/ci.yml, Cargo.toml
Toolchain pinned to Rust 1.91.0; MSRV CI job added; rust-version = "1.91" and explicit [lib]/[[bin]] entries in Cargo.toml. Check CI gating and MSRV expectations.
Configuration & agent docs
.claude/settings.json, /.github/copilot-instructions.md, AGENTS.md, CLAUDE.md, /.coderabbit.yml
New onboarding and agent/automation configuration files; .claude/settings.json enables commit@cc-marketplace plugin. Review for sensitive info and desired defaults.
Planning, specs & project docs
/.kiro/specs/stringy-binary-analyzer/tasks.md, project_plan/specs/*, project_plan/tickets/*, codebase_analysis.md
Many spec and planning documents added/updated (extraction, core flows, technical plan, tickets). Non-functional but important for roadmap alignment.
Repository examples / tests snippets
test_let_chains.rs
New small test/example file demonstrating let-chains; low risk.

Sequence Diagram(s)

(Skipped)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Poem

🧭 Paths and keys now learn to show,
Regex nets where subtle markers go.
Classifier tuned with careful art,
Tests assert each tag and part.
🔍✨

🚥 Pre-merge checks | ✅ 5 | ❌ 3
❌ Failed checks (3 warnings)
Check name Status Explanation Resolution
No Unsafe Code ⚠️ Warning The crate lacks the #![forbid(unsafe_code)] directive in src/lib.rs, leaving it vulnerable to future unsafe code introduction despite the PR introducing none. Add #![forbid(unsafe_code)] as the first non-documentation attribute in src/lib.rs to enforce compile-time unsafe-free constraint.
Ascii Only ⚠️ Warning PR introduces Unicode rightwards arrow (→, U+2192) characters in code comments at lines 502 and 519, violating ASCII-only requirement. Replace → with ASCII equivalents like -> or 'to' to maintain ASCII-only constraint compliance.
File Size Limit ⚠️ Warning src/classification/semantic.rs exceeds 500-line limit at 1,628 lines; documented split strategy not implemented. Refactor semantic.rs into mod.rs with submodules: network.rs, filesystem.rs, identifiers.rs, each under 500 lines with unified re-exports.
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed Title follows Conventional Commits format with type 'feat', valid scope 'classification', and accurately describes the main implementation: file path classification for POSIX, Windows, and registry paths.
Description check ✅ Passed Description provides comprehensive context directly related to the changeset: implementation details, validation rules, test coverage, and architecture. All content is relevant to the file path and registry classification features added.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 70.00%.
Error Handling ✅ Passed Public API functions return Option and bool without Result types, matching project convention of limiting Result to initialization. No panic, unwrap, or expect calls present. Regex compilation errors handled at startup via lazy_static.
Section Weight Consistency ✅ Passed No container parser files were modified in this PR; check not applicable.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch 17-implement-file-path-classification-for-posix-windows-and-registry-paths

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes three minor improvements to the codebase: fixes the MSRV reference from an incorrect 1.91+ to the correct 1.85+, reorganizes classification documentation for better clarity, and increases the classification performance test timeout from 100ms to 500ms to improve CI reliability.

Changes:

  • Updated MSRV references in AGENTS.md and copilot-instructions.md from 1.91+ to 1.85+
  • Reorganized and expanded classification documentation with clearer explanations of patterns, validation rules, and implementation details
  • Increased classification test timeout to 500ms to accommodate slower CI environments

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
AGENTS.md New AI agent guidelines file documenting project standards and development patterns
CLAUDE.md Pointer file referencing AGENTS.md for Claude Code integration
.claude/settings.json Claude Code plugin configuration
.github/copilot-instructions.md Comprehensive Copilot instructions including corrected MSRV of 1.85+
.kiro/specs/stringy-binary-analyzer/tasks.md Updated task tracking to mark path classification as complete
docs/src/classification.md Reorganized documentation with expanded details on patterns, validation, and implementation
src/classification/mod.rs Updated module documentation to reflect current capabilities
src/classification/semantic.rs Large file (1530 lines) implementing path and registry classification with comprehensive tests
tests/classification_integration.rs New integration test file with performance tests and snapshot testing
tests/snapshots/classification_integration__classification_snapshots.snap Snapshot file for classification tests

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread docs/src/classification.md Outdated
Comment thread src/classification/semantic.rs Outdated
Comment on lines +139 to 145
if self.classify_posix_path(&string.text).is_some()
|| self.classify_windows_path(&string.text).is_some()
|| self.classify_unc_path(&string.text).is_some()
{
tags.push(Tag::FilePath);
}

Copy link

Copilot AI Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation shows path classification logic that would prevent duplicate Tag::FilePath entries (lines 139-143), while the actual implementation (semantic.rs lines 383-394) could theoretically add Tag::FilePath multiple times if a string matched multiple path patterns. While path patterns should be mutually exclusive in practice, the documentation's approach is more defensive. Consider either updating the implementation to match the documentation's pattern, or updating the documentation to reflect the actual implementation.

Suggested change
if self.classify_posix_path(&string.text).is_some()
|| self.classify_windows_path(&string.text).is_some()
|| self.classify_unc_path(&string.text).is_some()
{
tags.push(Tag::FilePath);
}
if self.classify_posix_path(&string.text).is_some() {
tags.push(Tag::FilePath);
}
if self.classify_windows_path(&string.text).is_some() {
tags.push(Tag::FilePath);
}
if self.classify_unc_path(&string.text).is_some() {
tags.push(Tag::FilePath);
}

Copilot uses AI. Check for mistakes.
}
let elapsed = start.elapsed();

assert!(elapsed < Duration::from_millis(500));
Copy link

Copilot AI Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test timeout has been increased from 100ms to 500ms (5x increase) for CI reliability. While this may help with flaky tests on slower CI runners, consider whether this masks a performance issue. The test processes 1050 samples (350 iterations × 3 samples each), and taking up to 500ms for this workload may indicate inefficient regex matching. Consider profiling the classification performance or adding a note about expected performance characteristics.

Copilot uses AI. Check for mistakes.
@unclesp1d3r unclesp1d3r changed the title chore: minor docs and test adjustments feat(classification): implement file path classification for POSIX, Windows, and registry paths Jan 17, 2026
@unclesp1d3r
Copy link
Copy Markdown
Member Author

@copilot open a new pull request to apply changes based on the comments in this thread

Run just ci-check before committing. Any failure of a check or test, caused by your changes or not, must be fixed.

Copy link
Copy Markdown
Contributor

Copilot AI commented Jan 17, 2026

@unclesp1d3r I've opened a new pull request, #122, to work on those changes. Once the pull request is ready, I'll request review from you.

…ion performance and consistency (#122)

* Initial plan

* fix: address code review feedback on path classification

- Clarify Windows path separator validation in docs
- Optimize registry path matching with zero-allocation case-insensitive search
- Prevent duplicate Tag::FilePath entries
- Add comment explaining 500ms performance test timeout

Co-authored-by: unclesp1d3r <251112+unclesp1d3r@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: unclesp1d3r <251112+unclesp1d3r@users.noreply.github.com>
coderabbitai[bot]
coderabbitai Bot previously approved these changes Jan 17, 2026
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
This configuration file includes general settings, review guidelines, path filters, and instructions for various modules to streamline code reviews and maintain project standards.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
@coderabbitai coderabbitai Bot added the enhancement New feature or request label Jan 17, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In @.coderabbit.yml:
- Around line 254-256: Add the rust-version field to the Cargo.toml [package]
section to enforce the documented MSRV: add rust-version = "1.85" under the
existing [package] block (alongside fields like edition = "2024") so Cargo will
require Rust 1.85+; update the Cargo.toml [package] metadata and ensure the
value exactly matches "1.85".

In @.github/copilot-instructions.md:
- Around line 9-11: The fenced code block containing "Binary → Format Detection
(goblin) → Container Parsing → String Extraction → Deduplication →
Classification → Ranking → Output" needs a language specifier to satisfy MD040;
update the block in .github/copilot-instructions.md by changing the fence from
``` to a language-tagged fence (e.g., ```text) so the code block becomes ```text
... ``` and retains the same inner content.
- Around line 71-78: The markdown code block for the "Combined scoring formula"
is missing a language specifier; update the triple-backtick fence for that block
in the Combined scoring formula section to include a language tag (e.g., text)
so the block reads ```text and preserves plain formatting; locate the block by
the header "Combined scoring formula" or by the formula lines starting with
"base_score = max(occurrence.original_score)" and change the opening fence to
include the language specifier.
- Line 28: Replace the non-ASCII less-than-or-equal character '≤' in the
markdown heading "**File size limit**: Keep files ≤500-600 lines; split larger
files into focused modules" with an ASCII alternative (e.g., "<=" or the words
"no more than") to comply with the ASCII-only guideline; update the sentence to
read "**File size limit**: Keep files <=500-600 lines; split larger files into
focused modules" or "**File size limit**: Keep files no more than 500-600 lines;
split larger files into focused modules".

In `@src/classification/semantic.rs`:
- Around line 855-859: The code uppercases the input into root_upper via
to_ascii_uppercase() and then calls eq_ignore_ascii_case in the any() closure;
replace the redundant case-insensitive comparison with a simple equality check
against the already-uppercased string (change the any(|valid|
valid.eq_ignore_ascii_case(&root_upper)) usage to a direct equality comparison
such as any(|valid| valid == &root_upper) or equivalent), referencing root_upper
and VALID_REGISTRY_ROOTS and removing eq_ignore_ascii_case.

Comment thread .coderabbit.yml
Comment thread .github/copilot-instructions.md Outdated
Comment thread .github/copilot-instructions.md Outdated
Comment thread .github/copilot-instructions.md
Comment thread src/classification/semantic.rs
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Fix all issues with AI agents
In @.github/copilot-instructions.md:
- Around line 9-11: The documentation line containing "Binary → Format Detection
(goblin) → Container Parsing → String Extraction → Deduplication →
Classification → Ranking → Output" (and any similar occurrences such as the one
noted at 30-30) uses non-ASCII punctuation; replace all Unicode arrows (→) with
ASCII "->" and any em/en dashes (—/–) with a simple hyphen "-" so the file
(.github/copilot-instructions.md) is ASCII-only; search for the string "Binary →
Format Detection" and similar pipeline lines to update them consistently.

In `@Cargo.toml`:
- Line 5: CI is currently testing with dtolnay/rust-toolchain@1.91.0 while
Cargo.toml declares rust-version = "1.85", and rust-toolchain.toml uses channel
= "stable"; update the GitHub workflow(s) to include Rust 1.85 (either add a
matrix entry or a separate job that pins toolchain to 1.85) so the MSRV is
actually compiled in CI, and either pin rust-toolchain.toml to toolchain =
"1.85.0" (or a specific patch) or remove it and rely on the explicit CI
configuration to avoid channel drift.

In `@codebase_analysis.md`:
- Around line 30-32: The directory tree shows a machine-specific Windows path
"D:\Stringy\" which breaks portability; update the repo root in
codebase_analysis.md to a platform-neutral repo-relative root (e.g., "./" or
simply the project directory name) so the tree is portable across environments
and remove the Windows-specific backslash path string "D:\Stringy\" from the
document.

In `@src/classification/semantic.rs`:
- Around line 812-839: The Windows path validator is missing a rejection for
control characters and the registry validator lacks the 4096-length cap; update
is_valid_windows_path to immediately reject any presence of control chars (such
as '\0', '\n', '\r', and other ASCII C0 controls) before performing prefix/regex
checks, and add the same control-char check and a maximum length check (4096) to
the registry validation function (the function around lines 842-869, e.g.,
is_valid_registry_path or similarly named). Implement these checks as simple
early returns using a helper like contains_control_chars(text) or inline byte
scans so regex/prefix matching cannot be fooled by embedded controls; keep the
existing printf-placeholder and prefix logic afterwards.
- Around line 684-701: The classify_unc_path function currently returns
Some(Tag::FilePath) after only a regex check and non-empty server/share; extend
it to perform robust validation: after UNC_PATH_REGEX and extracting
server/share (in classify_unc_path), enforce max-length limits (e.g., <=255 per
component and overall reasonable limit), reject control characters (chars <=
0x1F or 0x7F), reject forward slashes '/' and printf-style placeholders ('%'
followed by format specifiers) anywhere in server or share, reject reserved
names or components consisting solely of dots, and ensure there are no empty
segments (detect double '\\' after the initial two by iterating over parts from
trimmed.split('\\') to confirm every segment is non-empty and valid). Only
return Some(Tag::FilePath) if all these extra validations pass; otherwise return
None.
- Around line 726-735: The path checks in is_suspicious_windows_path are
currently case-sensitive; make them case-insensitive by comparing lowercase
forms: compute a lowered version of text (e.g., text.to_ascii_lowercase()) and
for each SUSPICIOUS_WINDOWS_PATHS entry use its lowered form
(prefix.to_ascii_lowercase()) and then perform the same logic (if prefix starts
with '\\' use lowered_text.contains(lowered_prefix) else
lowered_text.starts_with(lowered_prefix)). Update references to
SUSPICIOUS_WINDOWS_PATHS and the starts_with/contains calls accordingly in
is_suspicious_windows_path.
- Around line 113-121: The two static regexes REGISTRY_PATH_REGEX and
REGISTRY_ABBREV_REGEX only accept uppercase roots, causing lowercase/mixed-case
registry roots to be rejected before is_valid_registry_path() runs; update both
Regex::new patterns to include the case-insensitive inline flag (?i) (e.g.
r"(?i)^HKEY_[A-Z_]+\\[^\x00\n\r]*" and r"(?i)^HK(LM|CU|CR|U|CC)\\[^\x00\n\r]*")
so they match roots case-insensitively while preserving the rest of the pattern
and behavior.

Comment thread .github/copilot-instructions.md
Comment thread Cargo.toml Outdated
name = "stringy"
version = "0.1.0"
edition = "2024"
rust-version = "1.85"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check rust-version in Cargo.toml
echo "=== Cargo.toml rust-version ==="
rg -n 'rust-version' Cargo.toml

# Find and display rust-toolchain files
echo -e "\n=== rust-toolchain files ==="
fd -a 'rust-toolchain' -x cat {}

# Check for .github/workflows for rust version pins
echo -e "\n=== GitHub CI configurations ==="
fd -a '\.yml$' '.github/workflows' 2>/dev/null | head -5 | xargs -I {} sh -c 'echo "File: {}"; rg "toolchain|rust-version|1\.8[0-9]|1\.9[0-9]" {} || true'

# Check for .circleci config
echo -e "\n=== CircleCI config ==="
fd 'config\.yml' '.circleci' 2>/dev/null | xargs -I {} sh -c 'echo "File: {}"; rg "toolchain|rust-version|1\.8[0-9]|1\.9[0-9]" {} || true'

Repository: EvilBit-Labs/Stringy

Length of output: 951


CI toolchain must test against MSRV 1.85, not 1.91.0.

rust-version = "1.85" is correctly set, but the GitHub workflows are pinned to dtolnay/rust-toolchain@1.91.0. This creates a validation gap: MSRV compliance is never tested. The code could fail to compile on Rust 1.85 and the CI would not catch it. Additionally, rust-toolchain.toml uses an unpinned channel = "stable", which drifts over time.

Update CI workflows to test against 1.85 (add a matrix job or separate workflow), and pin rust-toolchain.toml to a fixed version or remove it in favor of explicit CI configuration.

🤖 Prompt for AI Agents
In `@Cargo.toml` at line 5, CI is currently testing with
dtolnay/rust-toolchain@1.91.0 while Cargo.toml declares rust-version = "1.85",
and rust-toolchain.toml uses channel = "stable"; update the GitHub workflow(s)
to include Rust 1.85 (either add a matrix entry or a separate job that pins
toolchain to 1.85) so the MSRV is actually compiled in CI, and either pin
rust-toolchain.toml to toolchain = "1.85.0" (or a specific patch) or remove it
and rely on the explicit CI configuration to avoid channel drift.

Comment thread codebase_analysis.md
Comment thread src/classification/semantic.rs Outdated
Comment thread src/classification/semantic.rs
Comment thread src/classification/semantic.rs Outdated
Comment thread src/classification/semantic.rs
- Introduced detailed documentation for core user interactions in Stringy v1.0.
- Added technical plan outlining architecture, design decisions, and integration strategies.
- Included performance optimization ticket and end-to-end testing strategy.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
@coderabbitai coderabbitai Bot added the documentation Improvements or additions to documentation label Jan 18, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@project_plan/specs/Core_Flows__Stringy_v1.0_User_Interactions.md`:
- Around line 85-93: The markdown has multiple fenced code blocks missing
language identifiers (e.g., the table starting with "String                     
Tags        Score  Section", the block containing
"https://malicious-c2.example.com/api", the YARA rule "rule binary_strings", and
the JSON blocks already noted), which prevents proper syntax highlighting;
update each fenced block in Core_Flows__Stringy_v1.0_User_Interactions.md to
include an appropriate language tag—use "text" for plain output/table blocks
(the String/Tags/Score table and the URL/plain output and summary blocks),
"json" for JSON blocks, and "yara" for YARA rules like "rule binary_strings"—so
every triple-backtick fence becomes ```text, ```json, or ```yara as appropriate.

In `@project_plan/specs/Epic_Brief__Stringy_v1.0_Completion.md`:
- Line 48: The heading "CLI Interface" is tautological; update the spec heading
to use either "CLI" or "Command-line interface" (not both) and ensure the
description beneath (file: src/main.rs) still lists the missing options
(--min-len, --enc, --only-tags, --notags, --top, --json) without the redundant
wording; specifically replace the heading "CLI Interface" with "CLI" or
"Command-line interface" and run a quick pass to remove any other instances of
the redundant phrase in this section.

In `@project_plan/specs/Technical_Plan__Stringy_v1.0_Implementation.md`:
- Around line 99-123: Update the fenced code blocks so they include language
specifiers: change the diagram block to start with ```mermaid and change all
Rust examples to start with ```rust (for example the blocks containing pub
struct FoundString and the other Rust snippets referenced in the document).
Locate the mermaid sequenceDiagram block and each Rust code snippet (e.g.,
declarations and examples like FoundString and the functions in the Rust
examples) and add the appropriate language identifier to the opening
triple-backtick fence to enable proper syntax highlighting.
- Around line 1-510: The SemanticClassifier implementation in
src/classification/semantic.rs is 1542 lines and must be split into a module
with submodules to respect the 500-line limit; refactor by creating
src/classification/semantic/mod.rs that exposes pub struct SemanticClassifier
and pub fn classify(...) and move logical groups of functions/regexes into new
submodules network.rs, filesystem.rs, identifiers.rs, and encoding.rs (each
implementing small helper functions called by SemanticClassifier::classify),
migrate the once_cell regex caches into the appropriate submodule scopes, add
mod declarations and pub use in semantic/mod.rs to re-export types and
functions, and update any imports that referenced symbols in the old file to
point to semantic::... so compilation and behavior (including SymbolDemangler
and RankingEngine callers) remain unchanged.

In `@project_plan/tickets/Implement_CLI_argument_parsing_and_validation.md`:
- Line 5: Replace the redundant phrase "CLI interface" in the ticket description
(the line containing "Complete the CLI interface with all filtering flags,
output format selection, and proper validation.") with a simplified wording such
as "Complete the CLI with all filtering flags, output format selection, and
proper validation." to remove the duplication; update the markdown heading/text
accordingly wherever that exact phrase appears in the document.

Comment on lines +85 to +93
```
String Tags Score Section
https://malicious-c2.example.com/api url 95 .rdata
C:\Windows\System32\kernel32.dll filepath 88 .rdata
core::fmt::Display::fmt export 85 .text
192.168.1.100 ipv4 82 .rodata
{A1B2C3D4-E5F6-7890-ABCD-EF1234567890} guid 80 .data
```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add language identifiers to code blocks for proper rendering.

Multiple code blocks are missing language identifiers, which prevents proper syntax highlighting and may cause rendering issues. Examples, command outputs, and JSON blocks should specify their language.

📝 Suggested fixes

For example output blocks (like lines 85-93), add "text":

-```
+```text
 String                                    Tags        Score  Section

For JSON blocks (like lines 176-179), ensure "json" is specified (already correct in the doc).

For YARA output (lines 213-230), add "yara":

-```
+```yara
 rule binary_strings {

For plain output (lines 269-275), add "text":

-```
+```text
 https://malicious-c2.example.com/api

For summary output (lines 301-311), add "text":

-```
+```text
 [Results table here]

Based on static analysis hints from markdownlint-cli2.

Also applies to: 176-179, 213-230, 269-275, 301-311

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

85-85: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In `@project_plan/specs/Core_Flows__Stringy_v1.0_User_Interactions.md` around
lines 85 - 93, The markdown has multiple fenced code blocks missing language
identifiers (e.g., the table starting with "String                              
Tags        Score  Section", the block containing
"https://malicious-c2.example.com/api", the YARA rule "rule binary_strings", and
the JSON blocks already noted), which prevents proper syntax highlighting;
update each fenced block in Core_Flows__Stringy_v1.0_User_Interactions.md to
include an appropriate language tag—use "text" for plain output/table blocks
(the String/Tags/Score table and the URL/plain output and summary blocks),
"json" for JSON blocks, and "yara" for YARA rules like "rule binary_strings"—so
every triple-backtick fence becomes ```text, ```json, or ```yara as appropriate.


4. **Output Formatting** (file:src/output/): Only basic interfaces exist. Users cannot get JSONL for automation, human-readable tables for manual review, or YARA-friendly output for rule creation.

5. **CLI Interface** (file:src/main.rs): Missing filtering options (--min-len, --enc, --only-tags, --notags, --top) and output format selection (--json).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Tautological phrasing (“CLI Interface”).

“CLI” already means “Command Line Interface.” Consider “CLI” or “command-line interface” to avoid redundancy.

🧰 Tools
🪛 LanguageTool

[style] ~48-~48: This phrase is redundant (‘I’ stands for ‘Interface’). Use simply “CLIInterface”.
Context: ...riendly output for rule creation. 5. CLI Interface (file:src/main.rs): Missing filtering...

(ACRONYM_TAUTOLOGY)

🤖 Prompt for AI Agents
In `@project_plan/specs/Epic_Brief__Stringy_v1.0_Completion.md` at line 48, The
heading "CLI Interface" is tautological; update the spec heading to use either
"CLI" or "Command-line interface" (not both) and ensure the description beneath
(file: src/main.rs) still lists the missing options (--min-len, --enc,
--only-tags, --notags, --top, --json) without the redundant wording;
specifically replace the heading "CLI Interface" with "CLI" or "Command-line
interface" and run a quick pass to remove any other instances of the redundant
phrase in this section.

Comment on lines +1 to +510
# Technical Plan: Stringy v1.0 Implementation

## Overview

This technical plan defines the architecture for completing Stringy v1.0, building on the existing foundation of format detection, container parsing, and string extraction. The implementation adds semantic classification, intelligent ranking, symbol demangling, flexible output formatting, and CLI orchestration.

## Architectural Approach

### Core Design Decisions

**1. Pipeline Architecture**

The main execution flow will be encapsulated in a `Pipeline` struct that orchestrates the entire analysis process. This provides:

- Clear entry point for the analysis workflow
- Centralized error handling and recovery
- Testability through dependency injection
- Progress tracking integration

**Trade-off**: Struct-based approach adds a layer of abstraction but provides better testability and maintainability compared to functional composition in main().

**2. Enum-Based Output Formatting**

Output formatters will use an enum-based approach with a single format() function that matches on the output type. This provides:

- Simplicity and directness for the 3 required formats
- Easy to understand and maintain
- No trait abstraction overhead
- Sufficient for current requirements

**Trade-off**: Less extensible than trait-based approach, but simpler and more appropriate for the limited number of formatters. Future formats can be added as enum variants.

**3. Memory-Mapped File I/O with Fallback**

File reading will attempt memory mapping first, with automatic fallback to regular file reading on failure. This provides:

- Efficient memory-mapped access for most cases
- Robustness for edge cases (network filesystems, locked files, platform limitations)
- Consistent behavior across all file sizes
- Zero-copy access when possible

**Trade-off**: Slightly more complex than always-on mmap, but handles real-world failure scenarios gracefully.

**4. Modern Regex Caching**

Migrate from `lazy_static` to `once_cell` for regex pattern caching. This provides:

- More modern, ergonomic API
- Better compile-time guarantees
- Consistent with Rust ecosystem trends
- Simpler initialization patterns

**Trade-off**: Requires dependency migration but improves code quality and maintainability.

**5. Rich Progress Feedback**

Use `indicatif` library for progress indicators. This provides:

- Professional progress bars and spinners
- Automatic TTY detection
- Minimal code for rich feedback
- Consistent user experience

**Trade-off**: Adds external dependency but provides significantly better UX than manual eprintln! calls.

### Technical Constraints

**Codebase Standards:**

- No `unsafe` code (`#![forbid(unsafe_code)]` enforced)
- Zero warnings (`cargo clippy -- -D warnings` must pass)
- ASCII-only text (no Unicode punctuation or emojis)
- File size limit: 500 lines per file (split larger files)
- No blanket `#[allow]` attributes

**Module Organization:**

- file:src/classification/ - Semantic analysis and ranking
- file:src/output/ - Output formatters
- file:src/main.rs - CLI and pipeline orchestration
- file:src/lib.rs - Public API and re-exports

**Error Handling:**

- Use `thiserror` for all error types
- Include context (offsets, section names, file paths)
- Graceful degradation where possible
- Clear error messages for user-facing errors

### Integration Strategy

The new components integrate with existing infrastructure:

1. **Classification Integration**: New semantic patterns and symbol demangling extend existing `SemanticClassifier` in file:src/classification/semantic.rs
2. **Ranking Integration**: New `RankingEngine` in src/classification/ranking.rs consumes `FoundString` objects with section weights from container parsers
3. **Output Integration**: New formatters in file:src/output/ consume ranked `Vec<FoundString>` from pipeline
4. **CLI Integration**: file:src/main.rs orchestrates all components through `Pipeline` struct

```mermaid
sequenceDiagram
participant CLI as main.rs
participant Pipeline
participant Container as container/*
participant Extractor as extraction/*
participant Classifier as classification/*
participant Ranker as ranking.rs
participant Formatter as output/*

CLI->>Pipeline: new(config)
CLI->>Pipeline: run(file_path)
Pipeline->>Container: detect_format() & parse()
Container-->>Pipeline: ContainerInfo
Pipeline->>Extractor: extract(data, container_info)
Extractor-->>Pipeline: Vec<FoundString>
Pipeline->>Classifier: classify(strings)
Classifier->>Classifier: demangle_symbols()
Classifier-->>Pipeline: Vec<FoundString> with tags
Pipeline->>Ranker: calculate_scores(strings)
Ranker-->>Pipeline: Vec<FoundString> with scores
Pipeline->>Pipeline: apply_filters() & sort()
Pipeline->>Formatter: format(strings)
Formatter-->>CLI: Output to stdout
```

---

## Data Model

### FoundString Enhancement

Extend the existing `FoundString` struct in file:src/types.rs to include score breakdown for transparency and debugging:

```rust
pub struct FoundString {
// Existing fields
pub text: String,
pub encoding: Encoding,
pub offset: u64,
pub rva: Option<u64>,
pub section: Option<String>,
pub length: u32,
pub tags: Vec<Tag>,
pub score: i32, // Final calculated score
pub source: StringSource,
pub confidence: f32,

// New fields for symbol demangling
pub original_text: Option<String>, // Original mangled form (if demangled)

// Optional debug fields (only populated with --debug flag)
pub section_weight: Option<i32>, // Score from section type
pub semantic_boost: Option<i32>, // Bonus from semantic tags
pub noise_penalty: Option<i32>, // Penalty from noise detection
}
```

**Rationale**:

- `original_text` preserves the mangled symbol for cross-referencing and recovery
- Breakdown fields (section_weight, semantic_boost, noise_penalty) are optional to avoid exposing internal implementation details in the public API
- With --debug flag, breakdown fields are populated for debugging and validation
- Without --debug, breakdown fields remain None, keeping the API simple and flexible

### Ranking Configuration

New configuration struct for ranking parameters in src/classification/ranking.rs:

```rust
pub struct RankingConfig {
pub section_weights: HashMap<SectionType, i32>,
pub tag_boosts: HashMap<Tag, i32>,
pub noise_penalty_config: NoisePenaltyConfig,
}

pub struct NoisePenaltyConfig {
pub high_entropy_penalty: i32,
pub excessive_length_penalty: i32,
pub repeated_pattern_penalty: i32,
}
```

**Integration**: `RankingConfig` uses hardcoded sensible defaults. No user configuration is provided - the defaults are designed to work well across all use cases (malware analysis, reverse engineering, general analysis).

### Filter Configuration

New configuration struct for CLI filtering in file:src/main.rs:

```rust
pub struct FilterConfig {
pub min_length: Option<usize>,
pub encodings: Option<Vec<Encoding>>,
pub include_tags: Option<Vec<Tag>>,
pub exclude_tags: Option<Vec<Tag>>,
pub top_n: Option<usize>,
}
```

**Integration**: Built from CLI arguments, passed to Pipeline. Pipeline applies filters after ranking using iterator adapters.

### Output Formatter Interface

Trait definition for output formatters in file:src/output/mod.rs:

```rust
pub enum OutputFormat {
Table,
Json,
Yara,
}

pub struct OutputMetadata {
pub binary_name: String,
pub binary_format: BinaryFormat,
pub total_strings: usize,
pub filtered_strings: usize,
}

pub fn format_output(
format: OutputFormat,
strings: &[FoundString],
metadata: &OutputMetadata,
) -> Result<String>;
```

**Rationale**: Enum-based design is simpler and more direct for the three required formats. The format_output function matches on the enum and delegates to format-specific logic. Metadata provides context for formatters to include summary information.

### Tag Enum Extensions

Add new variants to existing `Tag` enum in file:src/types.rs with specificity levels:

```rust
pub enum Tag {
// Existing tags (specific)
Url, Domain, IPv4, IPv6, FilePath, RegistryPath,
Import, Export, Version, Manifest, Resource,
DylibPath, Rpath, RpathVariable, FrameworkPath,

// New specific tags for v1.0
Guid, // GUIDs/UUIDs (specific)
Email, // Email addresses (specific)
FormatString, // Printf-style format strings (specific)
UserAgent, // User agent strings (specific)
DemangledSymbol, // Demangled Rust/C++ symbols (specific)

// Broad/ambiguous tags
Base64, // Base64-encoded data (broad - many false positives)
}
```

**Tag Specificity**: Tags are categorized as specific (high confidence, low false positives) or broad (lower confidence, higher false positives). A string can have multiple tags. Specific tags like Email are prioritized over broad tags like Base64 in display and ranking.

**Integration**: New tags follow existing pattern. Classification logic in file:src/classification/semantic.rs will be extended to detect these patterns. False negatives are worse than false positives - we prefer to tag liberally.

---

## Component Architecture

### 1. Ranking Engine (src/classification/ranking.rs)

**Purpose**: Calculate relevance scores for strings based on multiple factors.

**Interface**:

```rust
pub struct RankingEngine {
config: RankingConfig,
}

impl RankingEngine {
pub fn new(config: RankingConfig) -> Self;
pub fn calculate_score(&self, string: &mut FoundString);
pub fn rank_strings(&self, strings: &mut [FoundString]);
}
```

**Responsibilities**:

- Apply section weight scoring based on `SectionType`
- Apply semantic boost scoring based on tags
- Calculate noise penalties from confidence scores
- Populate score breakdown fields
- Sort strings by final score

**Integration**: Consumes `FoundString` objects after classification, uses section weights from `ContainerInfo`, applies tag-based boosts.

**File Size**: Keep under 500 lines. If scoring logic exceeds limit, split into src/classification/ranking/mod.rs with submodules for section_weights.rs, semantic_boosts.rs, noise_penalties.rs.

### 2. Symbol Demangling (src/classification/symbols.rs)

**Purpose**: Demangle Rust and C++ symbols to human-readable form while preserving original.

**Interface**:

```rust
pub struct SymbolDemangler {
// Uses rustc-demangle crate
}

impl SymbolDemangler {
pub fn new() -> Self;
pub fn demangle(&self, string: &mut FoundString);
pub fn is_mangled(&self, symbol: &str) -> bool;
}
```

**Responsibilities**:

- Detect mangled Rust symbols (starts with `_ZN` or `_R`)
- Demangle using `rustc-demangle` crate
- Store original mangled form in `FoundString.original_text`
- Replace `FoundString.text` with demangled version
- Tag demangled symbols with `DemangledSymbol` tag
- Handle demangling failures gracefully (keep original text, no tag)

**Integration**: Called during classification phase. Processes strings with `Import` or `Export` tags. Modifies FoundString in-place, preserving original in original_text field.

**Dependency**: Add `rustc-demangle` to Cargo.toml.

### 3. Semantic Classification Extensions (file:src/classification/semantic.rs)

**Purpose**: Extend existing classifier with new pattern detection.

**New Patterns**:

- **GUID**: Regex for standard GUID format `{XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}` (specific)
- **Email**: Regex with basic validation ([user@domain.tld](mailto:user@domain.tld)) (specific)
- **Base64**: Pattern detection for Base64-encoded data (length, character set) (broad - many false positives)
- **Format String**: Detection of printf-style format specifiers (%s, %d, %x, etc.) (specific)
- **User Agent**: Pattern matching for common user agent strings (specific)

**Tag Specificity Strategy**:

- Apply all matching patterns - a string can have multiple tags
- Prefer false positives over false negatives (better to tag liberally)
- Specific tags (Email, GUID, FormatString) have higher confidence
- Broad tags (Base64) are applied with lower confidence but still useful
- Tag priority for display handles showing most relevant tags first

**Integration**: Extend existing `SemanticClassifier::classify()` method with new pattern checks. Use `once_cell` for regex caching.

**File Size**: Current file is approaching 500 lines. If additions exceed limit, split into src/classification/semantic/mod.rs with submodules for network.rs, filesystem.rs, identifiers.rs, encoding.rs.

### 4. Output Formatters (file:src/output/)

**Module Structure**:

- file:src/output/mod.rs - OutputFormat enum and format_output() function
- src/output/table.rs - Table formatting logic
- src/output/json.rs - JSONL formatting logic
- src/output/yara.rs - YARA rule generation logic

**Table Formatter** (src/output/table.rs):

- Detect TTY vs non-TTY output using `atty` or `std::io::IsTerminal`
- TTY: Format as aligned table with columns (String | Tags | Score | Section)
- Non-TTY: Output plain strings, one per line
- Handle long strings (show in full, terminal wraps)
- Show primary tags (comma-separated if multiple at same priority)

**JSON Formatter** (src/output/json.rs):

- Output JSONL (one JSON object per line)
- Include all `FoundString` fields (text, encoding, offset, rva, section, length, tags, score, source, confidence)
- Include original_text if present (demangled symbols)
- Include score breakdown fields only if populated (--debug mode)
- Proper escaping for JSON strings via serde_json

**YARA Formatter** (src/output/yara.rs):

- Generate complete YARA rule template
- Sanitize binary filename for rule name (replace non-alphanumeric with underscore, remove extension, add "_strings" suffix)
- Include metadata section (description, tool, date, file hash)
- Escape strings according to YARA syntax (backslashes, quotes, newlines)
- Skip strings over 200 characters with comment: "// Skipped: too long (N chars)"
- Include both `ascii` and `wide` modifiers for compatibility

**Integration**: Pipeline calls format_output() with selected OutputFormat enum variant. The function matches on the enum and delegates to the appropriate formatting module.

### 5. Pipeline Orchestration (file:src/main.rs)

**Purpose**: Coordinate the entire analysis workflow including filtering.

**Structure**:

```rust
pub struct Pipeline {
config: PipelineConfig,
progress: ProgressBar, // from indicatif
}

pub struct PipelineConfig {
extraction_config: ExtractionConfig,
ranking_config: RankingConfig,
filter_config: FilterConfig,
debug_mode: bool,
}

impl Pipeline {
pub fn new(config: PipelineConfig) -> Self;
pub fn run(&mut self, file_path: &Path) -> Result<Vec<FoundString>>;
}
```

**Workflow**:

1. Display "Parsing..." progress indicator
2. Attempt memory-map file using `memmap2`, fall back to `std::fs::read()` on failure
3. Detect format and parse container (fail fast on error)
4. Display "Extracting..." progress indicator
5. Extract strings using `BasicExtractor` (fail fast on critical errors)
6. Display "Classifying..." progress indicator
7. Apply semantic classification (graceful degradation - skip failed strings)
8. Apply symbol demangling (graceful degradation - keep original on failure)
9. Display "Ranking..." progress indicator
10. Calculate scores using `RankingEngine` (populate breakdown fields if debug_mode)
11. Apply filters from FilterConfig (min-len, encoding, tags)
12. Sort by score and apply --top limit
13. Format output using selected OutputFormat enum
14. Write to stdout

**Error Handling Strategy**:

*Critical Stages (fail fast):*

- File access: Exit with error if file not found or cannot be read
- Format detection: Exit with error if format unsupported
- Container parsing: Exit with error if binary is corrupted or invalid

*Optional Stages (graceful degradation):*

- Classification: If classification fails on individual strings, skip those strings and continue
- Symbol demangling: If demangling fails, keep original text and continue
- Ranking: If ranking fails, output unranked strings with warning
- Formatting: If primary formatter fails, attempt plain text fallback

*Recovery Strategy*:

- Memory mapping failure: Automatically fall back to regular file reading
- Partial results: If some strings are processed successfully, output them with warning about failures
- No strings found: Display informational message to stderr, exit 0 (not an error)

**Progress Feedback**: Use `indicatif::ProgressBar` with spinner style. Progress messages go to stderr, results to stdout.

**CLI Filtering**: Filter logic is part of `Pipeline::run()` to keep main.rs under 500-line limit:

- Pipeline receives filter configuration from CLI args
- Filters applied after ranking, before output formatting
- Uses iterator adapters: `filter()` for criteria, `take()` for --top
- Filter validation happens during Pipeline initialization

**Integration**: Pipeline owns the entire process including filtering. CLI argument parsing uses `clap` derive macros. Main function is minimal: parse args, create Pipeline, call run(), handle output.

### 6. Performance Optimizations

**Memory Mapping** (file:src/main.rs):

- Attempt `memmap2::Mmap` first for efficient access
- On mmap failure (network filesystem, locked file, platform limitations), fall back to `std::fs::read()`
- Pass byte slice to container parsers (works with both mmap and regular read)
- Log fallback to regular file reading for user awareness

**Regex Caching** (file:src/classification/semantic.rs):

- Migrate from `lazy_static` to `once_cell::sync::Lazy`
- Pre-compile all regex patterns at first use
- Share compiled patterns across all classification calls

**Dependency Additions**:

- `memmap2` - Memory-mapped file I/O with fallback
- `once_cell` - Modern lazy initialization (migrate from lazy_static)
- `indicatif` - Progress bars and spinners
- `rustc-demangle` - Rust symbol demangling
- `atty` or use `std::io::IsTerminal` - TTY detection for output formatting

### Integration Points Summary


| Component | Consumes | Produces | Integration Point |
| ------------------ | ------------------------------------ | ----------------------------- | ----------------------------------------------- |
| Pipeline | CLI args, file path | Formatted output | Orchestrates all components + filtering |
| RankingEngine | Vec&lt;FoundString&gt;, debug flag | Scored Vec&lt;FoundString&gt; | Called after classification |
| SymbolDemangler | &mut FoundString | () | Called during classification, modifies in-place |
| SemanticClassifier | FoundString | Vec&lt;Tag&gt; | Extended with new patterns |
| format_output() | OutputFormat, Vec&lt;FoundString&gt; | String | Enum-based dispatch to formatters |


### Testing Strategy

**Unit Tests**:

- Ranking: Test score calculation with known inputs
- Symbol demangling: Test with mangled/unmangled symbols
- Semantic patterns: Test each new pattern with positive/negative cases
- Output formatters: Test with sample data, verify format correctness

**Integration Tests**:

- End-to-end pipeline tests with fixture binaries
- CLI argument parsing and filtering
- Output format validation with `insta` snapshots
- Error handling scenarios

**Benchmarks**:

- Ranking performance with large string sets
- Regex pattern matching performance
- Memory mapping vs regular file I/O
- Overall pipeline throughput

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "semantic.rs" -type f | head -5

Repository: EvilBit-Labs/Stringy

Length of output: 96


🏁 Script executed:

wc -l src/classification/semantic.rs

Repository: EvilBit-Labs/Stringy

Length of output: 99


Split src/classification/semantic.rs to meet the 500-line file size constraint.

The semantic classifier is 1542 lines, exceeding the 500-line limit defined in the technical plan (lines 73–74). The plan explicitly documents the splitting strategy at line 341: split into src/classification/semantic/mod.rs with submodules for network.rs, filesystem.rs, identifiers.rs, and encoding.rs. Refactor accordingly to maintain architectural consistency and improve maintainability.

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

125-125: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


59-59: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


317-317: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


85-85: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


221-221: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


277-277: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


311-311: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


32-32: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


11-11: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


22-22: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


33-33: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


44-44: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


55-55: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


485-485: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


125-125: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In `@project_plan/specs/Technical_Plan__Stringy_v1.0_Implementation.md` around
lines 1 - 510, The SemanticClassifier implementation in
src/classification/semantic.rs is 1542 lines and must be split into a module
with submodules to respect the 500-line limit; refactor by creating
src/classification/semantic/mod.rs that exposes pub struct SemanticClassifier
and pub fn classify(...) and move logical groups of functions/regexes into new
submodules network.rs, filesystem.rs, identifiers.rs, and encoding.rs (each
implementing small helper functions called by SemanticClassifier::classify),
migrate the once_cell regex caches into the appropriate submodule scopes, add
mod declarations and pub use in semantic/mod.rs to re-export types and
functions, and update any imports that referenced symbols in the old file to
point to semantic::... so compilation and behavior (including SymbolDemangler
and RankingEngine callers) remain unchanged.

Comment on lines +99 to +123
```mermaid
sequenceDiagram
participant CLI as main.rs
participant Pipeline
participant Container as container/*
participant Extractor as extraction/*
participant Classifier as classification/*
participant Ranker as ranking.rs
participant Formatter as output/*

CLI->>Pipeline: new(config)
CLI->>Pipeline: run(file_path)
Pipeline->>Container: detect_format() & parse()
Container-->>Pipeline: ContainerInfo
Pipeline->>Extractor: extract(data, container_info)
Extractor-->>Pipeline: Vec<FoundString>
Pipeline->>Classifier: classify(strings)
Classifier->>Classifier: demangle_symbols()
Classifier-->>Pipeline: Vec<FoundString> with tags
Pipeline->>Ranker: calculate_scores(strings)
Ranker-->>Pipeline: Vec<FoundString> with scores
Pipeline->>Pipeline: apply_filters() & sort()
Pipeline->>Formatter: format(strings)
Formatter-->>CLI: Output to stdout
```
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Minor: Add language specifiers to code blocks.

Several code blocks are missing language specifiers, which affects syntax highlighting and readability. The mermaid diagram (lines 99-123) and Rust code examples (lines 133-154, 168-179, 204-223, 264-273, 294-303, 386-401) should have appropriate language identifiers.

Example fix for Rust code blocks

Change from:

```
pub struct FoundString {

To:

```rust
pub struct FoundString {

Also applies to: 133-154, 168-179, 204-223, 264-273, 294-303, 386-401

🤖 Prompt for AI Agents
In `@project_plan/specs/Technical_Plan__Stringy_v1.0_Implementation.md` around
lines 99 - 123, Update the fenced code blocks so they include language
specifiers: change the diagram block to start with ```mermaid and change all
Rust examples to start with ```rust (for example the blocks containing pub
struct FoundString and the other Rust snippets referenced in the document).
Locate the mermaid sequenceDiagram block and each Rust code snippet (e.g.,
declarations and examples like FoundString and the functions in the Rust
examples) and add the appropriate language identifier to the opening
triple-backtick fence to enable proper syntax highlighting.


## Objective

Complete the CLI interface with all filtering flags, output format selection, and proper validation.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Minor: Redundant phrasing.

"CLI interface" is redundant since CLI already stands for "Command Line Interface". Consider simplifying to "Complete the CLI with all filtering flags..."

📝 Suggested fix
-Complete the CLI interface with all filtering flags, output format selection, and proper validation.
+Complete the CLI with all filtering flags, output format selection, and proper validation.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Complete the CLI interface with all filtering flags, output format selection, and proper validation.
Complete the CLI with all filtering flags, output format selection, and proper validation.
🧰 Tools
🪛 LanguageTool

[style] ~5-~5: This phrase is redundant (‘I’ stands for ‘interface’). Use simply “CLI”.
Context: ... validation ## Objective Complete the CLI interface with all filtering flags, output format...

(ACRONYM_TAUTOLOGY)

🤖 Prompt for AI Agents
In `@project_plan/tickets/Implement_CLI_argument_parsing_and_validation.md` at
line 5, Replace the redundant phrase "CLI interface" in the ticket description
(the line containing "Complete the CLI interface with all filtering flags,
output format selection, and proper validation.") with a simplified wording such
as "Complete the CLI with all filtering flags, output format selection, and
proper validation." to remove the duplication; update the markdown heading/text
accordingly wherever that exact phrase appears in the document.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Adjusted numbering format in user interaction flows for consistency.
- Enhanced clarity in the Epic Brief by refining problem statements.
- Improved formatting in the Technical Plan for better readability.
- Updated performance optimization ticket to reflect new dependencies.
- Enhanced clarity in the pipeline orchestration ticket scope.
- Refined semantic classification ticket to include additional patterns.
- Improved integration testing ticket to cover more binary types.
- Updated FoundString data model ticket for clarity on new fields.
- Enhanced CLI argument parsing ticket to include all flags.
- Improved output formatters ticket to clarify scope and dependencies.
- Refined ranking system ticket to ensure clarity on scoring logic.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Bump rust-version in Cargo.toml to 1.91
- Update rust-toolchain.toml to use channel 1.91.0
- Enhance semantic classification for Windows paths with case-insensitivity and additional validation checks
- Add new test file for let chains example

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
@unclesp1d3r unclesp1d3r merged commit 0f580b4 into main Jan 18, 2026
18 of 20 checks passed
@unclesp1d3r unclesp1d3r deleted the 17-implement-file-path-classification-for-posix-windows-and-registry-paths branch January 18, 2026 02:39
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Fix all issues with AI agents
In `@project_plan/specs/Epic_Brief__Stringy_v1.0_Completion.md`:
- Around line 11-27: Replace bold pseudo-headings (e.g., "**Primary Users:**"
and "**Problem 1: Signal vs Noise**", "**Problem 2: Cryptic Symbols**",
"**Problem 3: No Prioritization**") with proper Markdown headings (for example
"### Primary Users" and "#### Problem 1: Signal vs Noise") so they are
recognized as real headings by markdownlint MD036; update each similar bold
label in the file to an appropriate heading level consistent with existing
structure (e.g., "###" for section titles, "####" for sub-sections) and remove
the surrounding bold markers.

In
`@project_plan/tickets/Add_performance_optimizations_and_dependency_updates.md`:
- Around line 9-28: The document uses bold labels "**In Scope:**" and "**Out of
Scope:**" instead of markdown headings; update the file to use proper markdown
headings (e.g., "## In Scope" and "## Out of Scope") and convert the subsection
lists under them to be clearly nested under those headings so the structure is
semantic and parsable; ensure the header level matches the document hierarchy
(use "###" for sublists if needed) and preserve all listed items (memmap2,
once_cell, indicatif, rustc-demangle, memory mapping fallback behavior,
lazy_static migration, indicatif integration, benchmarks, docs update, and the
out-of-scope notes) while removing the bolded label style.

In `@project_plan/tickets/End-to-end_integration_testing_and_validation.md`:
- Around line 9-31: Replace the bolded section labels "**In Scope:**" and "**Out
of Scope:**" with proper Markdown headings (for example "## In Scope" and "##
Out of Scope") in the document so they are real headings instead of emphasized
text; update the two occurrences matching the literal phrases "In Scope:" and
"Out of Scope:" in the file to use heading syntax and keep the existing lists
and content unchanged to satisfy markdownlint MD036.

In
`@project_plan/tickets/Extend_FoundString_data_model_for_demangling_and_debug_support.md`:
- Around line 9-21: Replace the bold label lines "**In Scope:**" and "**Out of
Scope:**" with proper markdown headings (e.g., "## In Scope" and "## Out of
Scope") so they conform to markdownlint MD036; update the surrounding lists
under the headings as needed to preserve the same content and spacing and ensure
serialization/tests/docs references to these sections still read the new heading
text.

In
`@project_plan/tickets/Implement_enum-based_output_formatters_`(Table,_JSON,_YARA).md:
- Around line 9-36: Replace the bold labels "**In Scope:**" and "**Out of
Scope:**" with proper markdown headings (e.g., "## In Scope" and "## Out of
Scope") so the document uses headings instead of bold text for section titles;
update the file's sections where the enum-based output formatter plan defines
scope and out-of-scope items (the sections currently titled "In Scope:" and "Out
of Scope:") and ensure any similar bolded section labels in that file are
converted to appropriate heading levels for compliance with MD036.

In `@src/classification/semantic.rs`:
- Around line 171-180: KNOWN_WINDOWS_PREFIXES currently only lists C:\ paths so
paths on other drive letters won't match; update the heuristic by either
replacing the fixed HashSet with a drive-letter pattern check (e.g., test for a
leading "[A-Za-z]:\" prefix) or expand the set to include other drive letters
(generate "D:\\", "E:\\", etc.), and update any callers that rely on
KNOWN_WINDOWS_PREFIXES accordingly (reference: KNOWN_WINDOWS_PREFIXES in
src/classification/semantic.rs).
- Around line 829-842: The contains_printf_placeholder function currently only
detects a small set of single-letter specifiers; update it to also catch
security-relevant specifiers like %p and %n (and optionally other common
specifiers such as %i, %c, %e, %g and length-augmented forms like %ld/%lld).
Modify contains_printf_placeholder to, after seeing '%', consume any
flags/width/precision/length tokens (or switch to a small regex) and then check
the final specifier character(s) for 'p' or 'n' (and the additional letters you
want to support) instead of only peeking one char; reference the
contains_printf_placeholder function to locate and change the '%' handling
logic. Ensure the function still returns true on detection and false otherwise.
♻️ Duplicate comments (3)
project_plan/tickets/Implement_CLI_argument_parsing_and_validation.md (1)

5-5: Redundant phrasing: "CLI interface"

CLI already stands for "Command Line Interface." Simplify to "Complete the CLI with all filtering flags..."

project_plan/specs/Core_Flows__Stringy_v1.0_User_Interactions.md (1)

85-93: Add language identifiers to fenced code blocks.

Code blocks at lines 85, 215, and 271 are missing language specifiers. Use text for output examples and table formats.

project_plan/specs/Technical_Plan__Stringy_v1.0_Implementation.md (1)

1-519: Technical plan is comprehensive and well-structured.

The architectural decisions are sound for a binary analysis tool. The pipeline architecture, graceful degradation strategy, and explicit error handling boundaries align with the security-focused nature of the project.

A few notes on consistency with current implementation:

  1. once_cell migration (lines 44-53, 474-478): The plan specifies migrating from lazy_static to once_cell, but Cargo.toml still lists lazy_static as a dependency. The PR objectives confirm lazy_static is the current implementation. Consider adding a tracking note or ticket reference for when this migration should occur.

  2. File size constraint (lines 73-74, 353): The 500-line limit and splitting strategy for semantic.rs is documented here, but as noted in a previous review, the file currently exceeds this limit.

The language specifier issues on fenced code blocks have already been flagged in previous reviews.

Comment on lines +11 to +27
**Primary Users:**

- **Security researchers and malware analysts** who need to quickly identify indicators of compromise (IOCs), behavioral patterns, and malicious infrastructure in binaries
- **Reverse engineers** who need to understand binary functionality, dependencies, and internal structure through string analysis
- **Open-source community** seeking a modern, intelligent alternative to decades-old tools like GNU strings
- **DevOps and security teams** who need to integrate string extraction into automated analysis pipelines

These users share a common need: efficient, intelligent binary analysis that surfaces meaningful information without manual noise filtering.

### Current Pain Points

**Problem 1: Signal vs Noise** Traditional tools like GNU strings extract every printable character sequence, producing thousands of results where 90%+ are meaningless - padding bytes, binary tables, random data that happens to be printable. Users must manually scan through this noise to find the 10% that matters: URLs, file paths, registry keys, function names. This manual filtering is time-consuming, error-prone, and doesn't scale.

**Problem 2: Cryptic Symbols** Modern binaries contain mangled symbols (especially Rust, C++) that appear as cryptic strings like `_ZN4core3fmt3num52_$LT$impl$`. Without demangling, users cannot understand what functions are called, what libraries are used, or what the binary actually does. This forces users to copy-paste symbols into external demangling tools, breaking their workflow.

**Problem 3: No Prioritization** Even when users find potentially interesting strings, they have no way to know which ones are most important. Is this URL critical infrastructure or a help link? Is this file path a configuration file or a debug artifact? Without ranking and context, users waste time investigating low-value strings while missing critical indicators.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Replace bold pseudo-headings with real headings (markdownlint MD036).

Line 11 and the “Problem 1–4” labels (Lines 22–28) are bold text acting as headings. This trips MD036; please promote them to actual headings.

Suggested fix (apply similarly to all bold pseudo-headings)
-**Primary Users:**
+### Primary Users

-**Problem 1: Signal vs Noise** Traditional tools like GNU strings extract every printable character sequence, producing thousands of results where 90%+ are meaningless - padding bytes, binary tables, random data that happens to be printable. Users must manually scan through this noise to find the 10% that matters: URLs, file paths, registry keys, function names. This manual filtering is time-consuming, error-prone, and doesn't scale.
+#### Problem 1: Signal vs Noise
+Traditional tools like GNU strings extract every printable character sequence, producing thousands of results where 90%+ are meaningless - padding bytes, binary tables, random data that happens to be printable. Users must manually scan through this noise to find the 10% that matters: URLs, file paths, registry keys, function names. This manual filtering is time-consuming, error-prone, and doesn't scale.
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

11-11: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


22-22: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🤖 Prompt for AI Agents
In `@project_plan/specs/Epic_Brief__Stringy_v1.0_Completion.md` around lines 11 -
27, Replace bold pseudo-headings (e.g., "**Primary Users:**" and "**Problem 1:
Signal vs Noise**", "**Problem 2: Cryptic Symbols**", "**Problem 3: No
Prioritization**") with proper Markdown headings (for example "### Primary
Users" and "#### Problem 1: Signal vs Noise") so they are recognized as real
headings by markdownlint MD036; update each similar bold label in the file to an
appropriate heading level consistent with existing structure (e.g., "###" for
section titles, "####" for sub-sections) and remove the surrounding bold
markers.

Comment on lines +9 to +28
**In Scope:**

- Add dependencies to Cargo.toml:
- memmap2 for memory-mapped file I/O
- once_cell for modern lazy initialization
- indicatif for progress bars
- rustc-demangle for symbol demangling
- Implement memory mapping with fallback in Pipeline:
- Attempt memmap2::Mmap first
- Fall back to std::fs::read() on failure
- Log fallback for user awareness
- Migrate all lazy_static usage to once_cell::sync::Lazy
- Integrate indicatif progress indicators in Pipeline
- Add benchmarks for performance validation
- Update documentation with performance characteristics

**Out of Scope:**

- Pipeline implementation (separate ticket, but this ticket provides the tools)
- Classification implementation (separate ticket, but this ticket provides regex caching)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use proper markdown headings for subsections.

The "In Scope:" and "Out of Scope:" labels use bold emphasis instead of proper headings. While functional, this deviates from markdown best practices and makes the document structure less clear.

📝 Proposed fix
-**In Scope:**
+### In Scope
 
 - Add dependencies to Cargo.toml:
-**Out of Scope:**
+### Out of Scope
 
 - Pipeline implementation (separate ticket, but this ticket provides the tools)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
**In Scope:**
- Add dependencies to Cargo.toml:
- memmap2 for memory-mapped file I/O
- once_cell for modern lazy initialization
- indicatif for progress bars
- rustc-demangle for symbol demangling
- Implement memory mapping with fallback in Pipeline:
- Attempt memmap2::Mmap first
- Fall back to std::fs::read() on failure
- Log fallback for user awareness
- Migrate all lazy_static usage to once_cell::sync::Lazy
- Integrate indicatif progress indicators in Pipeline
- Add benchmarks for performance validation
- Update documentation with performance characteristics
**Out of Scope:**
- Pipeline implementation (separate ticket, but this ticket provides the tools)
- Classification implementation (separate ticket, but this ticket provides regex caching)
### In Scope
- Add dependencies to Cargo.toml:
- memmap2 for memory-mapped file I/O
- once_cell for modern lazy initialization
- indicatif for progress bars
- rustc-demangle for symbol demangling
- Implement memory mapping with fallback in Pipeline:
- Attempt memmap2::Mmap first
- Fall back to std::fs::read() on failure
- Log fallback for user awareness
- Migrate all lazy_static usage to once_cell::sync::Lazy
- Integrate indicatif progress indicators in Pipeline
- Add benchmarks for performance validation
- Update documentation with performance characteristics
### Out of Scope
- Pipeline implementation (separate ticket, but this ticket provides the tools)
- Classification implementation (separate ticket, but this ticket provides regex caching)
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

11-11: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


22-22: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🤖 Prompt for AI Agents
In `@project_plan/tickets/Add_performance_optimizations_and_dependency_updates.md`
around lines 9 - 28, The document uses bold labels "**In Scope:**" and "**Out of
Scope:**" instead of markdown headings; update the file to use proper markdown
headings (e.g., "## In Scope" and "## Out of Scope") and convert the subsection
lists under them to be clearly nested under those headings so the structure is
semantic and parsable; ensure the header level matches the document hierarchy
(use "###" for sublists if needed) and preserve all listed items (memmap2,
once_cell, indicatif, rustc-demangle, memory mapping fallback behavior,
lazy_static migration, indicatif integration, benchmarks, docs update, and the
out-of-scope notes) while removing the bolded label style.

Comment on lines +9 to +31
**In Scope:**

- Expand tests/fixtures/ with diverse test binaries:
- ELF binaries with various string types
- PE binaries with resources and imports
- Mach-O binaries with load commands
- Binaries with mangled symbols
- Create end-to-end integration tests:
- Test complete pipeline with each binary format
- Test all output formats (table, JSON, YARA)
- Test filtering combinations
- Test error scenarios (corrupted binaries, unsupported formats)
- Test edge cases (no strings, all strings filtered out)
- Add insta snapshot tests for output validation
- Add CLI integration tests
- Validate against all requirements from file:.kiro/specs/stringy-binary-analyzer/requirements.md
- Add performance benchmarks for complete pipeline

**Out of Scope:**

- Unit tests for individual components (handled in component tickets)
- Implementation of components (separate tickets)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use headings instead of bold labels for section titles.

Lines 9–10 and 27–28 use bold text as headings, which violates markdownlint MD036.

Suggested fix
-**In Scope:**
+### In Scope

-**Out of Scope:**
+### Out of Scope
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
**In Scope:**
- Expand tests/fixtures/ with diverse test binaries:
- ELF binaries with various string types
- PE binaries with resources and imports
- Mach-O binaries with load commands
- Binaries with mangled symbols
- Create end-to-end integration tests:
- Test complete pipeline with each binary format
- Test all output formats (table, JSON, YARA)
- Test filtering combinations
- Test error scenarios (corrupted binaries, unsupported formats)
- Test edge cases (no strings, all strings filtered out)
- Add insta snapshot tests for output validation
- Add CLI integration tests
- Validate against all requirements from file:.kiro/specs/stringy-binary-analyzer/requirements.md
- Add performance benchmarks for complete pipeline
**Out of Scope:**
- Unit tests for individual components (handled in component tickets)
- Implementation of components (separate tickets)
### In Scope
- Expand tests/fixtures/ with diverse test binaries:
- ELF binaries with various string types
- PE binaries with resources and imports
- Mach-O binaries with load commands
- Binaries with mangled symbols
- Create end-to-end integration tests:
- Test complete pipeline with each binary format
- Test all output formats (table, JSON, YARA)
- Test filtering combinations
- Test error scenarios (corrupted binaries, unsupported formats)
- Test edge cases (no strings, all strings filtered out)
- Add insta snapshot tests for output validation
- Add CLI integration tests
- Validate against all requirements from file:.kiro/specs/stringy-binary-analyzer/requirements.md
- Add performance benchmarks for complete pipeline
### Out of Scope
- Unit tests for individual components (handled in component tickets)
- Implementation of components (separate tickets)
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

11-11: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


22-22: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🤖 Prompt for AI Agents
In `@project_plan/tickets/End-to-end_integration_testing_and_validation.md` around
lines 9 - 31, Replace the bolded section labels "**In Scope:**" and "**Out of
Scope:**" with proper Markdown headings (for example "## In Scope" and "## Out
of Scope") in the document so they are real headings instead of emphasized text;
update the two occurrences matching the literal phrases "In Scope:" and "Out of
Scope:" in the file to use heading syntax and keep the existing lists and
content unchanged to satisfy markdownlint MD036.

Comment on lines +9 to +21
**In Scope:**

- Add `original_text: Option<String>` field to preserve mangled symbols
- Add optional breakdown fields: `section_weight: Option<i32>`, `semantic_boost: Option<i32>`, `noise_penalty: Option<i32>`
- Update serde serialization to handle new fields correctly
- Update existing tests to account for new fields
- Update documentation with field descriptions

**Out of Scope:**

- Actual demangling logic (handled in separate ticket)
- Score calculation logic (handled in separate ticket)
- CLI --debug flag implementation (handled in pipeline ticket)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use headings instead of bold labels for section titles.

Lines 9–10 and 17–21 use bold text as headings, which violates markdownlint MD036.

Suggested fix
-**In Scope:**
+### In Scope

-**Out of Scope:**
+### Out of Scope
🤖 Prompt for AI Agents
In
`@project_plan/tickets/Extend_FoundString_data_model_for_demangling_and_debug_support.md`
around lines 9 - 21, Replace the bold label lines "**In Scope:**" and "**Out of
Scope:**" with proper markdown headings (e.g., "## In Scope" and "## Out of
Scope") so they conform to markdownlint MD036; update the surrounding lists
under the headings as needed to preserve the same content and spacing and ensure
serialization/tests/docs references to these sections still read the new heading
text.

Comment on lines +9 to +36
**In Scope:**

- Create file:src/output/mod.rs with OutputFormat enum and format_output() function
- Create file:src/output/table.rs:
- TTY detection using std::io::IsTerminal
- Table formatting with columns: String | Tags | Score | Section
- Non-TTY plain text output (one string per line)
- Primary tag display with comma-separation for same-priority tags
- Create file:src/output/json.rs:
- JSONL output (one JSON object per line)
- Include all FoundString fields
- Include original_text if present
- Include breakdown fields only if populated
- Create file:src/output/yara.rs:
- Complete YARA rule template generation
- Binary filename sanitization for rule names
- String escaping for YARA syntax
- Skip strings over 200 chars with comment
- Include metadata section (description, tool, date)
- Add comprehensive unit tests for each formatter
- Add integration tests with insta snapshots

**Out of Scope:**

- CLI integration (separate ticket)
- Summary statistics formatting (handled in pipeline ticket)
- Progress feedback (separate ticket)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use headings instead of bold labels for section titles.

Lines 9–10 and 31–35 use bold text as headings, which violates markdownlint MD036.

Suggested fix
-**In Scope:**
+### In Scope

-**Out of Scope:**
+### Out of Scope
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
**In Scope:**
- Create file:src/output/mod.rs with OutputFormat enum and format_output() function
- Create file:src/output/table.rs:
- TTY detection using std::io::IsTerminal
- Table formatting with columns: String | Tags | Score | Section
- Non-TTY plain text output (one string per line)
- Primary tag display with comma-separation for same-priority tags
- Create file:src/output/json.rs:
- JSONL output (one JSON object per line)
- Include all FoundString fields
- Include original_text if present
- Include breakdown fields only if populated
- Create file:src/output/yara.rs:
- Complete YARA rule template generation
- Binary filename sanitization for rule names
- String escaping for YARA syntax
- Skip strings over 200 chars with comment
- Include metadata section (description, tool, date)
- Add comprehensive unit tests for each formatter
- Add integration tests with insta snapshots
**Out of Scope:**
- CLI integration (separate ticket)
- Summary statistics formatting (handled in pipeline ticket)
- Progress feedback (separate ticket)
### In Scope
- Create file:src/output/mod.rs with OutputFormat enum and format_output() function
- Create file:src/output/table.rs:
- TTY detection using std::io::IsTerminal
- Table formatting with columns: String | Tags | Score | Section
- Non-TTY plain text output (one string per line)
- Primary tag display with comma-separation for same-priority tags
- Create file:src/output/json.rs:
- JSONL output (one JSON object per line)
- Include all FoundString fields
- Include original_text if present
- Include breakdown fields only if populated
- Create file:src/output/yara.rs:
- Complete YARA rule template generation
- Binary filename sanitization for rule names
- String escaping for YARA syntax
- Skip strings over 200 chars with comment
- Include metadata section (description, tool, date)
- Add comprehensive unit tests for each formatter
- Add integration tests with insta snapshots
### Out of Scope
- CLI integration (separate ticket)
- Summary statistics formatting (handled in pipeline ticket)
- Progress feedback (separate ticket)
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

11-11: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


22-22: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


33-33: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🤖 Prompt for AI Agents
In
`@project_plan/tickets/Implement_enum-based_output_formatters_`(Table,_JSON,_YARA).md
around lines 9 - 36, Replace the bold labels "**In Scope:**" and "**Out of
Scope:**" with proper markdown headings (e.g., "## In Scope" and "## Out of
Scope") so the document uses headings instead of bold text for section titles;
update the file's sections where the enum-based output formatter plan defines
scope and out-of-scope items (the sections currently titled "In Scope:" and "Out
of Scope:") and ensure any similar bolded section labels in that file are
converted to appropriate heading levels for compliance with MD036.

Comment on lines +171 to +180
/// Known valid Windows path prefixes
static ref KNOWN_WINDOWS_PREFIXES: std::collections::HashSet<&'static str> = {
let mut set = std::collections::HashSet::new();
set.insert("C:\\Windows\\");
set.insert("C:\\Program Files\\");
set.insert("C:\\Program Files (x86)\\");
set.insert("C:\\Users\\");
set.insert("C:\\ProgramData\\");
set
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider: KNOWN_WINDOWS_PREFIXES only covers C: drive.

The set only includes C:\ prefixed paths. Paths on other drives (D:, E:) won't match known prefixes. This is likely acceptable since it's used for heuristic validation (allowing longer unknown paths), not rejection. Verify this is the intended behavior.

🤖 Prompt for AI Agents
In `@src/classification/semantic.rs` around lines 171 - 180,
KNOWN_WINDOWS_PREFIXES currently only lists C:\ paths so paths on other drive
letters won't match; update the heuristic by either replacing the fixed HashSet
with a drive-letter pattern check (e.g., test for a leading "[A-Za-z]:\" prefix)
or expand the set to include other drive letters (generate "D:\\", "E:\\",
etc.), and update any callers that rely on KNOWN_WINDOWS_PREFIXES accordingly
(reference: KNOWN_WINDOWS_PREFIXES in src/classification/semantic.rs).

Comment on lines +829 to +842
fn contains_printf_placeholder(&self, text: &str) -> bool {
let mut chars = text.chars().peekable();

while let Some(ch) = chars.next() {
if ch == '%'
&& let Some(next) = chars.peek()
&& matches!(next, 's' | 'd' | 'x' | 'o' | 'u' | 'f')
{
return true;
}
}

false
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Printf placeholder detection is conservative.

The pattern only matches %s, %d, %x, %o, %u, %f. This misses %i, %p, %n, %c, %e, %g, %ld, %lld, etc. For a false-positive-reduction focus, this conservative approach is reasonable, but consider expanding to at least include %p (pointer) and %n (dangerous format string specifier) which are security-relevant.

Optional: Expand printf specifiers
 fn contains_printf_placeholder(&self, text: &str) -> bool {
     let mut chars = text.chars().peekable();

     while let Some(ch) = chars.next() {
         if ch == '%'
             && let Some(next) = chars.peek()
-            && matches!(next, 's' | 'd' | 'x' | 'o' | 'u' | 'f')
+            && matches!(next, 's' | 'd' | 'x' | 'o' | 'u' | 'f' | 'i' | 'p' | 'n' | 'c')
         {
             return true;
         }
     }

     false
 }
🤖 Prompt for AI Agents
In `@src/classification/semantic.rs` around lines 829 - 842, The
contains_printf_placeholder function currently only detects a small set of
single-letter specifiers; update it to also catch security-relevant specifiers
like %p and %n (and optionally other common specifiers such as %i, %c, %e, %g
and length-augmented forms like %ld/%lld). Modify contains_printf_placeholder
to, after seeing '%', consume any flags/width/precision/length tokens (or switch
to a small regex) and then check the final specifier character(s) for 'p' or 'n'
(and the additional letters you want to support) instead of only peeking one
char; reference the contains_printf_placeholder function to locate and change
the '%' handling logic. Ensure the function still returns true on detection and
false otherwise.

unclesp1d3r added a commit that referenced this pull request Feb 25, 2026
…indows, and registry paths (#121)

* feat(docs): add AI agent guidelines and character usage policy

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* chore(docs): revise AI agent guidelines for clarity and rules

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* chore(docs): update module structure formatting in documentation

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* feat(classification): implement file path classification for POSIX and Windows

- Added POSIX and Windows file path pattern matching
- Included registry path detection
- Comprehensive unit and integration tests added

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* feat(classification): enhance path and registry detection
- Implement POSIX, Windows, and UNC path classification
- Add support for Windows registry path detection
- Update documentation and tests for new capabilities

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* chore: minor docs and test adjustments

- Fix MSRV reference from 1.91+ to 1.85+ in copilot instructions
- Reorganize classification docs for clarity
- Increase classification test timeout for CI reliability


* fix(classification): address code review feedback on path classification performance and consistency (#122)

* Initial plan

* fix: address code review feedback on path classification

- Clarify Windows path separator validation in docs
- Optimize registry path matching with zero-allocation case-insensitive search
- Prevent duplicate Tag::FilePath entries
- Add comment explaining 500ms performance test timeout

Co-authored-by: unclesp1d3r <251112+unclesp1d3r@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: unclesp1d3r <251112+unclesp1d3r@users.noreply.github.com>

* chore: add comprehensive codebase analysis documentation

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* chore: add CodeRabbit configuration file for project setup

This configuration file includes general settings, review guidelines, path filters, and instructions for various modules to streamline code reviews and maintain project standards.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* chore: improve formatting and readability in codebase analysis

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* chore: update formatting in copilot instructions

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* chore: update Cargo.toml and codebase_analysis.md formatting

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* chore: refresh task list to reflect current implementation state

* chore: add documentation for core flows and technical plan

- Introduced detailed documentation for core user interactions in Stringy v1.0.
- Added technical plan outlining architecture, design decisions, and integration strategies.
- Included performance optimization ticket and end-to-end testing strategy.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* chore: add MSRV check to CI workflow

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* chore: update character restrictions in copilot instructions

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* chore: update documentation and improve formatting

- Adjusted numbering format in user interaction flows for consistency.
- Enhanced clarity in the Epic Brief by refining problem statements.
- Improved formatting in the Technical Plan for better readability.
- Updated performance optimization ticket to reflect new dependencies.
- Enhanced clarity in the pipeline orchestration ticket scope.
- Refined semantic classification ticket to include additional patterns.
- Improved integration testing ticket to cover more binary types.
- Updated FoundString data model ticket for clarity on new fields.
- Enhanced CLI argument parsing ticket to include all flags.
- Improved output formatters ticket to clarify scope and dependencies.
- Refined ranking system ticket to ensure clarity on scoring logic.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* chore: update directory structure path in analysis

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* chore: update Cargo.toml and rust-toolchain for Rust 1.91

- Bump rust-version in Cargo.toml to 1.91
- Update rust-toolchain.toml to use channel 1.91.0
- Enhance semantic classification for Windows paths with case-insensitivity and additional validation checks
- Add new test file for let chains example

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

---------

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement File Path Classification for POSIX, Windows, and Registry Paths

3 participants