feat(classification): implement file path classification for POSIX, Windows, and registry paths#121
Conversation
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
…d Windows - Added POSIX and Windows file path pattern matching - Included registry path detection - Comprehensive unit and integration tests added Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Implement POSIX, Windows, and UNC path classification - Add support for Windows registry path detection - Update documentation and tests for new capabilities Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Fix MSRV reference from 1.91+ to 1.85+ in copilot instructions - Reorganize classification docs for clarity - Increase classification test timeout for CI reliability Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. 📝 Walkthrough
WalkthroughAdds robust POSIX/Windows/UNC file-path and Windows registry detection to the SemanticClassifier (regex cache, validation and suspiciousness heuristics, new public classify_/is_* helpers). Integrates these into classify(), adds extensive integration tests, updates docs/examples, and introduces numerous onboarding, planning, and CI/configuration files. Changes
Sequence Diagram(s)(Skipped) Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5 | ❌ 3❌ Failed checks (3 warnings)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Pull request overview
This PR makes three minor improvements to the codebase: fixes the MSRV reference from an incorrect 1.91+ to the correct 1.85+, reorganizes classification documentation for better clarity, and increases the classification performance test timeout from 100ms to 500ms to improve CI reliability.
Changes:
- Updated MSRV references in AGENTS.md and copilot-instructions.md from 1.91+ to 1.85+
- Reorganized and expanded classification documentation with clearer explanations of patterns, validation rules, and implementation details
- Increased classification test timeout to 500ms to accommodate slower CI environments
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| AGENTS.md | New AI agent guidelines file documenting project standards and development patterns |
| CLAUDE.md | Pointer file referencing AGENTS.md for Claude Code integration |
| .claude/settings.json | Claude Code plugin configuration |
| .github/copilot-instructions.md | Comprehensive Copilot instructions including corrected MSRV of 1.85+ |
| .kiro/specs/stringy-binary-analyzer/tasks.md | Updated task tracking to mark path classification as complete |
| docs/src/classification.md | Reorganized documentation with expanded details on patterns, validation, and implementation |
| src/classification/mod.rs | Updated module documentation to reflect current capabilities |
| src/classification/semantic.rs | Large file (1530 lines) implementing path and registry classification with comprehensive tests |
| tests/classification_integration.rs | New integration test file with performance tests and snapshot testing |
| tests/snapshots/classification_integration__classification_snapshots.snap | Snapshot file for classification tests |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if self.classify_posix_path(&string.text).is_some() | ||
| || self.classify_windows_path(&string.text).is_some() | ||
| || self.classify_unc_path(&string.text).is_some() | ||
| { | ||
| tags.push(Tag::FilePath); | ||
| } | ||
|
|
There was a problem hiding this comment.
The documentation shows path classification logic that would prevent duplicate Tag::FilePath entries (lines 139-143), while the actual implementation (semantic.rs lines 383-394) could theoretically add Tag::FilePath multiple times if a string matched multiple path patterns. While path patterns should be mutually exclusive in practice, the documentation's approach is more defensive. Consider either updating the implementation to match the documentation's pattern, or updating the documentation to reflect the actual implementation.
| if self.classify_posix_path(&string.text).is_some() | |
| || self.classify_windows_path(&string.text).is_some() | |
| || self.classify_unc_path(&string.text).is_some() | |
| { | |
| tags.push(Tag::FilePath); | |
| } | |
| if self.classify_posix_path(&string.text).is_some() { | |
| tags.push(Tag::FilePath); | |
| } | |
| if self.classify_windows_path(&string.text).is_some() { | |
| tags.push(Tag::FilePath); | |
| } | |
| if self.classify_unc_path(&string.text).is_some() { | |
| tags.push(Tag::FilePath); | |
| } |
| } | ||
| let elapsed = start.elapsed(); | ||
|
|
||
| assert!(elapsed < Duration::from_millis(500)); |
There was a problem hiding this comment.
The test timeout has been increased from 100ms to 500ms (5x increase) for CI reliability. While this may help with flaky tests on slower CI runners, consider whether this masks a performance issue. The test processes 1050 samples (350 iterations × 3 samples each), and taking up to 500ms for this workload may indicate inefficient regex matching. Consider profiling the classification performance or adding a note about expected performance characteristics.
|
@copilot open a new pull request to apply changes based on the comments in this thread Run |
|
@unclesp1d3r I've opened a new pull request, #122, to work on those changes. Once the pull request is ready, I'll request review from you. |
…ion performance and consistency (#122) * Initial plan * fix: address code review feedback on path classification - Clarify Windows path separator validation in docs - Optimize registry path matching with zero-allocation case-insensitive search - Prevent duplicate Tag::FilePath entries - Add comment explaining 500ms performance test timeout Co-authored-by: unclesp1d3r <251112+unclesp1d3r@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: unclesp1d3r <251112+unclesp1d3r@users.noreply.github.com>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
This configuration file includes general settings, review guidelines, path filters, and instructions for various modules to streamline code reviews and maintain project standards. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In @.coderabbit.yml:
- Around line 254-256: Add the rust-version field to the Cargo.toml [package]
section to enforce the documented MSRV: add rust-version = "1.85" under the
existing [package] block (alongside fields like edition = "2024") so Cargo will
require Rust 1.85+; update the Cargo.toml [package] metadata and ensure the
value exactly matches "1.85".
In @.github/copilot-instructions.md:
- Around line 9-11: The fenced code block containing "Binary → Format Detection
(goblin) → Container Parsing → String Extraction → Deduplication →
Classification → Ranking → Output" needs a language specifier to satisfy MD040;
update the block in .github/copilot-instructions.md by changing the fence from
``` to a language-tagged fence (e.g., ```text) so the code block becomes ```text
... ``` and retains the same inner content.
- Around line 71-78: The markdown code block for the "Combined scoring formula"
is missing a language specifier; update the triple-backtick fence for that block
in the Combined scoring formula section to include a language tag (e.g., text)
so the block reads ```text and preserves plain formatting; locate the block by
the header "Combined scoring formula" or by the formula lines starting with
"base_score = max(occurrence.original_score)" and change the opening fence to
include the language specifier.
- Line 28: Replace the non-ASCII less-than-or-equal character '≤' in the
markdown heading "**File size limit**: Keep files ≤500-600 lines; split larger
files into focused modules" with an ASCII alternative (e.g., "<=" or the words
"no more than") to comply with the ASCII-only guideline; update the sentence to
read "**File size limit**: Keep files <=500-600 lines; split larger files into
focused modules" or "**File size limit**: Keep files no more than 500-600 lines;
split larger files into focused modules".
In `@src/classification/semantic.rs`:
- Around line 855-859: The code uppercases the input into root_upper via
to_ascii_uppercase() and then calls eq_ignore_ascii_case in the any() closure;
replace the redundant case-insensitive comparison with a simple equality check
against the already-uppercased string (change the any(|valid|
valid.eq_ignore_ascii_case(&root_upper)) usage to a direct equality comparison
such as any(|valid| valid == &root_upper) or equivalent), referencing root_upper
and VALID_REGISTRY_ROOTS and removing eq_ignore_ascii_case.
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Fix all issues with AI agents
In @.github/copilot-instructions.md:
- Around line 9-11: The documentation line containing "Binary → Format Detection
(goblin) → Container Parsing → String Extraction → Deduplication →
Classification → Ranking → Output" (and any similar occurrences such as the one
noted at 30-30) uses non-ASCII punctuation; replace all Unicode arrows (→) with
ASCII "->" and any em/en dashes (—/–) with a simple hyphen "-" so the file
(.github/copilot-instructions.md) is ASCII-only; search for the string "Binary →
Format Detection" and similar pipeline lines to update them consistently.
In `@Cargo.toml`:
- Line 5: CI is currently testing with dtolnay/rust-toolchain@1.91.0 while
Cargo.toml declares rust-version = "1.85", and rust-toolchain.toml uses channel
= "stable"; update the GitHub workflow(s) to include Rust 1.85 (either add a
matrix entry or a separate job that pins toolchain to 1.85) so the MSRV is
actually compiled in CI, and either pin rust-toolchain.toml to toolchain =
"1.85.0" (or a specific patch) or remove it and rely on the explicit CI
configuration to avoid channel drift.
In `@codebase_analysis.md`:
- Around line 30-32: The directory tree shows a machine-specific Windows path
"D:\Stringy\" which breaks portability; update the repo root in
codebase_analysis.md to a platform-neutral repo-relative root (e.g., "./" or
simply the project directory name) so the tree is portable across environments
and remove the Windows-specific backslash path string "D:\Stringy\" from the
document.
In `@src/classification/semantic.rs`:
- Around line 812-839: The Windows path validator is missing a rejection for
control characters and the registry validator lacks the 4096-length cap; update
is_valid_windows_path to immediately reject any presence of control chars (such
as '\0', '\n', '\r', and other ASCII C0 controls) before performing prefix/regex
checks, and add the same control-char check and a maximum length check (4096) to
the registry validation function (the function around lines 842-869, e.g.,
is_valid_registry_path or similarly named). Implement these checks as simple
early returns using a helper like contains_control_chars(text) or inline byte
scans so regex/prefix matching cannot be fooled by embedded controls; keep the
existing printf-placeholder and prefix logic afterwards.
- Around line 684-701: The classify_unc_path function currently returns
Some(Tag::FilePath) after only a regex check and non-empty server/share; extend
it to perform robust validation: after UNC_PATH_REGEX and extracting
server/share (in classify_unc_path), enforce max-length limits (e.g., <=255 per
component and overall reasonable limit), reject control characters (chars <=
0x1F or 0x7F), reject forward slashes '/' and printf-style placeholders ('%'
followed by format specifiers) anywhere in server or share, reject reserved
names or components consisting solely of dots, and ensure there are no empty
segments (detect double '\\' after the initial two by iterating over parts from
trimmed.split('\\') to confirm every segment is non-empty and valid). Only
return Some(Tag::FilePath) if all these extra validations pass; otherwise return
None.
- Around line 726-735: The path checks in is_suspicious_windows_path are
currently case-sensitive; make them case-insensitive by comparing lowercase
forms: compute a lowered version of text (e.g., text.to_ascii_lowercase()) and
for each SUSPICIOUS_WINDOWS_PATHS entry use its lowered form
(prefix.to_ascii_lowercase()) and then perform the same logic (if prefix starts
with '\\' use lowered_text.contains(lowered_prefix) else
lowered_text.starts_with(lowered_prefix)). Update references to
SUSPICIOUS_WINDOWS_PATHS and the starts_with/contains calls accordingly in
is_suspicious_windows_path.
- Around line 113-121: The two static regexes REGISTRY_PATH_REGEX and
REGISTRY_ABBREV_REGEX only accept uppercase roots, causing lowercase/mixed-case
registry roots to be rejected before is_valid_registry_path() runs; update both
Regex::new patterns to include the case-insensitive inline flag (?i) (e.g.
r"(?i)^HKEY_[A-Z_]+\\[^\x00\n\r]*" and r"(?i)^HK(LM|CU|CR|U|CC)\\[^\x00\n\r]*")
so they match roots case-insensitively while preserving the rest of the pattern
and behavior.
| name = "stringy" | ||
| version = "0.1.0" | ||
| edition = "2024" | ||
| rust-version = "1.85" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check rust-version in Cargo.toml
echo "=== Cargo.toml rust-version ==="
rg -n 'rust-version' Cargo.toml
# Find and display rust-toolchain files
echo -e "\n=== rust-toolchain files ==="
fd -a 'rust-toolchain' -x cat {}
# Check for .github/workflows for rust version pins
echo -e "\n=== GitHub CI configurations ==="
fd -a '\.yml$' '.github/workflows' 2>/dev/null | head -5 | xargs -I {} sh -c 'echo "File: {}"; rg "toolchain|rust-version|1\.8[0-9]|1\.9[0-9]" {} || true'
# Check for .circleci config
echo -e "\n=== CircleCI config ==="
fd 'config\.yml' '.circleci' 2>/dev/null | xargs -I {} sh -c 'echo "File: {}"; rg "toolchain|rust-version|1\.8[0-9]|1\.9[0-9]" {} || true'Repository: EvilBit-Labs/Stringy
Length of output: 951
CI toolchain must test against MSRV 1.85, not 1.91.0.
rust-version = "1.85" is correctly set, but the GitHub workflows are pinned to dtolnay/rust-toolchain@1.91.0. This creates a validation gap: MSRV compliance is never tested. The code could fail to compile on Rust 1.85 and the CI would not catch it. Additionally, rust-toolchain.toml uses an unpinned channel = "stable", which drifts over time.
Update CI workflows to test against 1.85 (add a matrix job or separate workflow), and pin rust-toolchain.toml to a fixed version or remove it in favor of explicit CI configuration.
🤖 Prompt for AI Agents
In `@Cargo.toml` at line 5, CI is currently testing with
dtolnay/rust-toolchain@1.91.0 while Cargo.toml declares rust-version = "1.85",
and rust-toolchain.toml uses channel = "stable"; update the GitHub workflow(s)
to include Rust 1.85 (either add a matrix entry or a separate job that pins
toolchain to 1.85) so the MSRV is actually compiled in CI, and either pin
rust-toolchain.toml to toolchain = "1.85.0" (or a specific patch) or remove it
and rely on the explicit CI configuration to avoid channel drift.
- Introduced detailed documentation for core user interactions in Stringy v1.0. - Added technical plan outlining architecture, design decisions, and integration strategies. - Included performance optimization ticket and end-to-end testing strategy. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@project_plan/specs/Core_Flows__Stringy_v1.0_User_Interactions.md`:
- Around line 85-93: The markdown has multiple fenced code blocks missing
language identifiers (e.g., the table starting with "String
Tags Score Section", the block containing
"https://malicious-c2.example.com/api", the YARA rule "rule binary_strings", and
the JSON blocks already noted), which prevents proper syntax highlighting;
update each fenced block in Core_Flows__Stringy_v1.0_User_Interactions.md to
include an appropriate language tag—use "text" for plain output/table blocks
(the String/Tags/Score table and the URL/plain output and summary blocks),
"json" for JSON blocks, and "yara" for YARA rules like "rule binary_strings"—so
every triple-backtick fence becomes ```text, ```json, or ```yara as appropriate.
In `@project_plan/specs/Epic_Brief__Stringy_v1.0_Completion.md`:
- Line 48: The heading "CLI Interface" is tautological; update the spec heading
to use either "CLI" or "Command-line interface" (not both) and ensure the
description beneath (file: src/main.rs) still lists the missing options
(--min-len, --enc, --only-tags, --notags, --top, --json) without the redundant
wording; specifically replace the heading "CLI Interface" with "CLI" or
"Command-line interface" and run a quick pass to remove any other instances of
the redundant phrase in this section.
In `@project_plan/specs/Technical_Plan__Stringy_v1.0_Implementation.md`:
- Around line 99-123: Update the fenced code blocks so they include language
specifiers: change the diagram block to start with ```mermaid and change all
Rust examples to start with ```rust (for example the blocks containing pub
struct FoundString and the other Rust snippets referenced in the document).
Locate the mermaid sequenceDiagram block and each Rust code snippet (e.g.,
declarations and examples like FoundString and the functions in the Rust
examples) and add the appropriate language identifier to the opening
triple-backtick fence to enable proper syntax highlighting.
- Around line 1-510: The SemanticClassifier implementation in
src/classification/semantic.rs is 1542 lines and must be split into a module
with submodules to respect the 500-line limit; refactor by creating
src/classification/semantic/mod.rs that exposes pub struct SemanticClassifier
and pub fn classify(...) and move logical groups of functions/regexes into new
submodules network.rs, filesystem.rs, identifiers.rs, and encoding.rs (each
implementing small helper functions called by SemanticClassifier::classify),
migrate the once_cell regex caches into the appropriate submodule scopes, add
mod declarations and pub use in semantic/mod.rs to re-export types and
functions, and update any imports that referenced symbols in the old file to
point to semantic::... so compilation and behavior (including SymbolDemangler
and RankingEngine callers) remain unchanged.
In `@project_plan/tickets/Implement_CLI_argument_parsing_and_validation.md`:
- Line 5: Replace the redundant phrase "CLI interface" in the ticket description
(the line containing "Complete the CLI interface with all filtering flags,
output format selection, and proper validation.") with a simplified wording such
as "Complete the CLI with all filtering flags, output format selection, and
proper validation." to remove the duplication; update the markdown heading/text
accordingly wherever that exact phrase appears in the document.
| ``` | ||
| String Tags Score Section | ||
| https://malicious-c2.example.com/api url 95 .rdata | ||
| C:\Windows\System32\kernel32.dll filepath 88 .rdata | ||
| core::fmt::Display::fmt export 85 .text | ||
| 192.168.1.100 ipv4 82 .rodata | ||
| {A1B2C3D4-E5F6-7890-ABCD-EF1234567890} guid 80 .data | ||
| ``` | ||
|
|
There was a problem hiding this comment.
Add language identifiers to code blocks for proper rendering.
Multiple code blocks are missing language identifiers, which prevents proper syntax highlighting and may cause rendering issues. Examples, command outputs, and JSON blocks should specify their language.
📝 Suggested fixes
For example output blocks (like lines 85-93), add "text":
-```
+```text
String Tags Score SectionFor JSON blocks (like lines 176-179), ensure "json" is specified (already correct in the doc).
For YARA output (lines 213-230), add "yara":
-```
+```yara
rule binary_strings {For plain output (lines 269-275), add "text":
-```
+```text
https://malicious-c2.example.com/apiFor summary output (lines 301-311), add "text":
-```
+```text
[Results table here]Based on static analysis hints from markdownlint-cli2.
Also applies to: 176-179, 213-230, 269-275, 301-311
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
85-85: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In `@project_plan/specs/Core_Flows__Stringy_v1.0_User_Interactions.md` around
lines 85 - 93, The markdown has multiple fenced code blocks missing language
identifiers (e.g., the table starting with "String
Tags Score Section", the block containing
"https://malicious-c2.example.com/api", the YARA rule "rule binary_strings", and
the JSON blocks already noted), which prevents proper syntax highlighting;
update each fenced block in Core_Flows__Stringy_v1.0_User_Interactions.md to
include an appropriate language tag—use "text" for plain output/table blocks
(the String/Tags/Score table and the URL/plain output and summary blocks),
"json" for JSON blocks, and "yara" for YARA rules like "rule binary_strings"—so
every triple-backtick fence becomes ```text, ```json, or ```yara as appropriate.
|
|
||
| 4. **Output Formatting** (file:src/output/): Only basic interfaces exist. Users cannot get JSONL for automation, human-readable tables for manual review, or YARA-friendly output for rule creation. | ||
|
|
||
| 5. **CLI Interface** (file:src/main.rs): Missing filtering options (--min-len, --enc, --only-tags, --notags, --top) and output format selection (--json). |
There was a problem hiding this comment.
Tautological phrasing (“CLI Interface”).
“CLI” already means “Command Line Interface.” Consider “CLI” or “command-line interface” to avoid redundancy.
🧰 Tools
🪛 LanguageTool
[style] ~48-~48: This phrase is redundant (‘I’ stands for ‘Interface’). Use simply “CLIInterface”.
Context: ...riendly output for rule creation. 5. CLI Interface (file:src/main.rs): Missing filtering...
(ACRONYM_TAUTOLOGY)
🤖 Prompt for AI Agents
In `@project_plan/specs/Epic_Brief__Stringy_v1.0_Completion.md` at line 48, The
heading "CLI Interface" is tautological; update the spec heading to use either
"CLI" or "Command-line interface" (not both) and ensure the description beneath
(file: src/main.rs) still lists the missing options (--min-len, --enc,
--only-tags, --notags, --top, --json) without the redundant wording;
specifically replace the heading "CLI Interface" with "CLI" or "Command-line
interface" and run a quick pass to remove any other instances of the redundant
phrase in this section.
| # Technical Plan: Stringy v1.0 Implementation | ||
|
|
||
| ## Overview | ||
|
|
||
| This technical plan defines the architecture for completing Stringy v1.0, building on the existing foundation of format detection, container parsing, and string extraction. The implementation adds semantic classification, intelligent ranking, symbol demangling, flexible output formatting, and CLI orchestration. | ||
|
|
||
| ## Architectural Approach | ||
|
|
||
| ### Core Design Decisions | ||
|
|
||
| **1. Pipeline Architecture** | ||
|
|
||
| The main execution flow will be encapsulated in a `Pipeline` struct that orchestrates the entire analysis process. This provides: | ||
|
|
||
| - Clear entry point for the analysis workflow | ||
| - Centralized error handling and recovery | ||
| - Testability through dependency injection | ||
| - Progress tracking integration | ||
|
|
||
| **Trade-off**: Struct-based approach adds a layer of abstraction but provides better testability and maintainability compared to functional composition in main(). | ||
|
|
||
| **2. Enum-Based Output Formatting** | ||
|
|
||
| Output formatters will use an enum-based approach with a single format() function that matches on the output type. This provides: | ||
|
|
||
| - Simplicity and directness for the 3 required formats | ||
| - Easy to understand and maintain | ||
| - No trait abstraction overhead | ||
| - Sufficient for current requirements | ||
|
|
||
| **Trade-off**: Less extensible than trait-based approach, but simpler and more appropriate for the limited number of formatters. Future formats can be added as enum variants. | ||
|
|
||
| **3. Memory-Mapped File I/O with Fallback** | ||
|
|
||
| File reading will attempt memory mapping first, with automatic fallback to regular file reading on failure. This provides: | ||
|
|
||
| - Efficient memory-mapped access for most cases | ||
| - Robustness for edge cases (network filesystems, locked files, platform limitations) | ||
| - Consistent behavior across all file sizes | ||
| - Zero-copy access when possible | ||
|
|
||
| **Trade-off**: Slightly more complex than always-on mmap, but handles real-world failure scenarios gracefully. | ||
|
|
||
| **4. Modern Regex Caching** | ||
|
|
||
| Migrate from `lazy_static` to `once_cell` for regex pattern caching. This provides: | ||
|
|
||
| - More modern, ergonomic API | ||
| - Better compile-time guarantees | ||
| - Consistent with Rust ecosystem trends | ||
| - Simpler initialization patterns | ||
|
|
||
| **Trade-off**: Requires dependency migration but improves code quality and maintainability. | ||
|
|
||
| **5. Rich Progress Feedback** | ||
|
|
||
| Use `indicatif` library for progress indicators. This provides: | ||
|
|
||
| - Professional progress bars and spinners | ||
| - Automatic TTY detection | ||
| - Minimal code for rich feedback | ||
| - Consistent user experience | ||
|
|
||
| **Trade-off**: Adds external dependency but provides significantly better UX than manual eprintln! calls. | ||
|
|
||
| ### Technical Constraints | ||
|
|
||
| **Codebase Standards:** | ||
|
|
||
| - No `unsafe` code (`#![forbid(unsafe_code)]` enforced) | ||
| - Zero warnings (`cargo clippy -- -D warnings` must pass) | ||
| - ASCII-only text (no Unicode punctuation or emojis) | ||
| - File size limit: 500 lines per file (split larger files) | ||
| - No blanket `#[allow]` attributes | ||
|
|
||
| **Module Organization:** | ||
|
|
||
| - file:src/classification/ - Semantic analysis and ranking | ||
| - file:src/output/ - Output formatters | ||
| - file:src/main.rs - CLI and pipeline orchestration | ||
| - file:src/lib.rs - Public API and re-exports | ||
|
|
||
| **Error Handling:** | ||
|
|
||
| - Use `thiserror` for all error types | ||
| - Include context (offsets, section names, file paths) | ||
| - Graceful degradation where possible | ||
| - Clear error messages for user-facing errors | ||
|
|
||
| ### Integration Strategy | ||
|
|
||
| The new components integrate with existing infrastructure: | ||
|
|
||
| 1. **Classification Integration**: New semantic patterns and symbol demangling extend existing `SemanticClassifier` in file:src/classification/semantic.rs | ||
| 2. **Ranking Integration**: New `RankingEngine` in src/classification/ranking.rs consumes `FoundString` objects with section weights from container parsers | ||
| 3. **Output Integration**: New formatters in file:src/output/ consume ranked `Vec<FoundString>` from pipeline | ||
| 4. **CLI Integration**: file:src/main.rs orchestrates all components through `Pipeline` struct | ||
|
|
||
| ```mermaid | ||
| sequenceDiagram | ||
| participant CLI as main.rs | ||
| participant Pipeline | ||
| participant Container as container/* | ||
| participant Extractor as extraction/* | ||
| participant Classifier as classification/* | ||
| participant Ranker as ranking.rs | ||
| participant Formatter as output/* | ||
|
|
||
| CLI->>Pipeline: new(config) | ||
| CLI->>Pipeline: run(file_path) | ||
| Pipeline->>Container: detect_format() & parse() | ||
| Container-->>Pipeline: ContainerInfo | ||
| Pipeline->>Extractor: extract(data, container_info) | ||
| Extractor-->>Pipeline: Vec<FoundString> | ||
| Pipeline->>Classifier: classify(strings) | ||
| Classifier->>Classifier: demangle_symbols() | ||
| Classifier-->>Pipeline: Vec<FoundString> with tags | ||
| Pipeline->>Ranker: calculate_scores(strings) | ||
| Ranker-->>Pipeline: Vec<FoundString> with scores | ||
| Pipeline->>Pipeline: apply_filters() & sort() | ||
| Pipeline->>Formatter: format(strings) | ||
| Formatter-->>CLI: Output to stdout | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Data Model | ||
|
|
||
| ### FoundString Enhancement | ||
|
|
||
| Extend the existing `FoundString` struct in file:src/types.rs to include score breakdown for transparency and debugging: | ||
|
|
||
| ```rust | ||
| pub struct FoundString { | ||
| // Existing fields | ||
| pub text: String, | ||
| pub encoding: Encoding, | ||
| pub offset: u64, | ||
| pub rva: Option<u64>, | ||
| pub section: Option<String>, | ||
| pub length: u32, | ||
| pub tags: Vec<Tag>, | ||
| pub score: i32, // Final calculated score | ||
| pub source: StringSource, | ||
| pub confidence: f32, | ||
|
|
||
| // New fields for symbol demangling | ||
| pub original_text: Option<String>, // Original mangled form (if demangled) | ||
|
|
||
| // Optional debug fields (only populated with --debug flag) | ||
| pub section_weight: Option<i32>, // Score from section type | ||
| pub semantic_boost: Option<i32>, // Bonus from semantic tags | ||
| pub noise_penalty: Option<i32>, // Penalty from noise detection | ||
| } | ||
| ``` | ||
|
|
||
| **Rationale**: | ||
|
|
||
| - `original_text` preserves the mangled symbol for cross-referencing and recovery | ||
| - Breakdown fields (section_weight, semantic_boost, noise_penalty) are optional to avoid exposing internal implementation details in the public API | ||
| - With --debug flag, breakdown fields are populated for debugging and validation | ||
| - Without --debug, breakdown fields remain None, keeping the API simple and flexible | ||
|
|
||
| ### Ranking Configuration | ||
|
|
||
| New configuration struct for ranking parameters in src/classification/ranking.rs: | ||
|
|
||
| ```rust | ||
| pub struct RankingConfig { | ||
| pub section_weights: HashMap<SectionType, i32>, | ||
| pub tag_boosts: HashMap<Tag, i32>, | ||
| pub noise_penalty_config: NoisePenaltyConfig, | ||
| } | ||
|
|
||
| pub struct NoisePenaltyConfig { | ||
| pub high_entropy_penalty: i32, | ||
| pub excessive_length_penalty: i32, | ||
| pub repeated_pattern_penalty: i32, | ||
| } | ||
| ``` | ||
|
|
||
| **Integration**: `RankingConfig` uses hardcoded sensible defaults. No user configuration is provided - the defaults are designed to work well across all use cases (malware analysis, reverse engineering, general analysis). | ||
|
|
||
| ### Filter Configuration | ||
|
|
||
| New configuration struct for CLI filtering in file:src/main.rs: | ||
|
|
||
| ```rust | ||
| pub struct FilterConfig { | ||
| pub min_length: Option<usize>, | ||
| pub encodings: Option<Vec<Encoding>>, | ||
| pub include_tags: Option<Vec<Tag>>, | ||
| pub exclude_tags: Option<Vec<Tag>>, | ||
| pub top_n: Option<usize>, | ||
| } | ||
| ``` | ||
|
|
||
| **Integration**: Built from CLI arguments, passed to Pipeline. Pipeline applies filters after ranking using iterator adapters. | ||
|
|
||
| ### Output Formatter Interface | ||
|
|
||
| Trait definition for output formatters in file:src/output/mod.rs: | ||
|
|
||
| ```rust | ||
| pub enum OutputFormat { | ||
| Table, | ||
| Json, | ||
| Yara, | ||
| } | ||
|
|
||
| pub struct OutputMetadata { | ||
| pub binary_name: String, | ||
| pub binary_format: BinaryFormat, | ||
| pub total_strings: usize, | ||
| pub filtered_strings: usize, | ||
| } | ||
|
|
||
| pub fn format_output( | ||
| format: OutputFormat, | ||
| strings: &[FoundString], | ||
| metadata: &OutputMetadata, | ||
| ) -> Result<String>; | ||
| ``` | ||
|
|
||
| **Rationale**: Enum-based design is simpler and more direct for the three required formats. The format_output function matches on the enum and delegates to format-specific logic. Metadata provides context for formatters to include summary information. | ||
|
|
||
| ### Tag Enum Extensions | ||
|
|
||
| Add new variants to existing `Tag` enum in file:src/types.rs with specificity levels: | ||
|
|
||
| ```rust | ||
| pub enum Tag { | ||
| // Existing tags (specific) | ||
| Url, Domain, IPv4, IPv6, FilePath, RegistryPath, | ||
| Import, Export, Version, Manifest, Resource, | ||
| DylibPath, Rpath, RpathVariable, FrameworkPath, | ||
|
|
||
| // New specific tags for v1.0 | ||
| Guid, // GUIDs/UUIDs (specific) | ||
| Email, // Email addresses (specific) | ||
| FormatString, // Printf-style format strings (specific) | ||
| UserAgent, // User agent strings (specific) | ||
| DemangledSymbol, // Demangled Rust/C++ symbols (specific) | ||
|
|
||
| // Broad/ambiguous tags | ||
| Base64, // Base64-encoded data (broad - many false positives) | ||
| } | ||
| ``` | ||
|
|
||
| **Tag Specificity**: Tags are categorized as specific (high confidence, low false positives) or broad (lower confidence, higher false positives). A string can have multiple tags. Specific tags like Email are prioritized over broad tags like Base64 in display and ranking. | ||
|
|
||
| **Integration**: New tags follow existing pattern. Classification logic in file:src/classification/semantic.rs will be extended to detect these patterns. False negatives are worse than false positives - we prefer to tag liberally. | ||
|
|
||
| --- | ||
|
|
||
| ## Component Architecture | ||
|
|
||
| ### 1. Ranking Engine (src/classification/ranking.rs) | ||
|
|
||
| **Purpose**: Calculate relevance scores for strings based on multiple factors. | ||
|
|
||
| **Interface**: | ||
|
|
||
| ```rust | ||
| pub struct RankingEngine { | ||
| config: RankingConfig, | ||
| } | ||
|
|
||
| impl RankingEngine { | ||
| pub fn new(config: RankingConfig) -> Self; | ||
| pub fn calculate_score(&self, string: &mut FoundString); | ||
| pub fn rank_strings(&self, strings: &mut [FoundString]); | ||
| } | ||
| ``` | ||
|
|
||
| **Responsibilities**: | ||
|
|
||
| - Apply section weight scoring based on `SectionType` | ||
| - Apply semantic boost scoring based on tags | ||
| - Calculate noise penalties from confidence scores | ||
| - Populate score breakdown fields | ||
| - Sort strings by final score | ||
|
|
||
| **Integration**: Consumes `FoundString` objects after classification, uses section weights from `ContainerInfo`, applies tag-based boosts. | ||
|
|
||
| **File Size**: Keep under 500 lines. If scoring logic exceeds limit, split into src/classification/ranking/mod.rs with submodules for section_weights.rs, semantic_boosts.rs, noise_penalties.rs. | ||
|
|
||
| ### 2. Symbol Demangling (src/classification/symbols.rs) | ||
|
|
||
| **Purpose**: Demangle Rust and C++ symbols to human-readable form while preserving original. | ||
|
|
||
| **Interface**: | ||
|
|
||
| ```rust | ||
| pub struct SymbolDemangler { | ||
| // Uses rustc-demangle crate | ||
| } | ||
|
|
||
| impl SymbolDemangler { | ||
| pub fn new() -> Self; | ||
| pub fn demangle(&self, string: &mut FoundString); | ||
| pub fn is_mangled(&self, symbol: &str) -> bool; | ||
| } | ||
| ``` | ||
|
|
||
| **Responsibilities**: | ||
|
|
||
| - Detect mangled Rust symbols (starts with `_ZN` or `_R`) | ||
| - Demangle using `rustc-demangle` crate | ||
| - Store original mangled form in `FoundString.original_text` | ||
| - Replace `FoundString.text` with demangled version | ||
| - Tag demangled symbols with `DemangledSymbol` tag | ||
| - Handle demangling failures gracefully (keep original text, no tag) | ||
|
|
||
| **Integration**: Called during classification phase. Processes strings with `Import` or `Export` tags. Modifies FoundString in-place, preserving original in original_text field. | ||
|
|
||
| **Dependency**: Add `rustc-demangle` to Cargo.toml. | ||
|
|
||
| ### 3. Semantic Classification Extensions (file:src/classification/semantic.rs) | ||
|
|
||
| **Purpose**: Extend existing classifier with new pattern detection. | ||
|
|
||
| **New Patterns**: | ||
|
|
||
| - **GUID**: Regex for standard GUID format `{XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}` (specific) | ||
| - **Email**: Regex with basic validation ([user@domain.tld](mailto:user@domain.tld)) (specific) | ||
| - **Base64**: Pattern detection for Base64-encoded data (length, character set) (broad - many false positives) | ||
| - **Format String**: Detection of printf-style format specifiers (%s, %d, %x, etc.) (specific) | ||
| - **User Agent**: Pattern matching for common user agent strings (specific) | ||
|
|
||
| **Tag Specificity Strategy**: | ||
|
|
||
| - Apply all matching patterns - a string can have multiple tags | ||
| - Prefer false positives over false negatives (better to tag liberally) | ||
| - Specific tags (Email, GUID, FormatString) have higher confidence | ||
| - Broad tags (Base64) are applied with lower confidence but still useful | ||
| - Tag priority for display handles showing most relevant tags first | ||
|
|
||
| **Integration**: Extend existing `SemanticClassifier::classify()` method with new pattern checks. Use `once_cell` for regex caching. | ||
|
|
||
| **File Size**: Current file is approaching 500 lines. If additions exceed limit, split into src/classification/semantic/mod.rs with submodules for network.rs, filesystem.rs, identifiers.rs, encoding.rs. | ||
|
|
||
| ### 4. Output Formatters (file:src/output/) | ||
|
|
||
| **Module Structure**: | ||
|
|
||
| - file:src/output/mod.rs - OutputFormat enum and format_output() function | ||
| - src/output/table.rs - Table formatting logic | ||
| - src/output/json.rs - JSONL formatting logic | ||
| - src/output/yara.rs - YARA rule generation logic | ||
|
|
||
| **Table Formatter** (src/output/table.rs): | ||
|
|
||
| - Detect TTY vs non-TTY output using `atty` or `std::io::IsTerminal` | ||
| - TTY: Format as aligned table with columns (String | Tags | Score | Section) | ||
| - Non-TTY: Output plain strings, one per line | ||
| - Handle long strings (show in full, terminal wraps) | ||
| - Show primary tags (comma-separated if multiple at same priority) | ||
|
|
||
| **JSON Formatter** (src/output/json.rs): | ||
|
|
||
| - Output JSONL (one JSON object per line) | ||
| - Include all `FoundString` fields (text, encoding, offset, rva, section, length, tags, score, source, confidence) | ||
| - Include original_text if present (demangled symbols) | ||
| - Include score breakdown fields only if populated (--debug mode) | ||
| - Proper escaping for JSON strings via serde_json | ||
|
|
||
| **YARA Formatter** (src/output/yara.rs): | ||
|
|
||
| - Generate complete YARA rule template | ||
| - Sanitize binary filename for rule name (replace non-alphanumeric with underscore, remove extension, add "_strings" suffix) | ||
| - Include metadata section (description, tool, date, file hash) | ||
| - Escape strings according to YARA syntax (backslashes, quotes, newlines) | ||
| - Skip strings over 200 characters with comment: "// Skipped: too long (N chars)" | ||
| - Include both `ascii` and `wide` modifiers for compatibility | ||
|
|
||
| **Integration**: Pipeline calls format_output() with selected OutputFormat enum variant. The function matches on the enum and delegates to the appropriate formatting module. | ||
|
|
||
| ### 5. Pipeline Orchestration (file:src/main.rs) | ||
|
|
||
| **Purpose**: Coordinate the entire analysis workflow including filtering. | ||
|
|
||
| **Structure**: | ||
|
|
||
| ```rust | ||
| pub struct Pipeline { | ||
| config: PipelineConfig, | ||
| progress: ProgressBar, // from indicatif | ||
| } | ||
|
|
||
| pub struct PipelineConfig { | ||
| extraction_config: ExtractionConfig, | ||
| ranking_config: RankingConfig, | ||
| filter_config: FilterConfig, | ||
| debug_mode: bool, | ||
| } | ||
|
|
||
| impl Pipeline { | ||
| pub fn new(config: PipelineConfig) -> Self; | ||
| pub fn run(&mut self, file_path: &Path) -> Result<Vec<FoundString>>; | ||
| } | ||
| ``` | ||
|
|
||
| **Workflow**: | ||
|
|
||
| 1. Display "Parsing..." progress indicator | ||
| 2. Attempt memory-map file using `memmap2`, fall back to `std::fs::read()` on failure | ||
| 3. Detect format and parse container (fail fast on error) | ||
| 4. Display "Extracting..." progress indicator | ||
| 5. Extract strings using `BasicExtractor` (fail fast on critical errors) | ||
| 6. Display "Classifying..." progress indicator | ||
| 7. Apply semantic classification (graceful degradation - skip failed strings) | ||
| 8. Apply symbol demangling (graceful degradation - keep original on failure) | ||
| 9. Display "Ranking..." progress indicator | ||
| 10. Calculate scores using `RankingEngine` (populate breakdown fields if debug_mode) | ||
| 11. Apply filters from FilterConfig (min-len, encoding, tags) | ||
| 12. Sort by score and apply --top limit | ||
| 13. Format output using selected OutputFormat enum | ||
| 14. Write to stdout | ||
|
|
||
| **Error Handling Strategy**: | ||
|
|
||
| *Critical Stages (fail fast):* | ||
|
|
||
| - File access: Exit with error if file not found or cannot be read | ||
| - Format detection: Exit with error if format unsupported | ||
| - Container parsing: Exit with error if binary is corrupted or invalid | ||
|
|
||
| *Optional Stages (graceful degradation):* | ||
|
|
||
| - Classification: If classification fails on individual strings, skip those strings and continue | ||
| - Symbol demangling: If demangling fails, keep original text and continue | ||
| - Ranking: If ranking fails, output unranked strings with warning | ||
| - Formatting: If primary formatter fails, attempt plain text fallback | ||
|
|
||
| *Recovery Strategy*: | ||
|
|
||
| - Memory mapping failure: Automatically fall back to regular file reading | ||
| - Partial results: If some strings are processed successfully, output them with warning about failures | ||
| - No strings found: Display informational message to stderr, exit 0 (not an error) | ||
|
|
||
| **Progress Feedback**: Use `indicatif::ProgressBar` with spinner style. Progress messages go to stderr, results to stdout. | ||
|
|
||
| **CLI Filtering**: Filter logic is part of `Pipeline::run()` to keep main.rs under 500-line limit: | ||
|
|
||
| - Pipeline receives filter configuration from CLI args | ||
| - Filters applied after ranking, before output formatting | ||
| - Uses iterator adapters: `filter()` for criteria, `take()` for --top | ||
| - Filter validation happens during Pipeline initialization | ||
|
|
||
| **Integration**: Pipeline owns the entire process including filtering. CLI argument parsing uses `clap` derive macros. Main function is minimal: parse args, create Pipeline, call run(), handle output. | ||
|
|
||
| ### 6. Performance Optimizations | ||
|
|
||
| **Memory Mapping** (file:src/main.rs): | ||
|
|
||
| - Attempt `memmap2::Mmap` first for efficient access | ||
| - On mmap failure (network filesystem, locked file, platform limitations), fall back to `std::fs::read()` | ||
| - Pass byte slice to container parsers (works with both mmap and regular read) | ||
| - Log fallback to regular file reading for user awareness | ||
|
|
||
| **Regex Caching** (file:src/classification/semantic.rs): | ||
|
|
||
| - Migrate from `lazy_static` to `once_cell::sync::Lazy` | ||
| - Pre-compile all regex patterns at first use | ||
| - Share compiled patterns across all classification calls | ||
|
|
||
| **Dependency Additions**: | ||
|
|
||
| - `memmap2` - Memory-mapped file I/O with fallback | ||
| - `once_cell` - Modern lazy initialization (migrate from lazy_static) | ||
| - `indicatif` - Progress bars and spinners | ||
| - `rustc-demangle` - Rust symbol demangling | ||
| - `atty` or use `std::io::IsTerminal` - TTY detection for output formatting | ||
|
|
||
| ### Integration Points Summary | ||
|
|
||
|
|
||
| | Component | Consumes | Produces | Integration Point | | ||
| | ------------------ | ------------------------------------ | ----------------------------- | ----------------------------------------------- | | ||
| | Pipeline | CLI args, file path | Formatted output | Orchestrates all components + filtering | | ||
| | RankingEngine | Vec<FoundString>, debug flag | Scored Vec<FoundString> | Called after classification | | ||
| | SymbolDemangler | &mut FoundString | () | Called during classification, modifies in-place | | ||
| | SemanticClassifier | FoundString | Vec<Tag> | Extended with new patterns | | ||
| | format_output() | OutputFormat, Vec<FoundString> | String | Enum-based dispatch to formatters | | ||
|
|
||
|
|
||
| ### Testing Strategy | ||
|
|
||
| **Unit Tests**: | ||
|
|
||
| - Ranking: Test score calculation with known inputs | ||
| - Symbol demangling: Test with mangled/unmangled symbols | ||
| - Semantic patterns: Test each new pattern with positive/negative cases | ||
| - Output formatters: Test with sample data, verify format correctness | ||
|
|
||
| **Integration Tests**: | ||
|
|
||
| - End-to-end pipeline tests with fixture binaries | ||
| - CLI argument parsing and filtering | ||
| - Output format validation with `insta` snapshots | ||
| - Error handling scenarios | ||
|
|
||
| **Benchmarks**: | ||
|
|
||
| - Ranking performance with large string sets | ||
| - Regex pattern matching performance | ||
| - Memory mapping vs regular file I/O | ||
| - Overall pipeline throughput | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
find . -name "semantic.rs" -type f | head -5Repository: EvilBit-Labs/Stringy
Length of output: 96
🏁 Script executed:
wc -l src/classification/semantic.rsRepository: EvilBit-Labs/Stringy
Length of output: 99
Split src/classification/semantic.rs to meet the 500-line file size constraint.
The semantic classifier is 1542 lines, exceeding the 500-line limit defined in the technical plan (lines 73–74). The plan explicitly documents the splitting strategy at line 341: split into src/classification/semantic/mod.rs with submodules for network.rs, filesystem.rs, identifiers.rs, and encoding.rs. Refactor accordingly to maintain architectural consistency and improve maintainability.
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
125-125: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
59-59: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
317-317: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
85-85: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
221-221: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
277-277: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
311-311: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
32-32: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
11-11: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
22-22: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
33-33: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
44-44: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
55-55: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
485-485: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
125-125: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In `@project_plan/specs/Technical_Plan__Stringy_v1.0_Implementation.md` around
lines 1 - 510, The SemanticClassifier implementation in
src/classification/semantic.rs is 1542 lines and must be split into a module
with submodules to respect the 500-line limit; refactor by creating
src/classification/semantic/mod.rs that exposes pub struct SemanticClassifier
and pub fn classify(...) and move logical groups of functions/regexes into new
submodules network.rs, filesystem.rs, identifiers.rs, and encoding.rs (each
implementing small helper functions called by SemanticClassifier::classify),
migrate the once_cell regex caches into the appropriate submodule scopes, add
mod declarations and pub use in semantic/mod.rs to re-export types and
functions, and update any imports that referenced symbols in the old file to
point to semantic::... so compilation and behavior (including SymbolDemangler
and RankingEngine callers) remain unchanged.
| ```mermaid | ||
| sequenceDiagram | ||
| participant CLI as main.rs | ||
| participant Pipeline | ||
| participant Container as container/* | ||
| participant Extractor as extraction/* | ||
| participant Classifier as classification/* | ||
| participant Ranker as ranking.rs | ||
| participant Formatter as output/* | ||
|
|
||
| CLI->>Pipeline: new(config) | ||
| CLI->>Pipeline: run(file_path) | ||
| Pipeline->>Container: detect_format() & parse() | ||
| Container-->>Pipeline: ContainerInfo | ||
| Pipeline->>Extractor: extract(data, container_info) | ||
| Extractor-->>Pipeline: Vec<FoundString> | ||
| Pipeline->>Classifier: classify(strings) | ||
| Classifier->>Classifier: demangle_symbols() | ||
| Classifier-->>Pipeline: Vec<FoundString> with tags | ||
| Pipeline->>Ranker: calculate_scores(strings) | ||
| Ranker-->>Pipeline: Vec<FoundString> with scores | ||
| Pipeline->>Pipeline: apply_filters() & sort() | ||
| Pipeline->>Formatter: format(strings) | ||
| Formatter-->>CLI: Output to stdout | ||
| ``` |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Minor: Add language specifiers to code blocks.
Several code blocks are missing language specifiers, which affects syntax highlighting and readability. The mermaid diagram (lines 99-123) and Rust code examples (lines 133-154, 168-179, 204-223, 264-273, 294-303, 386-401) should have appropriate language identifiers.
Example fix for Rust code blocks
Change from:
```
pub struct FoundString {To:
```rust
pub struct FoundString {Also applies to: 133-154, 168-179, 204-223, 264-273, 294-303, 386-401
🤖 Prompt for AI Agents
In `@project_plan/specs/Technical_Plan__Stringy_v1.0_Implementation.md` around
lines 99 - 123, Update the fenced code blocks so they include language
specifiers: change the diagram block to start with ```mermaid and change all
Rust examples to start with ```rust (for example the blocks containing pub
struct FoundString and the other Rust snippets referenced in the document).
Locate the mermaid sequenceDiagram block and each Rust code snippet (e.g.,
declarations and examples like FoundString and the functions in the Rust
examples) and add the appropriate language identifier to the opening
triple-backtick fence to enable proper syntax highlighting.
|
|
||
| ## Objective | ||
|
|
||
| Complete the CLI interface with all filtering flags, output format selection, and proper validation. |
There was a problem hiding this comment.
Minor: Redundant phrasing.
"CLI interface" is redundant since CLI already stands for "Command Line Interface". Consider simplifying to "Complete the CLI with all filtering flags..."
📝 Suggested fix
-Complete the CLI interface with all filtering flags, output format selection, and proper validation.
+Complete the CLI with all filtering flags, output format selection, and proper validation.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Complete the CLI interface with all filtering flags, output format selection, and proper validation. | |
| Complete the CLI with all filtering flags, output format selection, and proper validation. |
🧰 Tools
🪛 LanguageTool
[style] ~5-~5: This phrase is redundant (‘I’ stands for ‘interface’). Use simply “CLI”.
Context: ... validation ## Objective Complete the CLI interface with all filtering flags, output format...
(ACRONYM_TAUTOLOGY)
🤖 Prompt for AI Agents
In `@project_plan/tickets/Implement_CLI_argument_parsing_and_validation.md` at
line 5, Replace the redundant phrase "CLI interface" in the ticket description
(the line containing "Complete the CLI interface with all filtering flags,
output format selection, and proper validation.") with a simplified wording such
as "Complete the CLI with all filtering flags, output format selection, and
proper validation." to remove the duplication; update the markdown heading/text
accordingly wherever that exact phrase appears in the document.
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Adjusted numbering format in user interaction flows for consistency. - Enhanced clarity in the Epic Brief by refining problem statements. - Improved formatting in the Technical Plan for better readability. - Updated performance optimization ticket to reflect new dependencies. - Enhanced clarity in the pipeline orchestration ticket scope. - Refined semantic classification ticket to include additional patterns. - Improved integration testing ticket to cover more binary types. - Updated FoundString data model ticket for clarity on new fields. - Enhanced CLI argument parsing ticket to include all flags. - Improved output formatters ticket to clarify scope and dependencies. - Refined ranking system ticket to ensure clarity on scoring logic. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Bump rust-version in Cargo.toml to 1.91 - Update rust-toolchain.toml to use channel 1.91.0 - Enhance semantic classification for Windows paths with case-insensitivity and additional validation checks - Add new test file for let chains example Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Fix all issues with AI agents
In `@project_plan/specs/Epic_Brief__Stringy_v1.0_Completion.md`:
- Around line 11-27: Replace bold pseudo-headings (e.g., "**Primary Users:**"
and "**Problem 1: Signal vs Noise**", "**Problem 2: Cryptic Symbols**",
"**Problem 3: No Prioritization**") with proper Markdown headings (for example
"### Primary Users" and "#### Problem 1: Signal vs Noise") so they are
recognized as real headings by markdownlint MD036; update each similar bold
label in the file to an appropriate heading level consistent with existing
structure (e.g., "###" for section titles, "####" for sub-sections) and remove
the surrounding bold markers.
In
`@project_plan/tickets/Add_performance_optimizations_and_dependency_updates.md`:
- Around line 9-28: The document uses bold labels "**In Scope:**" and "**Out of
Scope:**" instead of markdown headings; update the file to use proper markdown
headings (e.g., "## In Scope" and "## Out of Scope") and convert the subsection
lists under them to be clearly nested under those headings so the structure is
semantic and parsable; ensure the header level matches the document hierarchy
(use "###" for sublists if needed) and preserve all listed items (memmap2,
once_cell, indicatif, rustc-demangle, memory mapping fallback behavior,
lazy_static migration, indicatif integration, benchmarks, docs update, and the
out-of-scope notes) while removing the bolded label style.
In `@project_plan/tickets/End-to-end_integration_testing_and_validation.md`:
- Around line 9-31: Replace the bolded section labels "**In Scope:**" and "**Out
of Scope:**" with proper Markdown headings (for example "## In Scope" and "##
Out of Scope") in the document so they are real headings instead of emphasized
text; update the two occurrences matching the literal phrases "In Scope:" and
"Out of Scope:" in the file to use heading syntax and keep the existing lists
and content unchanged to satisfy markdownlint MD036.
In
`@project_plan/tickets/Extend_FoundString_data_model_for_demangling_and_debug_support.md`:
- Around line 9-21: Replace the bold label lines "**In Scope:**" and "**Out of
Scope:**" with proper markdown headings (e.g., "## In Scope" and "## Out of
Scope") so they conform to markdownlint MD036; update the surrounding lists
under the headings as needed to preserve the same content and spacing and ensure
serialization/tests/docs references to these sections still read the new heading
text.
In
`@project_plan/tickets/Implement_enum-based_output_formatters_`(Table,_JSON,_YARA).md:
- Around line 9-36: Replace the bold labels "**In Scope:**" and "**Out of
Scope:**" with proper markdown headings (e.g., "## In Scope" and "## Out of
Scope") so the document uses headings instead of bold text for section titles;
update the file's sections where the enum-based output formatter plan defines
scope and out-of-scope items (the sections currently titled "In Scope:" and "Out
of Scope:") and ensure any similar bolded section labels in that file are
converted to appropriate heading levels for compliance with MD036.
In `@src/classification/semantic.rs`:
- Around line 171-180: KNOWN_WINDOWS_PREFIXES currently only lists C:\ paths so
paths on other drive letters won't match; update the heuristic by either
replacing the fixed HashSet with a drive-letter pattern check (e.g., test for a
leading "[A-Za-z]:\" prefix) or expand the set to include other drive letters
(generate "D:\\", "E:\\", etc.), and update any callers that rely on
KNOWN_WINDOWS_PREFIXES accordingly (reference: KNOWN_WINDOWS_PREFIXES in
src/classification/semantic.rs).
- Around line 829-842: The contains_printf_placeholder function currently only
detects a small set of single-letter specifiers; update it to also catch
security-relevant specifiers like %p and %n (and optionally other common
specifiers such as %i, %c, %e, %g and length-augmented forms like %ld/%lld).
Modify contains_printf_placeholder to, after seeing '%', consume any
flags/width/precision/length tokens (or switch to a small regex) and then check
the final specifier character(s) for 'p' or 'n' (and the additional letters you
want to support) instead of only peeking one char; reference the
contains_printf_placeholder function to locate and change the '%' handling
logic. Ensure the function still returns true on detection and false otherwise.
♻️ Duplicate comments (3)
project_plan/tickets/Implement_CLI_argument_parsing_and_validation.md (1)
5-5: Redundant phrasing: "CLI interface"CLI already stands for "Command Line Interface." Simplify to "Complete the CLI with all filtering flags..."
project_plan/specs/Core_Flows__Stringy_v1.0_User_Interactions.md (1)
85-93: Add language identifiers to fenced code blocks.Code blocks at lines 85, 215, and 271 are missing language specifiers. Use
textfor output examples and table formats.project_plan/specs/Technical_Plan__Stringy_v1.0_Implementation.md (1)
1-519: Technical plan is comprehensive and well-structured.The architectural decisions are sound for a binary analysis tool. The pipeline architecture, graceful degradation strategy, and explicit error handling boundaries align with the security-focused nature of the project.
A few notes on consistency with current implementation:
once_cell migration (lines 44-53, 474-478): The plan specifies migrating from
lazy_statictoonce_cell, butCargo.tomlstill listslazy_staticas a dependency. The PR objectives confirmlazy_staticis the current implementation. Consider adding a tracking note or ticket reference for when this migration should occur.File size constraint (lines 73-74, 353): The 500-line limit and splitting strategy for
semantic.rsis documented here, but as noted in a previous review, the file currently exceeds this limit.The language specifier issues on fenced code blocks have already been flagged in previous reviews.
| **Primary Users:** | ||
|
|
||
| - **Security researchers and malware analysts** who need to quickly identify indicators of compromise (IOCs), behavioral patterns, and malicious infrastructure in binaries | ||
| - **Reverse engineers** who need to understand binary functionality, dependencies, and internal structure through string analysis | ||
| - **Open-source community** seeking a modern, intelligent alternative to decades-old tools like GNU strings | ||
| - **DevOps and security teams** who need to integrate string extraction into automated analysis pipelines | ||
|
|
||
| These users share a common need: efficient, intelligent binary analysis that surfaces meaningful information without manual noise filtering. | ||
|
|
||
| ### Current Pain Points | ||
|
|
||
| **Problem 1: Signal vs Noise** Traditional tools like GNU strings extract every printable character sequence, producing thousands of results where 90%+ are meaningless - padding bytes, binary tables, random data that happens to be printable. Users must manually scan through this noise to find the 10% that matters: URLs, file paths, registry keys, function names. This manual filtering is time-consuming, error-prone, and doesn't scale. | ||
|
|
||
| **Problem 2: Cryptic Symbols** Modern binaries contain mangled symbols (especially Rust, C++) that appear as cryptic strings like `_ZN4core3fmt3num52_$LT$impl$`. Without demangling, users cannot understand what functions are called, what libraries are used, or what the binary actually does. This forces users to copy-paste symbols into external demangling tools, breaking their workflow. | ||
|
|
||
| **Problem 3: No Prioritization** Even when users find potentially interesting strings, they have no way to know which ones are most important. Is this URL critical infrastructure or a help link? Is this file path a configuration file or a debug artifact? Without ranking and context, users waste time investigating low-value strings while missing critical indicators. | ||
|
|
There was a problem hiding this comment.
Replace bold pseudo-headings with real headings (markdownlint MD036).
Line 11 and the “Problem 1–4” labels (Lines 22–28) are bold text acting as headings. This trips MD036; please promote them to actual headings.
Suggested fix (apply similarly to all bold pseudo-headings)
-**Primary Users:**
+### Primary Users
-**Problem 1: Signal vs Noise** Traditional tools like GNU strings extract every printable character sequence, producing thousands of results where 90%+ are meaningless - padding bytes, binary tables, random data that happens to be printable. Users must manually scan through this noise to find the 10% that matters: URLs, file paths, registry keys, function names. This manual filtering is time-consuming, error-prone, and doesn't scale.
+#### Problem 1: Signal vs Noise
+Traditional tools like GNU strings extract every printable character sequence, producing thousands of results where 90%+ are meaningless - padding bytes, binary tables, random data that happens to be printable. Users must manually scan through this noise to find the 10% that matters: URLs, file paths, registry keys, function names. This manual filtering is time-consuming, error-prone, and doesn't scale.🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
11-11: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
22-22: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🤖 Prompt for AI Agents
In `@project_plan/specs/Epic_Brief__Stringy_v1.0_Completion.md` around lines 11 -
27, Replace bold pseudo-headings (e.g., "**Primary Users:**" and "**Problem 1:
Signal vs Noise**", "**Problem 2: Cryptic Symbols**", "**Problem 3: No
Prioritization**") with proper Markdown headings (for example "### Primary
Users" and "#### Problem 1: Signal vs Noise") so they are recognized as real
headings by markdownlint MD036; update each similar bold label in the file to an
appropriate heading level consistent with existing structure (e.g., "###" for
section titles, "####" for sub-sections) and remove the surrounding bold
markers.
| **In Scope:** | ||
|
|
||
| - Add dependencies to Cargo.toml: | ||
| - memmap2 for memory-mapped file I/O | ||
| - once_cell for modern lazy initialization | ||
| - indicatif for progress bars | ||
| - rustc-demangle for symbol demangling | ||
| - Implement memory mapping with fallback in Pipeline: | ||
| - Attempt memmap2::Mmap first | ||
| - Fall back to std::fs::read() on failure | ||
| - Log fallback for user awareness | ||
| - Migrate all lazy_static usage to once_cell::sync::Lazy | ||
| - Integrate indicatif progress indicators in Pipeline | ||
| - Add benchmarks for performance validation | ||
| - Update documentation with performance characteristics | ||
|
|
||
| **Out of Scope:** | ||
|
|
||
| - Pipeline implementation (separate ticket, but this ticket provides the tools) | ||
| - Classification implementation (separate ticket, but this ticket provides regex caching) |
There was a problem hiding this comment.
Use proper markdown headings for subsections.
The "In Scope:" and "Out of Scope:" labels use bold emphasis instead of proper headings. While functional, this deviates from markdown best practices and makes the document structure less clear.
📝 Proposed fix
-**In Scope:**
+### In Scope
- Add dependencies to Cargo.toml:-**Out of Scope:**
+### Out of Scope
- Pipeline implementation (separate ticket, but this ticket provides the tools)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| **In Scope:** | |
| - Add dependencies to Cargo.toml: | |
| - memmap2 for memory-mapped file I/O | |
| - once_cell for modern lazy initialization | |
| - indicatif for progress bars | |
| - rustc-demangle for symbol demangling | |
| - Implement memory mapping with fallback in Pipeline: | |
| - Attempt memmap2::Mmap first | |
| - Fall back to std::fs::read() on failure | |
| - Log fallback for user awareness | |
| - Migrate all lazy_static usage to once_cell::sync::Lazy | |
| - Integrate indicatif progress indicators in Pipeline | |
| - Add benchmarks for performance validation | |
| - Update documentation with performance characteristics | |
| **Out of Scope:** | |
| - Pipeline implementation (separate ticket, but this ticket provides the tools) | |
| - Classification implementation (separate ticket, but this ticket provides regex caching) | |
| ### In Scope | |
| - Add dependencies to Cargo.toml: | |
| - memmap2 for memory-mapped file I/O | |
| - once_cell for modern lazy initialization | |
| - indicatif for progress bars | |
| - rustc-demangle for symbol demangling | |
| - Implement memory mapping with fallback in Pipeline: | |
| - Attempt memmap2::Mmap first | |
| - Fall back to std::fs::read() on failure | |
| - Log fallback for user awareness | |
| - Migrate all lazy_static usage to once_cell::sync::Lazy | |
| - Integrate indicatif progress indicators in Pipeline | |
| - Add benchmarks for performance validation | |
| - Update documentation with performance characteristics | |
| ### Out of Scope | |
| - Pipeline implementation (separate ticket, but this ticket provides the tools) | |
| - Classification implementation (separate ticket, but this ticket provides regex caching) |
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
11-11: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
22-22: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🤖 Prompt for AI Agents
In `@project_plan/tickets/Add_performance_optimizations_and_dependency_updates.md`
around lines 9 - 28, The document uses bold labels "**In Scope:**" and "**Out of
Scope:**" instead of markdown headings; update the file to use proper markdown
headings (e.g., "## In Scope" and "## Out of Scope") and convert the subsection
lists under them to be clearly nested under those headings so the structure is
semantic and parsable; ensure the header level matches the document hierarchy
(use "###" for sublists if needed) and preserve all listed items (memmap2,
once_cell, indicatif, rustc-demangle, memory mapping fallback behavior,
lazy_static migration, indicatif integration, benchmarks, docs update, and the
out-of-scope notes) while removing the bolded label style.
| **In Scope:** | ||
|
|
||
| - Expand tests/fixtures/ with diverse test binaries: | ||
| - ELF binaries with various string types | ||
| - PE binaries with resources and imports | ||
| - Mach-O binaries with load commands | ||
| - Binaries with mangled symbols | ||
| - Create end-to-end integration tests: | ||
| - Test complete pipeline with each binary format | ||
| - Test all output formats (table, JSON, YARA) | ||
| - Test filtering combinations | ||
| - Test error scenarios (corrupted binaries, unsupported formats) | ||
| - Test edge cases (no strings, all strings filtered out) | ||
| - Add insta snapshot tests for output validation | ||
| - Add CLI integration tests | ||
| - Validate against all requirements from file:.kiro/specs/stringy-binary-analyzer/requirements.md | ||
| - Add performance benchmarks for complete pipeline | ||
|
|
||
| **Out of Scope:** | ||
|
|
||
| - Unit tests for individual components (handled in component tickets) | ||
| - Implementation of components (separate tickets) | ||
|
|
There was a problem hiding this comment.
Use headings instead of bold labels for section titles.
Lines 9–10 and 27–28 use bold text as headings, which violates markdownlint MD036.
Suggested fix
-**In Scope:**
+### In Scope
-**Out of Scope:**
+### Out of Scope📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| **In Scope:** | |
| - Expand tests/fixtures/ with diverse test binaries: | |
| - ELF binaries with various string types | |
| - PE binaries with resources and imports | |
| - Mach-O binaries with load commands | |
| - Binaries with mangled symbols | |
| - Create end-to-end integration tests: | |
| - Test complete pipeline with each binary format | |
| - Test all output formats (table, JSON, YARA) | |
| - Test filtering combinations | |
| - Test error scenarios (corrupted binaries, unsupported formats) | |
| - Test edge cases (no strings, all strings filtered out) | |
| - Add insta snapshot tests for output validation | |
| - Add CLI integration tests | |
| - Validate against all requirements from file:.kiro/specs/stringy-binary-analyzer/requirements.md | |
| - Add performance benchmarks for complete pipeline | |
| **Out of Scope:** | |
| - Unit tests for individual components (handled in component tickets) | |
| - Implementation of components (separate tickets) | |
| ### In Scope | |
| - Expand tests/fixtures/ with diverse test binaries: | |
| - ELF binaries with various string types | |
| - PE binaries with resources and imports | |
| - Mach-O binaries with load commands | |
| - Binaries with mangled symbols | |
| - Create end-to-end integration tests: | |
| - Test complete pipeline with each binary format | |
| - Test all output formats (table, JSON, YARA) | |
| - Test filtering combinations | |
| - Test error scenarios (corrupted binaries, unsupported formats) | |
| - Test edge cases (no strings, all strings filtered out) | |
| - Add insta snapshot tests for output validation | |
| - Add CLI integration tests | |
| - Validate against all requirements from file:.kiro/specs/stringy-binary-analyzer/requirements.md | |
| - Add performance benchmarks for complete pipeline | |
| ### Out of Scope | |
| - Unit tests for individual components (handled in component tickets) | |
| - Implementation of components (separate tickets) |
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
11-11: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
22-22: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🤖 Prompt for AI Agents
In `@project_plan/tickets/End-to-end_integration_testing_and_validation.md` around
lines 9 - 31, Replace the bolded section labels "**In Scope:**" and "**Out of
Scope:**" with proper Markdown headings (for example "## In Scope" and "## Out
of Scope") in the document so they are real headings instead of emphasized text;
update the two occurrences matching the literal phrases "In Scope:" and "Out of
Scope:" in the file to use heading syntax and keep the existing lists and
content unchanged to satisfy markdownlint MD036.
| **In Scope:** | ||
|
|
||
| - Add `original_text: Option<String>` field to preserve mangled symbols | ||
| - Add optional breakdown fields: `section_weight: Option<i32>`, `semantic_boost: Option<i32>`, `noise_penalty: Option<i32>` | ||
| - Update serde serialization to handle new fields correctly | ||
| - Update existing tests to account for new fields | ||
| - Update documentation with field descriptions | ||
|
|
||
| **Out of Scope:** | ||
|
|
||
| - Actual demangling logic (handled in separate ticket) | ||
| - Score calculation logic (handled in separate ticket) | ||
| - CLI --debug flag implementation (handled in pipeline ticket) |
There was a problem hiding this comment.
Use headings instead of bold labels for section titles.
Lines 9–10 and 17–21 use bold text as headings, which violates markdownlint MD036.
Suggested fix
-**In Scope:**
+### In Scope
-**Out of Scope:**
+### Out of Scope🤖 Prompt for AI Agents
In
`@project_plan/tickets/Extend_FoundString_data_model_for_demangling_and_debug_support.md`
around lines 9 - 21, Replace the bold label lines "**In Scope:**" and "**Out of
Scope:**" with proper markdown headings (e.g., "## In Scope" and "## Out of
Scope") so they conform to markdownlint MD036; update the surrounding lists
under the headings as needed to preserve the same content and spacing and ensure
serialization/tests/docs references to these sections still read the new heading
text.
| **In Scope:** | ||
|
|
||
| - Create file:src/output/mod.rs with OutputFormat enum and format_output() function | ||
| - Create file:src/output/table.rs: | ||
| - TTY detection using std::io::IsTerminal | ||
| - Table formatting with columns: String | Tags | Score | Section | ||
| - Non-TTY plain text output (one string per line) | ||
| - Primary tag display with comma-separation for same-priority tags | ||
| - Create file:src/output/json.rs: | ||
| - JSONL output (one JSON object per line) | ||
| - Include all FoundString fields | ||
| - Include original_text if present | ||
| - Include breakdown fields only if populated | ||
| - Create file:src/output/yara.rs: | ||
| - Complete YARA rule template generation | ||
| - Binary filename sanitization for rule names | ||
| - String escaping for YARA syntax | ||
| - Skip strings over 200 chars with comment | ||
| - Include metadata section (description, tool, date) | ||
| - Add comprehensive unit tests for each formatter | ||
| - Add integration tests with insta snapshots | ||
|
|
||
| **Out of Scope:** | ||
|
|
||
| - CLI integration (separate ticket) | ||
| - Summary statistics formatting (handled in pipeline ticket) | ||
| - Progress feedback (separate ticket) | ||
|
|
There was a problem hiding this comment.
Use headings instead of bold labels for section titles.
Lines 9–10 and 31–35 use bold text as headings, which violates markdownlint MD036.
Suggested fix
-**In Scope:**
+### In Scope
-**Out of Scope:**
+### Out of Scope📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| **In Scope:** | |
| - Create file:src/output/mod.rs with OutputFormat enum and format_output() function | |
| - Create file:src/output/table.rs: | |
| - TTY detection using std::io::IsTerminal | |
| - Table formatting with columns: String | Tags | Score | Section | |
| - Non-TTY plain text output (one string per line) | |
| - Primary tag display with comma-separation for same-priority tags | |
| - Create file:src/output/json.rs: | |
| - JSONL output (one JSON object per line) | |
| - Include all FoundString fields | |
| - Include original_text if present | |
| - Include breakdown fields only if populated | |
| - Create file:src/output/yara.rs: | |
| - Complete YARA rule template generation | |
| - Binary filename sanitization for rule names | |
| - String escaping for YARA syntax | |
| - Skip strings over 200 chars with comment | |
| - Include metadata section (description, tool, date) | |
| - Add comprehensive unit tests for each formatter | |
| - Add integration tests with insta snapshots | |
| **Out of Scope:** | |
| - CLI integration (separate ticket) | |
| - Summary statistics formatting (handled in pipeline ticket) | |
| - Progress feedback (separate ticket) | |
| ### In Scope | |
| - Create file:src/output/mod.rs with OutputFormat enum and format_output() function | |
| - Create file:src/output/table.rs: | |
| - TTY detection using std::io::IsTerminal | |
| - Table formatting with columns: String | Tags | Score | Section | |
| - Non-TTY plain text output (one string per line) | |
| - Primary tag display with comma-separation for same-priority tags | |
| - Create file:src/output/json.rs: | |
| - JSONL output (one JSON object per line) | |
| - Include all FoundString fields | |
| - Include original_text if present | |
| - Include breakdown fields only if populated | |
| - Create file:src/output/yara.rs: | |
| - Complete YARA rule template generation | |
| - Binary filename sanitization for rule names | |
| - String escaping for YARA syntax | |
| - Skip strings over 200 chars with comment | |
| - Include metadata section (description, tool, date) | |
| - Add comprehensive unit tests for each formatter | |
| - Add integration tests with insta snapshots | |
| ### Out of Scope | |
| - CLI integration (separate ticket) | |
| - Summary statistics formatting (handled in pipeline ticket) | |
| - Progress feedback (separate ticket) |
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
11-11: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
22-22: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
33-33: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🤖 Prompt for AI Agents
In
`@project_plan/tickets/Implement_enum-based_output_formatters_`(Table,_JSON,_YARA).md
around lines 9 - 36, Replace the bold labels "**In Scope:**" and "**Out of
Scope:**" with proper markdown headings (e.g., "## In Scope" and "## Out of
Scope") so the document uses headings instead of bold text for section titles;
update the file's sections where the enum-based output formatter plan defines
scope and out-of-scope items (the sections currently titled "In Scope:" and "Out
of Scope:") and ensure any similar bolded section labels in that file are
converted to appropriate heading levels for compliance with MD036.
| /// Known valid Windows path prefixes | ||
| static ref KNOWN_WINDOWS_PREFIXES: std::collections::HashSet<&'static str> = { | ||
| let mut set = std::collections::HashSet::new(); | ||
| set.insert("C:\\Windows\\"); | ||
| set.insert("C:\\Program Files\\"); | ||
| set.insert("C:\\Program Files (x86)\\"); | ||
| set.insert("C:\\Users\\"); | ||
| set.insert("C:\\ProgramData\\"); | ||
| set | ||
| }; |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider: KNOWN_WINDOWS_PREFIXES only covers C: drive.
The set only includes C:\ prefixed paths. Paths on other drives (D:, E:) won't match known prefixes. This is likely acceptable since it's used for heuristic validation (allowing longer unknown paths), not rejection. Verify this is the intended behavior.
🤖 Prompt for AI Agents
In `@src/classification/semantic.rs` around lines 171 - 180,
KNOWN_WINDOWS_PREFIXES currently only lists C:\ paths so paths on other drive
letters won't match; update the heuristic by either replacing the fixed HashSet
with a drive-letter pattern check (e.g., test for a leading "[A-Za-z]:\" prefix)
or expand the set to include other drive letters (generate "D:\\", "E:\\",
etc.), and update any callers that rely on KNOWN_WINDOWS_PREFIXES accordingly
(reference: KNOWN_WINDOWS_PREFIXES in src/classification/semantic.rs).
| fn contains_printf_placeholder(&self, text: &str) -> bool { | ||
| let mut chars = text.chars().peekable(); | ||
|
|
||
| while let Some(ch) = chars.next() { | ||
| if ch == '%' | ||
| && let Some(next) = chars.peek() | ||
| && matches!(next, 's' | 'd' | 'x' | 'o' | 'u' | 'f') | ||
| { | ||
| return true; | ||
| } | ||
| } | ||
|
|
||
| false | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Printf placeholder detection is conservative.
The pattern only matches %s, %d, %x, %o, %u, %f. This misses %i, %p, %n, %c, %e, %g, %ld, %lld, etc. For a false-positive-reduction focus, this conservative approach is reasonable, but consider expanding to at least include %p (pointer) and %n (dangerous format string specifier) which are security-relevant.
Optional: Expand printf specifiers
fn contains_printf_placeholder(&self, text: &str) -> bool {
let mut chars = text.chars().peekable();
while let Some(ch) = chars.next() {
if ch == '%'
&& let Some(next) = chars.peek()
- && matches!(next, 's' | 'd' | 'x' | 'o' | 'u' | 'f')
+ && matches!(next, 's' | 'd' | 'x' | 'o' | 'u' | 'f' | 'i' | 'p' | 'n' | 'c')
{
return true;
}
}
false
}🤖 Prompt for AI Agents
In `@src/classification/semantic.rs` around lines 829 - 842, The
contains_printf_placeholder function currently only detects a small set of
single-letter specifiers; update it to also catch security-relevant specifiers
like %p and %n (and optionally other common specifiers such as %i, %c, %e, %g
and length-augmented forms like %ld/%lld). Modify contains_printf_placeholder
to, after seeing '%', consume any flags/width/precision/length tokens (or switch
to a small regex) and then check the final specifier character(s) for 'p' or 'n'
(and the additional letters you want to support) instead of only peeking one
char; reference the contains_printf_placeholder function to locate and change
the '%' handling logic. Ensure the function still returns true on detection and
false otherwise.
…indows, and registry paths (#121) * feat(docs): add AI agent guidelines and character usage policy Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> * chore(docs): revise AI agent guidelines for clarity and rules Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> * chore(docs): update module structure formatting in documentation Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> * feat(classification): implement file path classification for POSIX and Windows - Added POSIX and Windows file path pattern matching - Included registry path detection - Comprehensive unit and integration tests added Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> * feat(classification): enhance path and registry detection - Implement POSIX, Windows, and UNC path classification - Add support for Windows registry path detection - Update documentation and tests for new capabilities Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> * chore: minor docs and test adjustments - Fix MSRV reference from 1.91+ to 1.85+ in copilot instructions - Reorganize classification docs for clarity - Increase classification test timeout for CI reliability * fix(classification): address code review feedback on path classification performance and consistency (#122) * Initial plan * fix: address code review feedback on path classification - Clarify Windows path separator validation in docs - Optimize registry path matching with zero-allocation case-insensitive search - Prevent duplicate Tag::FilePath entries - Add comment explaining 500ms performance test timeout Co-authored-by: unclesp1d3r <251112+unclesp1d3r@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: unclesp1d3r <251112+unclesp1d3r@users.noreply.github.com> * chore: add comprehensive codebase analysis documentation Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> * chore: add CodeRabbit configuration file for project setup This configuration file includes general settings, review guidelines, path filters, and instructions for various modules to streamline code reviews and maintain project standards. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> * chore: improve formatting and readability in codebase analysis Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> * chore: update formatting in copilot instructions Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> * chore: update Cargo.toml and codebase_analysis.md formatting Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> * chore: refresh task list to reflect current implementation state * chore: add documentation for core flows and technical plan - Introduced detailed documentation for core user interactions in Stringy v1.0. - Added technical plan outlining architecture, design decisions, and integration strategies. - Included performance optimization ticket and end-to-end testing strategy. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> * chore: add MSRV check to CI workflow Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> * chore: update character restrictions in copilot instructions Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> * chore: update documentation and improve formatting - Adjusted numbering format in user interaction flows for consistency. - Enhanced clarity in the Epic Brief by refining problem statements. - Improved formatting in the Technical Plan for better readability. - Updated performance optimization ticket to reflect new dependencies. - Enhanced clarity in the pipeline orchestration ticket scope. - Refined semantic classification ticket to include additional patterns. - Improved integration testing ticket to cover more binary types. - Updated FoundString data model ticket for clarity on new fields. - Enhanced CLI argument parsing ticket to include all flags. - Improved output formatters ticket to clarify scope and dependencies. - Refined ranking system ticket to ensure clarity on scoring logic. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> * chore: update directory structure path in analysis Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> * chore: update Cargo.toml and rust-toolchain for Rust 1.91 - Bump rust-version in Cargo.toml to 1.91 - Update rust-toolchain.toml to use channel 1.91.0 - Enhance semantic classification for Windows paths with case-insensitivity and additional validation checks - Add new test file for let chains example Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> --------- Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Summary
This PR implements comprehensive file path and registry path classification for the semantic classifier, completing task 9.3 from the implementation plan. The changes add detection for POSIX paths, Windows paths, UNC network paths, and Windows registry paths with validation and suspicious path identification.
Impact: 10 files changed (+1,228 additions, -182 deletions)
Risk Level: Low - adds new functionality without modifying existing behavior
Review Time: ~15-20 minutes
What Changed
Source Changes
src/classification/semantic.rs(+655 lines): Core implementation of path classificationsrc/classification/mod.rs(+10 lines): Module documentation updatesTest Changes
tests/classification_integration.rs(+152 lines): Integration tests for all path typestests/snapshots/classification_integration__classification_snapshots.snap(+43 lines): Snapshot testsDocumentation Changes
docs/src/classification.md(+86/-168 lines): Updated classification documentation.github/copilot-instructions.md(+208 lines): New Copilot instructions fileAGENTS.md(+69 lines): AI agent guidelinesCLAUDE.md(+1 line): Claude configurationConfiguration Changes
.claude/settings.json(+5 lines): Claude Code plugin settings.kiro/specs/stringy-binary-analyzer/tasks.md(+9 lines): Task tracking updateImplementation Details
Path Classification Features
^/[^\0\n\r]*/usr/bin/bash^[A-Za-z]:\\[^\0\n\r]*C:\Windows\System32\cmd.exe^\\\\[a-zA-Z0-9.-]+\\[^\0\n\r]*\\server\share\file.txt^HKEY_[A-Z_]+\\[^\0\n\r]*HKEY_LOCAL_MACHINE\SOFTWARE\...Validation Rules
//for POSIX,\\for Windows after initial UNC prefix)/)Suspicious Path Detection
The classifier identifies known persistence/execution locations:
POSIX:
/etc/cron.d/,/etc/init.d/,/tmp/,/var/tmp/,/Library/LaunchDaemons/Windows:
C:\Windows\System32\,C:\Windows\Temp\,...\AppData\Roaming\...\Startup\Registry:
\SOFTWARE\Microsoft\Windows\CurrentVersion\Run,\System\CurrentControlSet\ServicesArchitecture
All regex patterns are compiled once via
lazy_static!and cached across classifier instances.How Has This Been Tested?
Unit Tests (in
semantic.rs)Integration Tests (in
classification_integration.rs)Review Checklist
General
Code Quality
Option<Tag>consistentlyTesting
Security
Breaking Changes
None. This PR adds new functionality without modifying existing behavior.
Dependencies
No new dependencies added. Uses existing:
lazy_staticfor regex cachingregexfor pattern matchinginstafor snapshot testingRelated Issues
Closes task 9.3: Implement file path classification
Generated with Claude Code