feat(classification): implement file path classification for POSIX, Windows, and registry paths #121
Merged: unclesp1d3r merged 19 commits into `main` from `17-implement-file-path-classification-for-posix-windows-and-registry-paths` on Jan 18, 2026
19 commits
- `0182e8b` feat(docs): add AI agent guidelines and character usage policy (unclesp1d3r)
- `ff7130c` chore(docs): revise AI agent guidelines for clarity and rules (unclesp1d3r)
- `e4c82ca` chore(docs): update module structure formatting in documentation (unclesp1d3r)
- `dd404fe` feat(classification): implement file path classification for POSIX an… (unclesp1d3r)
- `ab03844` feat(classification): enhance path and registry detection (unclesp1d3r)
- `449e425` chore: minor docs and test adjustments (unclesp1d3r)
- `6ec87f8` fix(classification): address code review feedback on path classificat… (Copilot)
- `6f51cca` chore: add comprehensive codebase analysis documentation (unclesp1d3r)
- `bb7de66` chore: add CodeRabbit configuration file for project setup (unclesp1d3r)
- `d421b05` chore: improve formatting and readability in codebase analysis (unclesp1d3r)
- `d099313` chore: update formatting in copilot instructions (unclesp1d3r)
- `1dd48a1` chore: update Cargo.toml and codebase_analysis.md formatting (unclesp1d3r)
- `d351b82` chore: refresh task list to reflect current implementation state (unclesp1d3r)
- `034cbe3` chore: add documentation for core flows and technical plan (unclesp1d3r)
- `4d30cbf` chore: add MSRV check to CI workflow (unclesp1d3r)
- `3bdbf53` chore: update character restrictions in copilot instructions (unclesp1d3r)
- `d22a553` chore: update documentation and improve formatting (unclesp1d3r)
- `4113f60` chore: update directory structure path in analysis (unclesp1d3r)
- `8b97657` chore: update Cargo.toml and rust-toolchain for Rust 1.91 (unclesp1d3r)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| { | ||
| "enabledPlugins": { | ||
| "commit@cc-marketplace": true | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,208 @@ | ||
| # Copilot Instructions for Stringy | ||
|
|
||
| ## Project Overview | ||
|
|
||
| Stringy is a **smarter strings tool** for extracting meaningful strings from ELF, PE, and Mach-O binaries using format-specific knowledge and semantic classification. Unlike the standard `strings` command, Stringy is data-structure aware, section-aware, and semantically intelligent. | ||
|
|
||
| ## Architecture & Data Flow | ||
|
|
||
| ```text | ||
| Binary -> Format Detection (goblin) -> Container Parsing -> String Extraction -> Deduplication -> Classification -> Ranking -> Output | ||
| ``` | ||
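The first stage of the pipeline above, format detection, can be illustrated with a minimal std-only sketch. This is not how `goblin` or Stringy implement it; `detect_format` is a hypothetical helper that only inspects the leading magic bytes, and the `BinaryFormat` variants are assumed names.

```rust
/// Container formats named in this document (variant names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BinaryFormat {
    Elf,
    Pe,
    MachO,
}

/// Guess the container format from the leading magic bytes.
/// Simplification: a real PE check would also follow the DOS header to the
/// "PE\0\0" signature, and 0xCAFEBABE is shared with Java class files.
pub fn detect_format(data: &[u8]) -> Option<BinaryFormat> {
    // Mach-O magics as they appear on disk: 32/64-bit in both byte orders,
    // plus the fat (universal) binary magic.
    const MACHO_MAGICS: [[u8; 4]; 5] = [
        [0xFE, 0xED, 0xFA, 0xCE],
        [0xFE, 0xED, 0xFA, 0xCF],
        [0xCE, 0xFA, 0xED, 0xFE],
        [0xCF, 0xFA, 0xED, 0xFE],
        [0xCA, 0xFE, 0xBA, 0xBE],
    ];

    if data.starts_with(b"\x7fELF") {
        return Some(BinaryFormat::Elf);
    }
    if data.starts_with(b"MZ") {
        return Some(BinaryFormat::Pe);
    }
    if MACHO_MAGICS.iter().any(|magic| data.starts_with(&magic[..])) {
        return Some(BinaryFormat::MachO);
    }
    None
}
```

Returning `Option` rather than an error keeps the sketch small; a caller would map `None` to whatever "unsupported format" error the project defines.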
|
|
||
| ### Module Organization | ||
|
|
||
| - **`container/`** \[COMPLETE\]: Format detection (ELF/PE/Mach-O), section analysis, imports/exports via `goblin` | ||
| - **`extraction/`** \[COMPLETE\]: ASCII/UTF-8/UTF-16 string extraction, deduplication, PE resources | ||
| - **`classification/`** \[PARTIAL\]: Semantic tagging (URLs, IPs, domains, paths, GUIDs, etc.) | ||
| - **`output/`** \[PLANNED\]: JSON/human-readable/YARA-friendly formatting | ||
| - **`types/`** \[COMPLETE\]: Core data structures (`FoundString`, `ContainerInfo`, etc.), error handling | ||
|
|
||
| ## Critical Coding Standards | ||
|
|
||
| ### Zero Tolerance Policies | ||
|
|
||
| - **No `unsafe` code**: `#![forbid(unsafe_code)]` enforced at package level | ||
| - **Zero warnings**: `cargo clippy -- -D warnings` must pass (`#![deny(warnings)]` enforced) | ||
| - **Rust 2024 Edition**: MSRV 1.85+, always use latest edition features | ||
| - **File size limit**: Keep files \<=500-600 lines; split larger files into focused modules | ||
| - **No blanket `#[allow]`**: Any `allow` attribute requires inline justification and cannot apply to entire files/modules | ||
| - **Character restrictions**: Never use emojis, em-dashes, or other non-ASCII characters in code or documentation. Use standard ASCII punctuation (hyphens, quotes, etc.) | ||
|
|
||
| ### Error Handling with `thiserror` | ||
|
|
||
| Use structured errors with detailed context (see `src/types.rs`): | ||
|
|
||
| ```rust | ||
| #[derive(Debug, Error)] | ||
| pub enum StringyError { | ||
| #[error("Binary parsing error: {0}")] | ||
| ParseError(String), | ||
|
|
||
| #[error("Invalid encoding at offset {offset}")] | ||
| EncodingError { offset: u64 }, | ||
| } | ||
| ``` | ||
|
|
||
| Convert external errors with `From` implementations. Provide offsets, section names, and file paths in error messages. | ||
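As a rough illustration of converting an external error with a `From` implementation, the sketch below maps `std::str::Utf8Error` onto the `EncodingError` variant. It is std-only so that it stands alone: the `Display` impl is written by hand where the project would derive it with `thiserror`, and `decode_utf8` is a hypothetical helper, not part of Stringy.

```rust
use std::fmt;
use std::str::Utf8Error;

/// Hand-written stand-in for the enum shown above (the project derives this
/// with `thiserror`; the manual `Display` impl here is an assumption).
#[derive(Debug)]
pub enum StringyError {
    ParseError(String),
    EncodingError { offset: u64 },
}

impl fmt::Display for StringyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::ParseError(msg) => write!(f, "Binary parsing error: {msg}"),
            Self::EncodingError { offset } => {
                write!(f, "Invalid encoding at offset {offset}")
            }
        }
    }
}

impl std::error::Error for StringyError {}

/// Convert the external error, keeping the byte offset as context.
impl From<Utf8Error> for StringyError {
    fn from(err: Utf8Error) -> Self {
        // `valid_up_to` is the index of the first invalid byte in the slice.
        Self::EncodingError { offset: err.valid_up_to() as u64 }
    }
}

/// Hypothetical helper: `?` applies the `From` impl automatically.
pub fn decode_utf8(bytes: &[u8]) -> Result<&str, StringyError> {
    Ok(std::str::from_utf8(bytes)?)
}
```

Note that the offset here is relative to the slice passed in; a caller that wants file offsets would need to add the section's base offset.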
|
|
||
| ## Key Implementation Patterns | ||
|
|
||
| ### Section Weight System | ||
|
|
||
| Container parsers assign weights (1.0-10.0) to sections based on string likelihood: | ||
|
|
||
| ```rust | ||
| // ELF example from container/elf.rs | ||
| ".rodata" | ".rodata.str1.*" => 10.0 // Highest priority | ||
| ".comment" | ".note.*" => 9.0 // Build info | ||
| ".data.rel.ro" => 7.0 // Read-only data | ||
| ".data" => 5.0 // Writable data (lower priority) | ||
| ``` | ||
|
|
||
| **Pattern**: Use `match` expressions with a wildcard fallback arm to assign weights; a higher weight means the section is more likely to contain meaningful strings. | ||
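The excerpt above is abbreviated: Rust string patterns do not glob, so an arm like `".rodata.str1.*"` needs a guard. A minimal sketch of how the same table might be written, assuming a hypothetical `section_weight` helper and an assumed default of 1.0 for unlisted sections:

```rust
/// Hypothetical helper: map an ELF section name to a weight in 1.0-10.0.
/// The listed weights come from the excerpt above; the default is assumed.
pub fn section_weight(name: &str) -> f32 {
    match name {
        ".rodata" => 10.0,
        // Prefix match stands in for the ".rodata.str1.*" shorthand.
        n if n.starts_with(".rodata.str1.") => 10.0,
        ".comment" => 9.0,
        // Prefix match stands in for the ".note.*" shorthand.
        n if n.starts_with(".note.") => 9.0,
        ".data.rel.ro" => 7.0,
        ".data" => 5.0,
        // Wildcard arm: anything not listed gets the assumed minimum.
        _ => 1.0,
    }
}
```

Arms are tested top to bottom, so exact names should precede broader prefix guards that could also match them.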
|
|
||
| ### String Deduplication (`extraction/dedup.rs`) | ||
|
|
||
| Strings are grouped by a `(text, encoding)` tuple key in a `HashMap<(String, Encoding), Vec<StringOccurrence>>`: | ||
|
|
||
| - **Preserve all occurrences**: Each occurrence captures offset, RVA, section, source, tags, score, confidence | ||
| - **Tag merging**: Union all tags via `HashSet`, then sort | ||
| - **Combined scoring formula**: | ||
| ```text | ||
| base_score = max(occurrence.original_score) | ||
| occurrence_bonus = 5 * (count - 1) | ||
| cross_section_bonus = 10 (if >1 unique section) | ||
| multi_source_bonus = 15 (if >1 unique StringSource) | ||
| confidence_boost = (max_confidence * 10.0) as i32 | ||
| ``` | ||
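A minimal sketch of the tag merging and scoring described above. The formula lists its components without saying how they combine; this sketch assumes they are simply summed. `Occurrence`, its fields, and the `StringSource` variants are simplified, hypothetical stand-ins for the project's types.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the project's `StringSource`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum StringSource {
    SectionData,
    ImportName,
}

/// Simplified occurrence record (field names are assumptions).
pub struct Occurrence {
    pub original_score: i32,
    pub confidence: f32,
    pub section: &'static str,
    pub source: StringSource,
    pub tags: Vec<&'static str>,
}

/// Combine per-occurrence scores, assuming the components are summed.
pub fn combined_score(occurrences: &[Occurrence]) -> i32 {
    if occurrences.is_empty() {
        return 0;
    }
    let count = occurrences.len() as i32;
    let base_score = occurrences.iter().map(|o| o.original_score).max().unwrap_or(0);
    let occurrence_bonus = 5 * (count - 1);

    let sections: HashSet<&str> = occurrences.iter().map(|o| o.section).collect();
    let sources: HashSet<StringSource> = occurrences.iter().map(|o| o.source).collect();
    let cross_section_bonus = if sections.len() > 1 { 10 } else { 0 };
    let multi_source_bonus = if sources.len() > 1 { 15 } else { 0 };

    let max_confidence = occurrences.iter().map(|o| o.confidence).fold(0.0_f32, f32::max);
    let confidence_boost = (max_confidence * 10.0) as i32;

    base_score + occurrence_bonus + cross_section_bonus + multi_source_bonus + confidence_boost
}

/// Union all tags via a `HashSet`, then sort for deterministic output.
pub fn merged_tags(occurrences: &[Occurrence]) -> Vec<&'static str> {
    let set: HashSet<&'static str> =
        occurrences.iter().flat_map(|o| o.tags.iter().copied()).collect();
    let mut tags: Vec<&'static str> = set.into_iter().collect();
    tags.sort_unstable();
    tags
}
```

For example, two occurrences with scores 10 and 20, confidences 0.5 and 0.75, in two different sections but from the same source, would score 20 + 5 + 10 + 0 + 7 = 42 under the summation assumption.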
|
|
||
| ### Non-Exhaustive Structs | ||
|
|
||
| Use `#[non_exhaustive]` for public API structs like `ContainerInfo` and provide explicit constructors (see `types.rs`): | ||
|
|
||
| ```rust | ||
| #[non_exhaustive] | ||
| pub struct ContainerInfo { /* fields */ } | ||
|
|
||
| impl ContainerInfo { | ||
| pub fn new(format: BinaryFormat, sections: Vec<SectionInfo>, ...) -> Self { ... } | ||
| } | ||
| ``` | ||
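To make the pattern concrete, here is a compilable sketch with only two fields. The real `ContainerInfo` has more fields and constructor parameters than shown in the excerpt; everything below (`SectionInfo`'s fields, the `BinaryFormat` variants, the two-argument `new`) is a simplified assumption.

```rust
/// Assumed variants for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BinaryFormat {
    Elf,
    Pe,
    MachO,
}

/// Simplified stand-in for the project's section record.
#[derive(Debug, Clone, PartialEq)]
pub struct SectionInfo {
    pub name: String,
    pub weight: f32,
}

/// `#[non_exhaustive]` stops downstream crates from building this struct
/// with a literal, so new fields can be added without a breaking change.
#[non_exhaustive]
#[derive(Debug, Clone, PartialEq)]
pub struct ContainerInfo {
    pub format: BinaryFormat,
    pub sections: Vec<SectionInfo>,
}

impl ContainerInfo {
    /// Explicit constructor: the only way for other crates to create a value.
    pub fn new(format: BinaryFormat, sections: Vec<SectionInfo>) -> Self {
        Self { format, sections }
    }
}
```

The attribute only restricts code outside the defining crate; within the crate, struct literals and exhaustive destructuring keep working as usual.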
|
|
||
| ## Testing Standards | ||
|
|
||
| - **Snapshot testing**: Use `insta` for output verification (`tests/integration_*.rs`) | ||
| - **Fixtures**: Binary test fixtures in `tests/fixtures/` (see `fixtures/README.md`) | ||
| - **Integration tests**: Named `test_*.rs` or `integration_*.rs` in `tests/` | ||
| - **Run tests**: `just test` (uses `cargo nextest`) | ||
|
|
||
| Example pattern from `tests/integration_elf.rs`: | ||
|
|
||
| ```rust | ||
| fn get_fixture_path(name: &str) -> PathBuf { | ||
| Path::new(env!("CARGO_MANIFEST_DIR")) | ||
| .join("tests/fixtures") | ||
| .join(name) | ||
| } | ||
|
|
||
| #[test] | ||
| fn test_elf_import_export_extraction() { | ||
| let data = fs::read(&get_fixture_path("test_binary_elf")).expect("..."); | ||
| let parser = ElfParser::new(); | ||
| let info = parser.parse(&data).expect("..."); | ||
| // Verify imports/exports with specific assertions | ||
| } | ||
| ``` | ||
|
|
||
| ## Development Workflow | ||
|
|
||
| ### Common Commands (`justfile`) | ||
|
|
||
| **Setup**: `just setup` (installs rustfmt, clippy, llvm-tools-preview, mdformat) | ||
|
|
||
| **Development**: | ||
|
|
||
| - `just build` - Debug build | ||
| - `just test` - Run tests with nextest | ||
| - `just lint` - Full lint suite (rustfmt, clippy, actionlint, cspell, markdown) | ||
| - `just check` - Pre-commit checks + lint | ||
| - `just run <file>` - Run binary against test file | ||
|
|
||
| **Code Quality**: | ||
|
|
||
| - `just fmt` - Format Rust/markdown/YAML/JSON | ||
| - `just fix` - Auto-fix clippy warnings with `--fix` | ||
| - `just coverage` - Generate LCOV coverage report | ||
|
|
||
| **CI Parity**: `just ci-check` (runs full CI suite locally) | ||
|
|
||
| ### Windows vs Unix | ||
|
|
||
| The `justfile` uses OS annotations (`[windows]`/`[unix]`) for cross-platform compatibility: recipes run under PowerShell on Windows and bash on Unix. | ||
|
|
||
| ## Dependencies & Crates | ||
|
|
||
| **Core parsing**: `goblin` (ELF/PE/Mach-O), `pelite` (PE resources)\ | ||
| **CLI**: `clap` with derive macros\ | ||
| **Error handling**: `thiserror`\ | ||
| **Serialization**: `serde`, `serde_json`\ | ||
| **Regex**: `regex` for classification\ | ||
| **Testing**: `insta` (snapshots), `criterion` (benchmarks), `tempfile` | ||
|
|
||
| ## Import Conventions | ||
|
|
||
| - Re-export commonly used types in `lib.rs` for ergonomic imports | ||
| - Import from `stringy::extraction` or `stringy::types`, not deeply nested paths | ||
| - Within `extraction/mod.rs`, do NOT import locally-defined types; downstream code imports from `stringy::extraction` | ||
|
|
||
| ## What NOT to Do | ||
|
|
||
| - Don't use `async` (this is a synchronous CLI tool) | ||
| - Don't add `unsafe` blocks (forbidden) | ||
| - Don't ignore clippy warnings (they're errors) | ||
| - Don't create files >600 lines without splitting | ||
| - Don't use blanket `#[allow]` on modules/files | ||
| - Don't guess at section weights (refer to existing parsers in `container/`) | ||
|
|
||
| ## Current Implementation Status | ||
|
|
||
| **Complete**: | ||
|
|
||
| - ELF/PE/Mach-O format detection and parsing | ||
| - ASCII, UTF-8, UTF-16LE/BE string extraction | ||
| - PE resource string extraction (VERSIONINFO, STRINGTABLE, MANIFEST) | ||
| - String deduplication with occurrence tracking | ||
| - IPv4/IPv6, URL, domain classification | ||
|
|
||
| **In Progress**: | ||
|
|
||
| - Full semantic classification suite (GUIDs, paths, format strings, Base64) | ||
| - Ranking/scoring algorithm implementation | ||
| - CLI (`main.rs` is a placeholder) | ||
| - Output formatters (JSON, YARA-friendly, human-readable) | ||
|
|
||
| ## Quick Reference Examples | ||
|
|
||
| **Adding a new section weight** (in `container/elf.rs`, `pe.rs`, or `macho.rs`): | ||
|
|
||
| ```rust | ||
| let weight = match section_name { | ||
| ".mydata" => 8.0, // New section type | ||
| _ => existing_match_arms | ||
| }; | ||
| ``` | ||
|
|
||
| **Extracting strings from a section**: | ||
|
|
||
| ```rust | ||
| use stringy::extraction::{extract_ascii_strings, AsciiExtractionConfig}; | ||
| let config = AsciiExtractionConfig { min_length: 4, max_length: 1024 }; | ||
| let strings = extract_ascii_strings(&section_data, &config); | ||
| ``` | ||
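For readers unfamiliar with what such an extractor does, the following std-only sketch scans a byte slice for runs of printable ASCII. It is not Stringy's implementation: `AsciiConfig` and `extract_ascii_runs` are hypothetical names, the `(offset, text)` return type is a simplification, and dropping runs longer than `max_length` is an assumed policy.

```rust
/// Hypothetical config mirroring the two fields used above.
pub struct AsciiConfig {
    pub min_length: usize,
    pub max_length: usize,
}

/// Collect runs of printable ASCII (0x20..=0x7e) with their start offsets.
pub fn extract_ascii_runs(data: &[u8], config: &AsciiConfig) -> Vec<(usize, String)> {
    let mut found = Vec::new();
    let mut start = 0usize;
    let mut current = String::new();

    for (i, &byte) in data.iter().enumerate() {
        if (0x20..=0x7e).contains(&byte) {
            if current.is_empty() {
                start = i; // remember where this run began
            }
            current.push(byte as char);
        } else {
            flush_run(&mut found, start, &mut current, config);
        }
    }
    // A run may extend to the very end of the data.
    flush_run(&mut found, start, &mut current, config);
    found
}

/// Keep the pending run if its length is within bounds, then reset it.
fn flush_run(
    found: &mut Vec<(usize, String)>,
    start: usize,
    current: &mut String,
    config: &AsciiConfig,
) {
    let len = current.len();
    if len >= config.min_length && len <= config.max_length {
        found.push((start, std::mem::take(current)));
    } else {
        current.clear();
    }
}
```

With `min_length: 4`, a two-byte run such as `ab` between NUL bytes is discarded while `hello` is kept along with its starting offset.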
|
|
||
| **Adding a semantic tag**: | ||
|
|
||
| 1. Add variant to `Tag` enum in `types.rs` | ||
| 2. Implement pattern matching in `classification/semantic.rs` | ||
| 3. Update deduplication tag merging if needed | ||
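Steps 1 and 2 above might look roughly like the sketch below, using the path and registry classification this pull request is about as the example. The `Tag` variant names, the `classify` function, and the prefix heuristics are all illustrative assumptions; the project's `classification/semantic.rs` is described as regex-based and is certainly more thorough.

```rust
/// Step 1: hypothetical `Tag` enum with path-related variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub enum Tag {
    Url,
    FilePath,
    RegistryPath,
}

/// Absolute POSIX path: a leading '/', but not a bare "/" or a "//" prefix.
fn is_posix_path(text: &str) -> bool {
    text.starts_with('/') && text.len() > 1 && !text.starts_with("//")
}

/// Windows drive path such as `C:\Windows`.
fn is_windows_path(text: &str) -> bool {
    let b = text.as_bytes();
    b.len() >= 3 && b[0].is_ascii_alphabetic() && b[1] == b':' && b[2] == b'\\'
}

/// Registry path with a full or abbreviated root key.
fn is_registry_path(text: &str) -> bool {
    text.starts_with("HKEY_") || text.starts_with("HKLM\\") || text.starts_with("HKCU\\")
}

/// Step 2: pattern matching that assigns tags to a candidate string.
pub fn classify(text: &str) -> Vec<Tag> {
    let mut tags = Vec::new();
    if text.starts_with("http://") || text.starts_with("https://") {
        tags.push(Tag::Url);
    }
    // Registry paths are checked first so they are not also tagged as files.
    if is_registry_path(text) {
        tags.push(Tag::RegistryPath);
    } else if is_posix_path(text) || is_windows_path(text) {
        tags.push(Tag::FilePath);
    }
    tags
}
```

Deriving `Ord` and `Hash` on the enum is what lets tags be unioned in a `HashSet` and then sorted during deduplication, which is why step 3 may need no change for a new variant.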