refactor(extraction): split oversized modules and wire up CLI pipeline by unclesp1d3r · Pull Request #135 · EvilBit-Labs/Stringy

unclesp1d3r · 2026-02-17T15:11:19Z

Summary

CI: Use just recipes in CI workflows instead of raw cargo commands; fix cargo-dist invocation (use dist plan binary via mise instead of cargo dist check subcommand)
Extraction refactor: Split 6 oversized files (all >500 lines) in src/extraction/ into focused module directories -- total lines unchanged, all non-test files now under 500 lines
CLI: Wire up the extraction pipeline in main.rs with format selection (--format table|json|yara) and minimum length filtering (--min-length)
Design debt cleanup: Resolve 8 pre-existing design issues surfaced during bot review

Details

CI changes

ci.yml: test, test-cross-platform, and coverage jobs now use just test-ci, just build-release, just coverage
security.yml: uses just deny, just outdated, just dist-check (fixes cargo-dist not found)
docs.yml: uses just docs-build for combined mdbook + rustdoc build
justfile: updated recipes with --all-features, added outdated recipe, fixed dist-check to use dist plan

Extraction module splits

Original file	Lines	Split into
`extraction/mod.rs`	1,576	`mod.rs` (154) + `traits.rs` + `helpers.rs` + `basic_extractor.rs` + `tests.rs`
`extraction/filters.rs`	702	`filters/mod.rs` + `filters/implementations.rs` + `filters/tests.rs`
`extraction/ascii.rs`	832	`ascii/mod.rs` + `ascii/extraction.rs` + `ascii/tests.rs`
`extraction/utf16.rs`	1,273	`utf16/mod.rs` + `utf16/config.rs` + `utf16/validation.rs` + `utf16/confidence.rs` + `utf16/extraction.rs` + `utf16/tests.rs`
`extraction/dedup.rs`	838	`dedup/mod.rs` + `dedup/scoring.rs` + `dedup/tests.rs`
`extraction/pe_resources.rs`	1,449	`pe_resources/mod.rs` + `pe_resources/detection.rs` + `pe_resources/version_info.rs` + `pe_resources/string_tables.rs` + `pe_resources/manifests.rs` + `pe_resources/tests.rs`

CLI pipeline

main.rs: reads binary, detects format, parses container, extracts strings, formats output
4 new integration tests in tests/integration_cli.rs

Design debt cleanup (8 items)

Remove dual encoding fields: Deleted encodings from ExtractionConfig, kept only enabled_encodings. Eliminated confusing OR-checks in basic_extractor.rs.
Extract shared logic: Factored duplicated section-sorting/filtering/symbol-inclusion pipeline from extract() and extract_canonical() into collect_all_strings() helper.
Fix O(n^2) dedup in Auto mode: Replaced strings.iter_mut().find() with HashMap-based O(1) lookup in utf16/mod.rs.
Remove unused parameter: Removed _preserve_all_occurrences from deduplicate() signature, ExtractionConfig, and all callsites.
Remove broken byte order heuristic: check_byte_order_consistency() operated on decoded u16 values (identical regardless of byte order), making LE always score 1.0 and BE always 0.0. Removed and redistributed weight.
Fix char_count semantics: Changed char_count += 1 to += consumed_units in UTF-16 extraction so min/max length checks count code units, not scalar values.
Replace Result<_, ()>: Changed 3 UTF-16 decode functions to return StringyError::ParseError, removed #[allow(clippy::result_unit_err)].
Add surrogate noncharacter validation: Reject U+xFFFE/U+xFFFF in supplementary planes (planes 1-16), matching existing BMP noncharacter check.

Test plan

just ci-check passes locally (478 tests, zero clippy warnings)
CI passes on all platforms (ubuntu, macos, windows)
CodeRabbit/Copilot review feedback addressed
Pre-existing design debt resolved (8 items)

Generated with Claude Code

Centralize CI commands in justfile as single source of truth: - Replace raw cargo commands with just recipes in test, test-cross-platform, coverage, security, and docs workflows - Fix cargo-dist issue in security workflow by using `dist plan` via mise exec instead of `cargo dist check` (subcommand not found) - Add --all-features to test-ci, build-release, and coverage recipes - Add outdated recipe for cargo-outdated checks - Fix dist-check recipe to use `dist plan` (check subcommand does not exist in cargo-dist 0.30.3) Quality and MSRV jobs unchanged (use dtolnay Rust, not mise). Signed-off-by: Kevin Melton <kevin@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Move the ~400-line `impl StringExtractor for BasicExtractor` block from mod.rs into its own file, reducing mod.rs from 1038 to 630 lines. The new basic_extractor.rs (421 lines) contains the core extraction engine that orchestrates ASCII, UTF-8, and UTF-16 string extraction across binary sections. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Convert the 702-line filters.rs into a module directory with three focused files: - mod.rs (193 lines): FilterContext, NoiseFilter trait, CompositeNoiseFilter - implementations.rs (394 lines): all 6 filter structs and NoiseFilter impls - tests.rs (131 lines): all test code All public re-exports remain unchanged so the API is not affected. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Convert src/extraction/ascii.rs (832 lines) into a module directory with three focused files: - mod.rs (147 lines): AsciiExtractionConfig, is_printable_ascii(), re-exports - extraction.rs (316 lines): extract_ascii_strings(), extract_from_section() - tests.rs (378 lines): all unit tests All files are under the 500-line limit. All 473 tests pass. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Convert the 1,273-line src/extraction/utf16.rs into a module directory with six focused files, each under 500 lines: - mod.rs (122 lines): ByteOrder enum, extract_utf16_strings orchestrator, re-exports - config.rs (97 lines): Utf16ExtractionConfig struct and constructors - validation.rs (281 lines): UTF-16 sequence validation, printability checks, null patterns - confidence.rs (201 lines): Confidence scoring heuristics (ASCII ratio, printable ratio, etc.) - extraction.rs (380 lines): Core extraction functions and section-aware extraction - tests.rs (238 lines): All unit tests All public exports remain identical. No functional changes. Signed-off-by: Kyle Melton <kyle@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Convert the 838-line dedup.rs into a module directory with three files: - mod.rs (237 lines): CanonicalString, StringOccurrence structs, deduplicate(), merge_tags(), found_string_to_occurrence(), CanonicalString::to_found_string() - scoring.rs (56 lines): calculate_combined_score() - tests.rs (555 lines): All test code Public API and re-exports remain unchanged. Signed-off-by: Kevin Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Convert the largest file (1,449 lines) into a module directory with six focused files, each under 500 lines: - mod.rs (230): constants, entry points, orchestrator, decode_utf16le - detection.rs (208): detect_version_info, detect_string_tables, detect_manifests - version_info.rs (76): extract_version_info_strings - string_tables.rs (200): parse_string_table_block, extract_string_table_strings - manifests.rs (201): detect_manifest_encoding, decode_manifest, extract_manifest_strings - tests.rs (594): all test code All 473 tests pass. Public API unchanged. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Document the CI workflow patterns (dtolnay vs mise, just recipes as single source of truth) and Mergify behavior in AGENTS.md. Add the extraction refactor and CLI implementation plan for project history. Signed-off-by: Keith Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

coderabbitai · 2026-02-17T15:11:41Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

This pull request implements a comprehensive refactor of the extraction module architecture, restructuring flat extraction modules (ASCII, UTF-16, PE resources, deduplication) into modular subdirectories with trait-based abstractions. It introduces a BasicExtractor implementation, noise filtering framework, semantic enrichment helpers, and CLI wiring for binary analysis with multiple output formats (Table, JSON, YARA), complemented by CI/workflow automation updates and integration test coverage.

Changes

Cohort / File(s)	Summary
CI & Workflow Automation `.github/workflows/ci.yml`, `.github/workflows/docs.yml`, `.github/workflows/security.yml`, `justfile`	Unified CI commands to use `just` recipes (e.g., `just test-ci`, `just build-release`, `just coverage`) as single source of truth; added `--all-features` flags to test/build targets and new `outdated` target.
Documentation & Planning `AGENTS.md`, `docs/plans/2026-02-16-extraction-refactor-and-cli.md`	Added CI Architecture documentation describing justfile-driven workflows and merge protections; comprehensive multi-phase refactor plan detailing module modularization (Phase 1), cross-cutting extraction (Phase 2), and CLI wiring (Phase 3) with commit sequences and verification steps.
Extraction Traits & Architecture `src/extraction/traits.rs`, `src/extraction/mod.rs`	Introduced StringExtractor trait with `extract()`, `extract_from_section()`, and `extract_canonical()` methods; ExtractionConfig with validation; exposed traits via public re-export; reorganized top-level module with new traits and basic_extractor submodules.
ASCII Extraction Modularization `src/extraction/ascii/mod.rs`, `src/extraction/ascii/extraction.rs`, `src/extraction/ascii/tests.rs`	Restructured flat `ascii.rs` (832 lines removed) into modular hierarchy with configuration struct, extraction functions, section-aware processing, printable ASCII utility, and comprehensive unit/edge-case tests (380 lines).
UTF-16 Extraction Modularization `src/extraction/utf16/mod.rs`, `src/extraction/utf16/config.rs`, `src/extraction/utf16/extraction.rs`, `src/extraction/utf16/validation.rs`, `src/extraction/utf16/confidence.rs`, `src/extraction/utf16/tests.rs`	Replaced flat `utf16.rs` (1273 lines removed) with modular architecture: ByteOrder enum, configuration, extraction with section metadata, UTF-16 validation/surrogate handling, confidence scoring heuristics, and comprehensive tests (238 lines) covering LE/BE/Auto detection and alignment scanning.
Deduplication Modularization `src/extraction/dedup/mod.rs`, `src/extraction/dedup/scoring.rs`, `src/extraction/dedup/tests/mod.rs`, `src/extraction/dedup/tests/basic.rs`, `src/extraction/dedup/tests/scoring.rs`	Restructured flat `dedup.rs` (838 lines removed) into modular structure with CanonicalString/StringOccurrence types, deduplication logic with threshold-based grouping, scoring function (base + bonuses for occurrences/cross-section/multi-source/confidence), and extensive test coverage (520+ lines) for edge cases, bonuses, and threshold behavior.
Noise Filtering Framework `src/extraction/filters/mod.rs`, `src/extraction/filters/implementations.rs`, `src/extraction/filters/tests.rs`	Introduced FilterContext struct carrying section metadata, NoiseFilter trait with `calculate_confidence()`, and CompositeNoiseFilter aggregating six sub-filters (Entropy, CharDistribution, Linguistic, Length, Repetition, Context) with configurable weights and enabling flags; includes 131 lines of filter integration tests.
PE Resource Extraction Modularization `src/extraction/pe_resources/mod.rs`, `src/extraction/pe_resources/detection.rs`, `src/extraction/pe_resources/manifests.rs`, `src/extraction/pe_resources/string_tables.rs`, `src/extraction/pe_resources/version_info.rs`, `src/extraction/pe_resources/tests/...`	Replaced flat `pe_resources.rs` (1449 lines removed) with modular structure: resource enumeration (Phase 1 detection), string extraction (Phase 2) for VERSIONINFO/STRINGTABLE/MANIFEST via pelite, manifest encoding detection, UTF-16LE decoding, and comprehensive test hierarchy (640+ lines) with fixture-driven detection/extraction validation.
BasicExtractor Implementation `src/extraction/basic_extractor.rs`, `src/extraction/tests.rs`	Implemented StringExtractor trait with multi-encoding orchestration (ASCII/UTF-8/UTF-16LE/UTF-16BE per section), section prioritization, symbol inclusion (ImportName/ExportName), optional noise filtering/deduplication, semantic enrichment, and canonicalization support; includes 475 lines of integration tests validating extraction across encodings, filtering, symbols, and boundaries.
Semantic Enrichment & Helpers `src/extraction/helpers.rs`	Added semantic enrichment function integrating SemanticClassifier and SymbolDemangler with StringContext; UTF-8/ASCII extraction helpers with min/max length filtering and printable text validation; supports noise-filter based confidence scoring (273 lines).
CLI Implementation & Integration `src/main.rs`, `tests/integration_cli.rs`	Implemented CLI with output format selection (Table/Json/Yara), min_length configuration, extraction pipeline orchestration via BasicExtractor, binary format detection, and output formatting; added four CLI integration tests validating file handling, JSON output validity, error cases, and min_length filtering behavior.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Rationale: This refactor spans 50+ files with heterogeneous changes: comprehensive module restructuring (ASCII/UTF-16/PE resources/dedup), trait-based architecture design, complex extraction pipelines with multi-encoding support and configurable filtering, new noise filtering framework with six sub-filters and scoring heuristics, PE resource modularization requiring Pelite integration understanding, and CLI wiring. The density of interconnected logic—section-aware metadata population, confidence computation across multiple heuristics, threshold-based deduplication with bonus scoring, surrogate-pair handling in UTF-16—requires careful verification of correctness, false-positive reduction in classification, and proper error handling in parsing/decoding paths.

Possibly related PRs

feat(classification): add semantic pattern matching for GUID, email, base64, format strings, and user agents #131: Modifies extraction type signatures and semantic classification APIs; directly related through StringContext and SemanticClassifier integration patterns used in helpers module.
chore(ci): add OSSF and CI best practices #132: Modifies CI configuration and AGENTS.md documentation; overlaps with workflow and CI tooling changes in this PR.
feat(parse): implement string deduplication with metadata preservation #117: Implements string deduplication feature (CanonicalString/StringOccurrence/deduplicate); directly mirrors the dedup module restructuring introduced here.

Poem

🧬 Strings extracted from binaries deep,
Modules modularized, secrets to keep,
UTF-16 dances, ASCII in line,
Traits and filters make extraction shine,
From PE resources to CLI—one design. ✨

🚥 Pre-merge checks | ✅ 6 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Error Handling	⚠️ Warning	PR violates error handling conventions by returning Vec instead of Result for PE resource extraction functions, silently suppressing parsing errors from pelite.	Convert all resource extraction functions to return Result<Vec, StringyError> and replace error suppression patterns with proper ? operator propagation.
Ascii Only	⚠️ Warning	Pull request introduces Unicode smart quotes (U+201D, U+201C) and construction emoji (🚧) violating ASCII-only requirement across multiple source files and documentation.	Replace smart quotes with ASCII equivalents ("), remove emoji, and enforce ASCII-only strings via project linter before merge.

✅ Passed checks (6 passed)

Check name	Status	Explanation
Title check	✅ Passed	Title follows Conventional Commits format with valid type (refactor) and scope (extraction), and accurately describes the main changes: module splitting and CLI wiring.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 70.00%.
No Unsafe Code	✅ Passed	The crate enforces safety through forbid(unsafe_code) declarations in src/lib.rs and src/main.rs, preventing compilation of any unsafe constructs.
File Size Limit	✅ Passed	All extraction module files comply with 500-line limit; largest at 497 lines. No blanket #[allow] attributes detected. Refactor achieves focused, maintainable module organization.
Section Weight Consistency	✅ Passed	PR does not modify container parser files; conditional requirement for weight consistency verification is not triggered.
Description check	✅ Passed	The PR description comprehensively documents all major changes: CI workflow updates to use justfile recipes, systematic refactoring of six oversized extraction modules into focused subdirectories, CLI pipeline wiring with format and length options, and resolution of eight design debt items.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch fix_dist_issues

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

mergify · 2026-02-17T15:11:56Z

codecov · 2026-02-17T15:15:56Z

Codecov Report

❌ Patch coverage is 84.04385% with 262 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/extraction/pe_resources/manifests.rs	21.15%	82 Missing ⚠️
src/extraction/utf16/validation.rs	65.95%	48 Missing ⚠️
src/extraction/traits.rs	58.13%	36 Missing ⚠️
src/extraction/pe_resources/detection.rs	66.66%	33 Missing ⚠️
src/extraction/basic_extractor.rs	90.78%	21 Missing ⚠️
src/extraction/pe_resources/string_tables.rs	90.00%	11 Missing ⚠️
src/extraction/utf16/extraction.rs	93.45%	11 Missing ⚠️
src/extraction/utf16/confidence.rs	89.47%	6 Missing ⚠️
src/extraction/helpers.rs	96.71%	5 Missing ⚠️
src/extraction/utf16/mod.rs	90.32%	3 Missing ⚠️
... and 5 more

📢 Thoughts on this report? Let us know!

mergify · 2026-02-17T15:16:37Z

🧪 CI Insights

Here's what we observed from your CI run for 9c5da41.

🟢 All jobs passed!

But CI Insights is watching 👀

coderabbitai

Actionable comments posted: 33

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/extraction/filters/implementations.rs (1)
297-302: 🧹 Nitpick | 🔵 Trivial

Redundant cast - total is already f32.

Line 299 has (total as usize) as f32 but total is already declared as f32 on line 278. This double-cast is unnecessary.
Suggested fix
-            let min_pattern_len_for_3_reps = ((total as usize) as f32 / 3.0).ceil() as usize;
+            let min_pattern_len_for_3_reps = (total / 3.0).ceil() as usize;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/extraction/filters/implementations.rs` around lines 297 - 302, The
expression computing min_pattern_len_for_3_reps contains redundant casts of
total; since total is already an f32, replace the double-cast `((total as usize)
as f32 / 3.0).ceil() as usize` with a direct floating calculation like `(total /
3.0).ceil() as usize` in the block that uses variables total and max_pattern_len
so min_pattern_len_for_3_reps is computed without needless casting.
src/extraction/mod.rs (1)
141-151: ⚠️ Potential issue | 🟠 Major

Address re-export naming inconsistency for extract_from_section.

The re-export in mod.rs aliases UTF-16's extract_from_section to extract_utf16_from_section (line 149) but leaves ASCII's version unaliased (line 141). However, test code imports directly from submodules (e.g., test_utf16_extraction.rs imports extract_from_section from the UTF-16 module), bypassing the top-level aliases entirely. This defeats the aliasing intent and violates the guideline requiring imports from stringy::extraction, not nested paths.

Ensure consistent re-export naming: either alias both functions (ASCII as extract_ascii_from_section, UTF-16 as extract_utf16_from_section), or export both as extract_from_section with qualified module paths. Update test imports to use top-level re-exports via stringy::extraction to align with project conventions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/extraction/mod.rs` around lines 141 - 151, The top-level re-export naming
is inconsistent: ascii::extract_from_section is exported as extract_from_section
while utf16::extract_from_section is aliased to extract_utf16_from_section;
update mod.rs to make these symmetric and adjust tests to import from
stringy::extraction. Either (recommended) alias both exports to explicit names
(extract_ascii_from_section and extract_utf16_from_section) by changing the
ascii re-export to export extract_from_section as extract_ascii_from_section and
keep the utf16 alias extract_utf16_from_section, or export both as
extract_from_section within their modules and update tests to import via the
top-level path (stringy::extraction::{extract_ascii_from_section,
extract_utf16_from_section} or stringy::extraction::extract_from_section if you
choose unambiguous names); ensure tests like test_utf16_extraction.rs import
from stringy::extraction instead of nested module paths.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@AGENTS.md`:
- Around line 70-77: Update the first bullet under "CI Architecture" to state
that CI workflows use `just` recipes as the single source of truth except for
Quality/MSRV jobs; explicitly mention that Quality/MSRV jobs use
`dtolnay/rust-toolchain` and must not use `just` recipes because they rely on
`mise exec --` which conflicts, so readers immediately see the exception before
the separate bullet about Quality/MSRV.

In `@docs/plans/2026-02-16-extraction-refactor-and-cli.md`:
- Around line 59-63: Several fenced code blocks (including the one containing
the git add/git commit snippet) lack language specifiers; update those fenced
blocks to include an appropriate shell language tag (e.g., ```bash or ```sh) so
tooling and syntax highlighting work correctly, ensuring you add the specifier
to both opening fences and any other command/code blocks referenced in this
document.

In `@src/extraction/ascii/extraction.rs`:
- Around line 120-169: The end-of-buffer branch duplicates FoundString
construction; extract a helper function or closure (e.g., make_found_string)
that takes the taken bytes, start, len and returns a FoundString, then replace
the duplicated blocks in the max_length branch and the else branch (and any
other places in this loop) with strings.push(make_found_string(bytes, start,
len)); ensure the helper sets the same fields used now (text from
String::from_utf8, Encoding::Ascii, offset from current_string_start, length,
source StringSource::SectionData, confidence 1.0, etc.) and use
std::mem::take(&mut current_string_bytes) only once per invocation.

In `@src/extraction/ascii/tests.rs`:
- Around line 277-293: In test_extract_from_section_rva_calculation replace the
conditional guard so the test fails if no strings were extracted: after calling
create_test_section, AsciiExtractionConfig::default, and extract_from_section,
assert that strings is non-empty (e.g. assert!(!strings.is_empty()) or
assert_eq!(strings.len(), 1)) and then perform the existing assertions on
strings[0].offset and strings[0].rva; this removes the silent pass and ensures
extract_from_section actually produced the expected string.

In `@src/extraction/basic_extractor.rs`:
- Around line 19-125: The extract() and extract_canonical() implementations in
BasicExtractor duplicate section sorting, filtering, symbol inclusion and
semantic enrichment; extract that shared logic into a private helper fn (e.g.,
extract_all_strings(&self, data: &[u8], container_info: &ContainerInfo, config:
&ExtractionConfig) -> Result<Vec<FoundString>>) which performs the section sort
(same key logic), applies the Debug/code filters, calls
self.extract_from_section for each section, includes import/export symbols when
config.include_symbols is true, and calls apply_semantic_enrichment; then have
extract() call this helper and only run deduplication via
deduplicate(...)/to_found_string conversion when config.enable_deduplication,
and have extract_canonical() call the same helper but return the
canonical/deduped output as needed.
- Around line 277-278: The code is checking both config.encodings and
config.enabled_encodings (e.g., in the ascii_enabled and utf8_enabled
computations using Encoding::Ascii/Encoding::Utf8), indicating legacy dual-field
handling; consolidate to a single source of truth by standardizing all checks to
one field (pick config.encodings or rename to config.enabled_encodings
consistently), update occurrences where ascii_enabled and utf8_enabled are
computed to reference only the canonical field, and add a brief inline comment
like "// TODO: deprecate enabled_encodings" next to the kept field to signal
migration; ensure any helper functions or downstream uses (the variables
ascii_enabled, utf8_enabled) are updated to the single-field logic so behavior
is unchanged.
- Around line 261-264: The current overflow handling for computing end_offset
uses checked_add(...).unwrap_or(data.len()).min(data.len()) which is defensive
but can be clearer; change the computation in the end_offset assignment to use
saturating_add on section_offset (i.e.,
section_offset.saturating_add(section_size)) and then .min(data.len()) to
express explicit saturation semantics for end_offset (referencing end_offset,
section_offset, section_size, and data.len()).

In `@src/extraction/dedup/mod.rs`:
- Around line 212-236: to_found_string currently index-accesses
self.occurrences[0] which panics if CanonicalString.occurrences is empty; change
to use self.occurrences.first() and make the function return Option<FoundString>
(or Result) so it can return None when there are no occurrences, compute
max_confidence by iterating over occurrences (or return None early), and build
the FoundString from the first occurrence when present (preserving fields like
text, original_text, encoding, offset, rva, section, length, tags = merged_tags,
score = combined_score, source, confidence = max_confidence); update callers of
to_found_string accordingly.
- Around line 228-237: The code unnecessarily clones `fs` before calling
`found_string_to_occurrence(fs.clone())` inside the iterator consumer; since
`fs` is already moved by `into_iter()` you should pass `fs` directly to
`found_string_to_occurrence` to avoid an extra clone—locate the iterator that
calls `into_iter()` and replace `found_string_to_occurrence(fs.clone())` with
`found_string_to_occurrence(fs)` (the surrounding context includes identifiers
like found_string_to_occurrence, into_iter, fs, merged_tags, and
combined_score).

In `@src/extraction/dedup/tests.rs`:
- Around line 17-27: The UTF-16 byte-length computation in the match for
Encoding (where length is set from text.chars().count() * 2) undercounts
characters that need surrogate pairs; either update the calculation to use
text.encode_utf16().count() * 2 to count UTF-16 code units correctly, or keep
the current code but change the comment to explicitly state it assumes BMP-only
characters for these tests; modify the match arm and the inline comment near the
Encoding handling (the length variable and the text.* call) accordingly.
- Line 6: Add an inline justification comment for the
#[allow(clippy::too_many_arguments)] attribute on the test helper so it
documents why many parameters are necessary (e.g., "helper constructs test data
with full metadata; multiple args required for clarity and coverage"). Place the
comment immediately after the attribute on the same line or directly above it,
referencing the existing attribute #[allow(clippy::too_many_arguments)] so
reviewers can see the rationale for the exception.

In `@src/extraction/helpers.rs`:
- Around line 124-125: Add an inline justification comment for the
#[allow(clippy::collapsible_if)] attribute where it appears (the conditional
checking current_string_bytes.len() against min_length and max_length and the
similar block later) by appending a short rationale on the same line or
immediately after the attribute explaining why collapsing the ifs would break
the borrow/ownership intent (e.g., "allow because collapsing the ifs would
change borrow lifetimes and cause borrow-checker errors"); update both
occurrences to include this inline justification.

In `@src/extraction/pe_resources/manifests.rs`:
- Around line 68-98: The inline UTF-16BE decoding in decode_manifest should be
extracted into a shared helper (e.g., pub(crate) fn decode_utf16be(bytes: &[u8])
-> Result<String>) alongside the existing decode_utf16le in the parent
pe_resources module; implement the same logic currently in the match arm
(chunks_exact(2) -> u16::from_be_bytes -> String::from_utf16 -> trim nulls ->
map Err to StringyError::EncodingError) and then replace the UTF-16BE arm in
decode_manifest to call decode_utf16be(data) for consistency and reuse.

In `@src/extraction/pe_resources/tests.rs`:
- Around line 107-113: The test function
test_detect_version_info_empty_directory is empty and provides no coverage;
either remove it or implement it by creating a minimal fixture (an empty
RT_VERSION directory) and asserting the behavior of the code path (e.g., call
extract_resources(...) or the detect_version_info(...) helper used by
pe_resources and assert it returns no entries/does not panic). Update the test
to set up the empty directory, invoke extract_resources/detect_version_info, and
add an assertion on the returned result (empty list/Ok value) or delete the
empty test; apply the same fix to the other empty tests in tests.rs.
- Around line 1-594: The test file is getting large and should be split for
maintainability; extract resource-detection tests into detection_tests.rs and
string-extraction tests into extraction_tests.rs. Move tests that primarily
exercise extract_resources, VERSIONINFO/STRINGTABLE/MANIFEST detection and
fixture-based resource metadata checks into detection_tests.rs (any tests
referencing extract_resources, ResourceType, ResourceMetadata assertions), and
move Phase 2 string-focused tests (decode_utf16le, detect_manifest_encoding,
extract_version_info_strings, extract_string_table_strings,
extract_manifest_strings, extract_resource_strings and their fixture-based
variants) into extraction_tests.rs; keep or move the shared get_fixture_path
helper into a small tests/mod.rs or a shared utils file and update the tests
module declarations to pub use/declare those helpers so both new files can
access it. Finally, update the tests mod (or Cargo test configuration) to
include the two new files so cargo test still discovers all tests.
- Around line 99-105: The test computes `_has_version` from
`resources.iter().any(|r| matches!(r.resource_type, ResourceType::VersionInfo))`
but never asserts it; change this to an assertion that verifies absence (e.g.,
`assert!(!resources.iter().any(|r| matches!(r.resource_type,
ResourceType::VersionInfo)))`) or assign to `has_version` and
`assert_eq!(has_version, false)` so the test explicitly checks for no
VERSIONINFO resources; update the assertion in the test function that uses
`resources` to reflect this.
- Around line 363-370: The assertion is tautological because
`!resources.is_empty()` makes the `has_version_info`/`has_string_table` checks
redundant; change the check to explicitly detect other resource types instead.
Replace the `assert!(has_version_info || has_string_table ||
!resources.is_empty(), ...)` with a check that uses `resources.iter().any(|r|
!matches!(r.resource_type, ResourceType::VersionInfo |
ResourceType::StringTable))`, i.e. assert `has_version_info || has_string_table
|| resources.iter().any(|r| !matches!(r.resource_type, ResourceType::VersionInfo
| ResourceType::StringTable))` so you only pass when either known types are
present or some other resource type exists (referencing `has_version_info`,
`has_string_table`, `resources` and
`ResourceType::VersionInfo/ResourceType::StringTable`).

In `@src/extraction/pe_resources/version_info.rs`:
- Around line 21-35: In extract_version_info_strings, currently all Err branches
for PeFile::from_bytes, pe.resources(), and resources.version_info() silently
return an empty Vec; change these to log the error (at debug/trace) including
the returned Err value so parsing failures are visible while still returning an
empty Vec for the caller; specifically update the match arms around
PeFile::from_bytes, pe.resources(), and resources.version_info() to capture the
error (e.g., err) and call the project logger (or tracing::debug!) with context
like "failed to parse PE", "failed to get resources", or "failed to get
version_info" plus the err, keeping the final behavior of
extract_version_info_strings unchanged.

In `@src/extraction/tests.rs`:
- Around line 199-236: The ExtractionConfig API is redundant: remove the
obsolete encodings field and update all uses to rely solely on
enabled_encodings; update the ExtractionConfig struct and its Default impl to
drop encodings, change BasicExtractor's checks (e.g. the
ascii_enabled/utf8_enabled/utf16le_enabled/utf16be_enabled logic in
basic_extractor.rs) to reference only config.enabled_encodings.contains(...),
and update tests (e.g. test_basic_extractor_encoding_filtering in tests.rs) to
construct ExtractionConfig with enabled_encodings only. Ensure any
serialization/constructor code and pattern matches expecting encodings are
adjusted or migrated to enabled_encodings to avoid compilation errors.

In `@src/extraction/traits.rs`:
- Around line 43-58: The struct currently has two redundant fields, encodings
and enabled_encodings (both Vec<Encoding> with the same defaults); remove one to
eliminate ambiguity (prefer keeping the more clearly named enabled_encodings) or
consolidate them by renaming/merging encodings into enabled_encodings and update
all usages accordingly (search for references to encodings and
enabled_encodings, e.g., in parsing, defaults, and tests), preserve the original
default values and behavior when merging, and update the doc comment for the
remaining field to explain its purpose and remove the duplicate doc entry.
- Around line 119-156: ExtractionConfig::validate currently misses checks for
inverted/zero bounds and dedup settings; update validate() to ensure max_length
> 0, max_length >= min_length (or strictly > if you prefer to forbid equality),
and if dedup_threshold is Some(t) then t > 0 (reject zero). Specifically, inside
the validate method add checks referencing self.max_length (ensure > 0 and >=
self.min_length or > self.min_length) and self.dedup_threshold (use map or match
to assert t > 0 and return StringyError::ConfigError with a clear message on
failure) so invalid configs are rejected.

In `@src/extraction/utf16/confidence.rs`:
- Around line 90-94: The check for chars.is_empty() is redundant because the
function already returns for empty input earlier (lines ~68-70); remove the
if/else and simply return printable_count as f32 / chars.len() as f32 (using the
existing chars and printable_count identifiers) so the unreachable branch is
eliminated and the calculation is simplified.

In `@src/extraction/utf16/extraction.rs`:
- Around line 95-98: Add an inline justification comment for the
#[allow(clippy::result_unit_err)] on the decode_utf16le_bytes function: explain
that the unit error type is intentional because decode_utf16le returns no
actionable error information and callers only need success/failure, referencing
decode_utf16le and decode_utf16le_bytes to locate the rationale; keep the
comment brief and on the same line or immediately above the attribute.

In `@src/extraction/utf16/mod.rs`:
- Around line 86-98: The current add_with_dedup closure does O(n²) work by
scanning strings with strings.iter_mut().find(); replace it with a HashMap-based
dedup: create a HashMap keyed by (FoundString.offset, FoundString.encoding,
FoundString.text) mapping to the index in strings, update that map as you push
or replace entries, and when a duplicate is seen compare confidence to
strings[index].confidence and either replace strings[index] or skip; update or
insert the map entry accordingly (use the existing FoundString fields offset,
encoding, text, confidence and the add_with_dedup closure to locate where to
change the logic).

In `@src/extraction/utf16/tests.rs`:
- Around line 218-224: Update the test_utf16_odd_length_data test to assert a
concrete expected outcome instead of only ensuring no panic: call
extract_utf16_strings(data, &Utf16ExtractionConfig::default()) and assert that
the returned Vec either is empty or contains the partial decoded string "He"
(since input pairs decode to 'H','e' and the final 0x6C is an incomplete pair),
e.g. assert!(strings.is_empty() || strings.contains(&"He".to_string()));
reference the test function name, Utf16ExtractionConfig, and
extract_utf16_strings to locate and update the test.
- Around line 132-140: The test currently asserts be_string.is_some() and then
enters an if let Some(s) to re-check s.text and s.encoding redundantly; replace
the assert! + if-let pattern with a single unwrap (or use expect) on the result
of strings.iter().find(...) and then assert_eq! directly on the unwrapped value
to remove duplicated checks — locate the be_string variable created from
strings.iter().find(...), the Encoding::Utf16Be reference, and the subsequent
assert_eq! calls to update accordingly.

In `@src/extraction/utf16/validation.rs`:
- Around line 133-134: The #[allow(dead_code)] on is_valid_utf16_sequence lacks
the required justification; add a brief comment above the attribute explaining
why the function is intentionally kept (for example: "kept for future use",
"used by tests/benchmarks", or "exported for FFI but referenced conditionally")
so the attribute is justified and complies with guidelines; update the comment
to reference the function name is_valid_utf16_sequence to make the intent clear.
- Around line 116-120: The legacy wrapper is_printable_utf16le_char currently
passes None for next_code_unit so it cannot validate UTF-16 surrogate pairs
(high surrogates 0xD800–0xDBFF will always return false); update the function's
docstring to explicitly state this limitation and that callers needing
surrogate-pair validation should call is_printable_code_unit_or_pair (or use a
variant that accepts a next_code_unit) instead of relying on this
backward-compatibility wrapper. Mention the behavior for high and low surrogates
and the recommended alternative function name (is_printable_code_unit_or_pair)
in the docs.
- Around line 225-229: The branch checking chars.is_empty() at the end is
unreachable because an early return already handles the empty case; remove the
if/else and replace it with a single return of valid_count as f32 / chars.len()
as f32 (using the existing chars and valid_count variables) so the function
simply computes and returns the ratio without the redundant empty check.

In `@src/main.rs`:
- Around line 47-49: Change the short flag for the min_length CLI option to
avoid ambiguity: update the #[arg(...)] attribute on the min_length field
(currently #[arg(long, short = 'l', default_value_t = 4)]) to use a clearer
short flag such as 'n' or 'm' (e.g., short = 'n'), and update any related
usages, help text, README examples, and tests that reference the old -l flag to
the new short form to keep CLI docs and behavior consistent.
- Around line 74-75: OutputMetadata::new is being called with strings.len() for
both total_count and filtered_count, causing identical values; either document
this is intentional or pass a pre-filter total count instead. If there truly is
no post-extraction filtering, add a brief comment near the call (referencing
OutputMetadata::new, binary_name, output_format, and strings.len()) stating
total and filtered counts are intentionally identical; otherwise compute and use
the pre-filter total (capture the length of the raw/extracted collection before
any filtering into a variable like total_count) and pass that as the first
argument while passing strings.len() as filtered_count.

In `@tests/integration_cli.rs`:
- Around line 33-41: The test cli_invalid_file currently only checks the exit
status; update it to also validate stderr from
Command::new(env!("CARGO_BIN_EXE_stringy")) by converting output.stderr (bytes)
to a UTF-8 string and asserting it contains a helpful error message (e.g.,
non-empty or matches an expected substring such as "No such file" or "error");
modify the test to read let stderr = String::from_utf8_lossy(&output.stderr) and
add an assertion like assert!(!stderr.trim().is_empty(), ...) or
assert!(stderr.contains("expected substring"), ...) to ensure user-facing
diagnostics are present.
- Around line 43-51: The cli_min_length_flag test currently only asserts
success; update the test (cli_min_length_flag) to also read output.stdout (from
Command::new(...).output()), convert it to a UTF-8 string, split into lines (or
parse the expected output format), and assert that every returned string/item
has length >= 20 (or at least one string meets that criteria depending on
expected behavior). Use the existing Command invocation and output variable
names (output) and add assertions on String::from_utf8_lossy(&output.stdout) (or
equivalent) to verify the min-length flag actually filtered/affected results.

---

Outside diff comments:
In `@src/extraction/filters/implementations.rs`:
- Around line 297-302: The expression computing min_pattern_len_for_3_reps
contains redundant casts of total; since total is already an f32, replace the
double-cast `((total as usize) as f32 / 3.0).ceil() as usize` with a direct
floating calculation like `(total / 3.0).ceil() as usize` in the block that uses
variables total and max_pattern_len so min_pattern_len_for_3_reps is computed
without needless casting.

In `@src/extraction/mod.rs`:
- Around line 141-151: The top-level re-export naming is inconsistent:
ascii::extract_from_section is exported as extract_from_section while
utf16::extract_from_section is aliased to extract_utf16_from_section; update
mod.rs to make these symmetric and adjust tests to import from
stringy::extraction. Either (recommended) alias both exports to explicit names
(extract_ascii_from_section and extract_utf16_from_section) by changing the
ascii re-export to export extract_from_section as extract_ascii_from_section and
keep the utf16 alias extract_utf16_from_section, or export both as
extract_from_section within their modules and update tests to import via the
top-level path (stringy::extraction::{extract_ascii_from_section,
extract_utf16_from_section} or stringy::extraction::extract_from_section if you
choose unambiguous names); ensure tests like test_utf16_extraction.rs import
from stringy::extraction instead of nested module paths.

coderabbitai · 2026-02-17T15:17:26Z

+```
+git add src/extraction/traits.rs src/extraction/mod.rs
+git commit -s -m "refactor(extraction): extract traits and config to traits.rs"
+```
+


🧹 Nitpick | 🔵 Trivial

Add language specifiers to fenced code blocks.

Multiple code blocks throughout the document lack language specifiers (e.g., lines 63, 103, 142, etc.). Adding bash or shell for command blocks improves syntax highlighting and tooling support.

-``` +```bash git add src/extraction/traits.rs src/extraction/mod.rs git commit -s -m "refactor(extraction): extract traits and config to traits.rs"

<details> <summary>🧰 Tools</summary> <details> <summary>🪛 markdownlint-cli2 (0.20.0)</summary> [warning] 63-63: Fenced code blocks should have a language specified (MD040, fenced-code-language) --- [warning] 59-59: Fenced code blocks should have a language specified (MD040, fenced-code-language) </details> </details> <details> <summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @docs/plans/2026-02-16-extraction-refactor-and-cli.md around lines 59 - 63,
Several fenced code blocks (including the one containing the git add/git commit
snippet) lack language specifiers; update those fenced blocks to include an
appropriate shell language tag (e.g., bash or sh) so tooling and syntax
highlighting work correctly, ensuring you add the specifier to both opening
fences and any other command/code blocks referenced in this document.

</details>  

coderabbitai · 2026-02-17T15:17:26Z

+    // Handle string at end of buffer
+    if let Some(start) = current_string_start {
+        let len = current_string_bytes.len();
+        if len >= config.min_length {
+            // Check maximum length if configured
+            if let Some(max_len) = config.max_length {
+                if len > max_len {
+                    // Skip this string
+                } else {
+                    let bytes = std::mem::take(&mut current_string_bytes);
+                    let text = String::from_utf8(bytes).expect("ASCII bytes should be valid UTF-8");
+                    strings.push(FoundString {
+                        text,
+                        original_text: None,
+                        encoding: Encoding::Ascii,
+                        offset: start as u64,
+                        rva: None,
+                        section: None,
+                        length: len as u32,
+                        tags: Vec::new(),
+                        score: 0,
+                        section_weight: None,
+                        semantic_boost: None,
+                        noise_penalty: None,
+                        source: StringSource::SectionData,
+                        confidence: 1.0,
+                    });
+                }
+            } else {
+                let bytes = std::mem::take(&mut current_string_bytes);
+                let text = String::from_utf8(bytes).expect("ASCII bytes should be valid UTF-8");
+                strings.push(FoundString {
+                    text,
+                    original_text: None,
+                    encoding: Encoding::Ascii,
+                    offset: start as u64,
+                    rva: None,
+                    section: None,
+                    length: len as u32,
+                    tags: Vec::new(),
+                    score: 0,
+                    section_weight: None,
+                    semantic_boost: None,
+                    noise_penalty: None,
+                    source: StringSource::SectionData,
+                    confidence: 1.0,
+                });
+            }
+        }
+    }


🧹 Nitpick | 🔵 Trivial

End-of-buffer handling duplicates FoundString construction.

The FoundString creation is duplicated across the max_length check branches and the main loop. Consider extracting a helper closure or function to reduce duplication.

♻️ Suggested refactor to reduce duplication

+ // Helper to create FoundString from accumulated bytes + let make_found_string = |bytes: Vec<u8>, start: usize, len: usize| -> FoundString { + let text = String::from_utf8(bytes).expect("ASCII bytes should be valid UTF-8"); + FoundString { + text, + original_text: None, + encoding: Encoding::Ascii, + offset: start as u64, + rva: None, + section: None, + length: len as u32, + tags: Vec::new(), + score: 0, + section_weight: None, + semantic_boost: None, + noise_penalty: None, + source: StringSource::SectionData, + confidence: 1.0, + } + };

Then use strings.push(make_found_string(bytes, start, len)); in both locations.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/extraction/ascii/extraction.rs` around lines 120 - 169, The end-of-buffer branch duplicates FoundString construction; extract a helper function or closure (e.g., make_found_string) that takes the taken bytes, start, len and returns a FoundString, then replace the duplicated blocks in the max_length branch and the else branch (and any other places in this loop) with strings.push(make_found_string(bytes, start, len)); ensure the helper sets the same fields used now (text from String::from_utf8, Encoding::Ascii, offset from current_string_start, length, source StringSource::SectionData, confidence 1.0, etc.) and use std::mem::take(&mut current_string_bytes) only once per invocation.

coderabbitai · 2026-02-17T15:17:29Z

+#[test]
+fn cli_min_length_flag() {
+    let output = Command::new(env!("CARGO_BIN_EXE_stringy"))
+        .args(["tests/fixtures/test_binary_elf", "-l", "20"])
+        .output()
+        .expect("Failed to execute stringy");
+
+    assert!(output.status.success(), "Exit code: {}", output.status);
+}


🧹 Nitpick | 🔵 Trivial

min_length test doesn't verify the flag's effect.

The test only checks exit success but doesn't verify that output strings are actually >= 20 characters. Without this, the test wouldn't catch if the flag was silently ignored.

Suggested improvement

#[test] fn cli_min_length_flag() { let output = Command::new(env!("CARGO_BIN_EXE_stringy")) .args(["tests/fixtures/test_binary_elf", "-l", "20"]) .output() .expect("Failed to execute stringy"); assert!(output.status.success(), "Exit code: {}", output.status); + + // Verify all output strings meet minimum length + let stdout = String::from_utf8_lossy(&output.stdout); + for line in stdout.lines().filter(|l| !l.is_empty()) { + // In table format, the string is typically the first column + // This is a basic sanity check - adjust based on actual output format + assert!( + line.len() >= 20 || line.starts_with("Offset") || line.contains("---"), + "Output line should be >= 20 chars or be a header: {}", + line + ); + } }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

#[test]

fn cli_min_length_flag() {

let output = Command::new(env!("CARGO_BIN_EXE_stringy"))

.args(["tests/fixtures/test_binary_elf", "-l", "20"])

.output()

.expect("Failed to execute stringy");

assert!(output.status.success(), "Exit code: {}", output.status);

}

#[test]

fn cli_min_length_flag() {

let output = Command::new(env!("CARGO_BIN_EXE_stringy"))

.args(["tests/fixtures/test_binary_elf", "-l", "20"])

.output()

.expect("Failed to execute stringy");

assert!(output.status.success(), "Exit code: {}", output.status);

// Verify all output strings meet minimum length

let stdout = String::from_utf8_lossy(&output.stdout);

for line in stdout.lines().filter(|l| !l.is_empty()) {

// In table format, the string is typically the first column

// This is a basic sanity check - adjust based on actual output format

assert!(

line.len() >= 20 || line.starts_with("Offset") || line.contains("---"),

"Output line should be >= 20 chars or be a header: {}",

line

);

}

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/integration_cli.rs` around lines 43 - 51, The cli_min_length_flag test currently only asserts success; update the test (cli_min_length_flag) to also read output.stdout (from Command::new(...).output()), convert it to a UTF-8 string, split into lines (or parse the expected output format), and assert that every returned string/item has length >= 20 (or at least one string meets that criteria depending on expected behavior). Use the existing Command invocation and output variable names (output) and add assertions on String::from_utf8_lossy(&output.stdout) (or equivalent) to verify the min-length flag actually filtered/affected results.

Copilot

Pull request overview

This PR refactors the extraction module to comply with the 500-line file size limit and wires up the CLI to use the full extraction pipeline. All 6 oversized extraction files (totaling ~6,670 lines) have been split into focused module directories without changing total line count. The CLI now reads binaries, detects format, parses containers, extracts strings, and formats output.

Changes:

Split 6 oversized extraction module files into focused subdirectories (mod.rs + specialized files + tests.rs)
Wired up main.rs to implement the full extraction pipeline with format selection (table/json/yara) and min-length filtering
Updated CI workflows and justfile to use just recipes as single source of truth, fixed cargo-dist invocation

Reviewed changes

Copilot reviewed 38 out of 38 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
src/main.rs	Implements CLI pipeline: read binary → detect format → parse → extract → format output
tests/integration_cli.rs	New CLI integration tests for file handling, JSON output, error cases, and min-length flag
src/extraction/traits.rs	Extracted ExtractionConfig, StringExtractor trait, and BasicExtractor struct from mod.rs
src/extraction/helpers.rs	Extracted helper functions (semantic enrichment, printable checks, ASCII/UTF-8 extraction)
src/extraction/basic_extractor.rs	Extracted BasicExtractor StringExtractor impl (extract, extract_canonical, extract_from_section)
src/extraction/tests.rs	Extracted 570 lines of tests from mod.rs
src/extraction/filters/*	Split filters.rs into mod.rs + implementations.rs + tests.rs
src/extraction/ascii/*	Split ascii.rs into mod.rs + extraction.rs + tests.rs
src/extraction/utf16/*	Split utf16.rs into 6 files: mod.rs + config.rs + validation.rs + confidence.rs + extraction.rs + tests.rs
src/extraction/dedup/*	Split dedup.rs into mod.rs + scoring.rs + tests.rs
src/extraction/pe_resources/*	Split pe_resources.rs into 6 files: mod.rs + detection.rs + version_info.rs + string_tables.rs + manifests.rs + tests.rs
justfile	Added `--all-features` to build-release/test-ci/coverage, added `outdated` recipe, fixed dist-check to use `dist plan`
.github/workflows/*.yml	Updated CI workflows to use `just` recipes instead of raw cargo commands
AGENTS.md	Documented CI architecture: use `just` recipes in CI, cargo-dist invocation pattern
docs/plans/*.md	Implementation plan documenting the refactoring process

Copilot · 2026-02-17T15:22:19Z

+        .unwrap_or_else(|| cli.input.display().to_string());
+
+    let output_format = cli.format.to_output_format();
+    let metadata = OutputMetadata::new(binary_name, output_format, strings.len(), strings.len());


The OutputMetadata constructor is called with strings.len() for both total_strings and filtered_strings parameters. This is incorrect - all strings are deduplicated/filtered by this point, so there's no distinction between total and filtered. Either pass different values to show actual filtering stats, or if the distinction doesn't apply here, the API should be clarified.

- Fix panic in CanonicalString::to_found_string() on empty occurrences (now returns Option<FoundString>) - Add #[allow] justification comments per project rules (4 files) - Remove redundant empty checks in confidence.rs and validation.rs - Strengthen test assertions: replace conditional passes with asserts, fix unused result, fix tautological assertion, implement empty test - Add max_length validation to ExtractionConfig::validate() - Improve CLI tests: check stderr on error, verify min_length filtering - Clarify AGENTS.md CI Architecture wording Signed-off-by: Keith Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

- Fix UTF-16 length calculation in version_info.rs and string_tables.rs: use encode_utf16().count() * 2 instead of text.len() for source bytes - Fix missing trailing newline in CLI output for table/JSON formatters - Remove unnecessary clone in basic_extractor.rs canonical string path - Use encoding from map key in dedup instead of re-indexing found_strings - Implement empty test bodies for string_tables and manifests detection - Assert unused results in PE resource tests instead of discarding - Add max_length vs min_ascii_length/min_wide_length validation Signed-off-by: Keith Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Copilot

Pull request overview

Copilot reviewed 38 out of 38 changed files in this pull request and generated 6 comments.

Copilot · 2026-02-18T00:50:59Z

+    let default_output = Command::new(env!("CARGO_BIN_EXE_stringy"))
+        .arg("tests/fixtures/test_binary_elf")
+        .output()
+        .expect("Failed to execute stringy");


cli_min_length_flag compares line counts against default_output, but never asserts that the default invocation succeeded. If the default run fails and returns empty stdout, this assertion can give misleading results. Add an assert!(default_output.status.success(), ...) before comparing outputs.

Copilot · 2026-02-18T00:51:00Z

+                    is_printable_code_unit_or_pair(code_unit, next_code_unit);
+
+                if is_printable {
+                    char_count += 1;
+                    i += consumed_units * 2;


char_count is incremented by 1 even when consumed_units is 2 for surrogate pairs. This makes the length checks (min_length/max_length) effectively count Unicode scalar values, but Utf16ExtractionConfig documents min/max as UTF-16 code units. Either increment by consumed_units (so counts match the config semantics) or update the config docs and related logic to consistently use scalar-count semantics.

Copilot · 2026-02-18T00:51:00Z

+    let config = ExtractionConfig {
+        min_length: cli.min_length,
+        min_ascii_length: cli.min_length,
+        ..ExtractionConfig::default()
+    };


The CLI --min-length currently only updates min_length/min_ascii_length. If UTF-16 extraction is enabled via config, min_wide_length will remain at its default, so the flag will not consistently apply across encodings. Also consider calling config.validate() (or enforcing a non-zero constraint at the clap level) so invalid values fail fast with a clear error.

- Validate min_length at CLI level (clap range 1..) and call config.validate() before extraction - Set min_wide_length from CLI --min-length flag for consistent filtering across all encodings - Assert default CLI run succeeds before comparing output in tests Signed-off-by: Keith Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Use usize directly for min_length CLI field instead of u64 with cast. config.validate() handles the >= 1 constraint. Signed-off-by: Keith Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/main.rs`:
- Around line 47-49: Add a clap-level range validator to the min_length argument
so invalid values (like 0) are rejected at parse time rather than later in
config.validate(); update the struct field attribute for min_length (the
#[arg(...)] on the min_length: usize field) to include a value_parser using
clap::value_parser!(usize).range(1..) so clap fails fast with a clear message.

coderabbitai · 2026-02-18T01:06:36Z

+    /// Minimum string length in bytes (must be >= 1)
+    #[arg(long, short = 'l', default_value_t = 4)]
+    min_length: usize,


🧹 Nitpick | 🔵 Trivial

Consider adding value_parser range constraint for better UX.

Currently min_length accepts any usize, including 0, and relies on config.validate() to reject invalid values. While this works, clap's range parser provides immediate feedback with a clearer error message:

#[arg(long, short = 'l', default_value_t = 4, value_parser = clap::value_parser!(usize).range(1..))] min_length: usize,

This fails fast at argument parsing rather than mid-pipeline. The current approach is functional but shifts validation deeper into the call stack.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/main.rs` around lines 47 - 49, Add a clap-level range validator to the min_length argument so invalid values (like 0) are rejected at parse time rather than later in config.validate(); update the struct field attribute for min_length (the #[arg(...)] on the min_length: usize field) to include a value_parser using clap::value_parser!(usize).range(1..) so clap fails fast with a clear message.

Copilot

Pull request overview

Copilot reviewed 38 out of 38 changed files in this pull request and generated 5 comments.

Copilot · 2026-02-18T01:09:53Z

+    let config = ExtractionConfig {
+        min_length: cli.min_length,
+        min_ascii_length: cli.min_length,
+        min_wide_length: cli.min_length,
+        ..ExtractionConfig::default()


--min-length is documented as a byte length, but the CLI maps it directly into min_wide_length, which is consumed as a UTF-16 code-unit length by the UTF-16 extractor. This makes UTF-16 filtering stricter than intended (eg min-length=4 bytes should correspond to 2 UTF-16 code units). Consider either (a) keeping separate CLI flags for byte vs UTF-16 code units, or (b) converting cli.min_length to min_wide_length = max(1, (cli.min_length + 1) / 2) and documenting the behavior.

Copilot · 2026-02-18T01:09:53Z

+pub fn deduplicate(
+    strings: Vec<FoundString>,
+    dedup_threshold: Option<usize>,
+    _preserve_all_occurrences: bool,
+) -> Vec<CanonicalString> {


The preserve_all_occurrences argument is currently unused (named _preserve_all_occurrences) but the docstring promises different behavior when it is false ("only store occurrence count instead of full metadata"). This is misleading for callers. Either implement the reduced-metadata mode, or remove the parameter and update the docs/API accordingly.

- Fix STRINGTABLE truncated entry bug: break instead of continue when byte_length exceeds remaining buffer, preventing misaligned reads - Fix stale doc paths after refactoring (ascii/mod.rs, pe_resources/mod.rs) - Rename test module internal_tests -> tests (project convention) - Rename misleading test (no_translation -> fallback_language_handling) - Replace no-op match with actual type assertion in metadata validation - Split pe_resources/tests.rs (632 lines) into tests/ directory module - Split dedup/tests.rs (560 lines) into tests/ directory module - Add test for STRINGTABLE truncated entry parsing behavior Signed-off-by: Keith Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

coderabbitai

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/extraction/ascii/mod.rs`:
- Around line 39-108: The AsciiExtractionConfig currently allows invalid values
(min_length == 0 or max_length < min_length); add validation to prevent these
invariants by implementing either a checked constructor and/or a validate()
method: update AsciiExtractionConfig::new (and/or implement TryFrom/try_new) to
return a Result<AsciiExtractionConfig, StringyError> that rejects min_length < 1
and rejects Some(max) where max < min_length, and update Default (or keep it) to
produce valid defaults; ensure any callers that construct AsciiExtractionConfig
(including use of Default or new) call the fallible constructor or validate() so
invalid configs are rejected upstream.

In `@src/extraction/dedup/tests/basic.rs`:
- Around line 102-112: The test relies on insertion order of occurrences; before
asserting positions, sort the occurrences vector by offset to make the test
deterministic: after obtaining canonical = deduplicate(strings, None, true) and
let occ = &canonical[0].occurrences, sort occ in-place by the offset field
(e.g., using sort_by_key or sort_unstable_by_key on occ.offset) and then perform
the existing assertions on occ[0] and occ[1]; this keeps the change localized to
the test (basic.rs) and references the canonical variable and deduplicate call.

In `@src/extraction/dedup/tests/scoring.rs`:
- Around line 201-250: Update the inline comment in the test_length_preservation
unit test to avoid ambiguity about .text.len(): either reword the comment on the
assertion from "4 characters" to "4 bytes (ASCII)" to be precise about
.text.len() returning byte length for ASCII, or change the assertion check to
use text.chars().count() if you intend to assert character count; reference the
test function name test_length_preservation and the calls to
found.text.len()/.length to locate where to adjust the comment or the check.
- Around line 31-35: The test uses magic numbers for scoring; define and export
module-level constants (e.g., OCCURRENCE_BONUS = 5, CROSS_SECTION_BONUS = 10,
MULTI_SOURCE_BONUS = 15) in the scoring module where calculate_combined_score()
is implemented, then import those constants into the test and replace the
hardcoded 5/10/15 in the deduplicate() test assertion (which checks
canonical[0].combined_score) with the named constants so the test reads base +
OCCURRENCE_BONUS + (confidence calculation) and stays in sync if weights change.

In `@src/extraction/pe_resources/mod.rs`:
- Around line 149-170: extract_resources currently swallows PE parse and
resource access errors and returns an empty Vec; change its signature to return
Result<Vec<ResourceMetadata>, StringyError>, map PeFile::from_bytes errors and
pe.resources() errors into appropriate StringyError variants from types.rs
(e.g., a parse/pe variant and a resource-access variant) with contextual
messages (do not use raw strings), and propagate enumerate_resources results via
? or map_err to a StringyError rather than unwrap_or_default; also update
callers to handle the Result return so they can distinguish "no resources" from
"parse failed".

In `@src/extraction/pe_resources/tests/mod.rs`:
- Around line 14-20: The helper function get_fixture_path is used by multiple
test submodules but is currently private; change its visibility to pub(crate) so
other tests in the crate can access it. Locate the function named
get_fixture_path in src/extraction/pe_resources/tests/mod.rs and add the
pub(crate) visibility modifier to its declaration, keeping the implementation
and signature otherwise unchanged so it remains a crate-scoped test helper.

---

Duplicate comments:
In `@src/extraction/pe_resources/string_tables.rs`:
- Around line 81-100: Change this function to return Result<Vec<FoundString>,
StringyError> instead of Vec so parse failures are not swallowed; replace each
early return like from PeFile::from_bytes, pe.resources(), resources.root(), and
root.get_dir(Name::Id(RT_STRING)) with propagation of a StringyError variant
(use the project's StringyError enum from types.rs) that includes context about
which resource parse step failed; ensure downstream callers of this function are
updated to handle the Result and preserve error context rather than mapping
failures to an empty Vec.
- Around line 68-80: The public function extract_string_table_strings currently
lacks a rustdoc example; either add a short usage example showing calling
extract_string_table_strings(data: &[u8]) and what it returns (Vec<FoundString>)
or, if this is internal-only, reduce its visibility to pub(crate) by changing
the signature from pub fn extract_string_table_strings(...) to pub(crate) fn
extract_string_table_strings(...); ensure any added example references the
FoundString type and a minimal input byte slice to satisfy the exported-API
documentation requirement.
- Around line 149-163: The FoundString construction in string_tables.rs leaves
provenance blank (offset=0, rva=None, section_weight=None); update the code that
builds FoundString so it derives the file offset and RVA from the pelite
resource entry or the containing section headers (use the resource entry's data
pointer/base and section header virtual_address to compute offset and rva),
populate original_text if available, set section to ".rsrc" and set
section_weight to 9.0 for resource strings, and keep source as
StringSource::ResourceString so downstream section-aware scoring can use these
fields (update the FoundString fields offset, rva, section_weight accordingly).

coderabbitai · 2026-02-18T02:02:09Z

+/// Configuration for ASCII string extraction
+///
+/// Controls minimum and maximum string length filtering. This structure serves as the
+/// foundation for future configuration expansion, including encoding preferences and
+/// tag filters as mentioned in the issue.
+///
+/// # Default Values
+///
+/// - `min_length`: 4 (standard minimum to reduce noise)
+/// - `max_length`: None (no upper limit by default)
+///
+/// # Examples
+///
+/// ```rust
+/// use stringy::extraction::ascii::AsciiExtractionConfig;
+///
+/// // Use default configuration
+/// let config = AsciiExtractionConfig::default();
+///
+/// // Custom minimum length
+/// let config = AsciiExtractionConfig::new(8);
+///
+/// // Custom minimum and maximum length
+/// let mut config = AsciiExtractionConfig::default();
+/// config.max_length = Some(256);
+/// ```
+#[derive(Debug, Clone)]
+pub struct AsciiExtractionConfig {
+    /// Minimum string length in bytes (default: 4)
+    pub min_length: usize,
+    /// Maximum string length in bytes (default: None, no limit)
+    pub max_length: Option<usize>,
+}
+
+impl Default for AsciiExtractionConfig {
+    fn default() -> Self {
+        Self {
+            min_length: 4,
+            max_length: None,
+        }
+    }
+}
+
+impl AsciiExtractionConfig {
+    /// Create a new AsciiExtractionConfig with custom minimum length
+    ///
+    /// # Arguments
+    ///
+    /// * `min_length` - Minimum string length in bytes
+    ///
+    /// # Returns
+    ///
+    /// New AsciiExtractionConfig with specified minimum length and default max_length (None)
+    ///
+    /// # Example
+    ///
+    /// ```rust
+    /// use stringy::extraction::ascii::AsciiExtractionConfig;
+    ///
+    /// let config = AsciiExtractionConfig::new(8);
+    /// assert_eq!(config.min_length, 8);
+    /// assert_eq!(config.max_length, None);
+    /// ```
+    pub fn new(min_length: usize) -> Self {
+        Self {
+            min_length,
+            max_length: None,
+        }
+    }
+}


⚠️ Potential issue | 🟡 Minor

Validate AsciiExtractionConfig invariants.
min_length can be 0 and max_length can be set below min_length, which can cause excessive output and false positives. Consider a validated constructor (or a validate() method) that returns a StringyError when min_length < 1 or max_length < min_length, or enforce these checks at the CLI boundary.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/extraction/ascii/mod.rs` around lines 39 - 108, The AsciiExtractionConfig currently allows invalid values (min_length == 0 or max_length < min_length); add validation to prevent these invariants by implementing either a checked constructor and/or a validate() method: update AsciiExtractionConfig::new (and/or implement TryFrom/try_new) to return a Result<AsciiExtractionConfig, StringyError> that rejects min_length < 1 and rejects Some(max) where max < min_length, and update Default (or keep it) to produce valid defaults; ensure any callers that construct AsciiExtractionConfig (including use of Default or new) call the fallible constructor or validate() so invalid configs are rejected upstream.

coderabbitai · 2026-02-18T02:02:10Z

+#[test]
+fn test_length_preservation() {
+    // Test that length is preserved correctly for UTF-16 strings
+    let strings = vec![
+        FoundString {
+            text: "Test".to_string(),
+            original_text: None,
+            encoding: Encoding::Utf16Le,
+            offset: 0x100,
+            rva: Some(0x1000),
+            section: None,
+            length: 8, // 4 characters * 2 bytes = 8 bytes
+            tags: vec![],
+            score: 10,
+            section_weight: None,
+            semantic_boost: None,
+            noise_penalty: None,
+            source: StringSource::SectionData,
+            confidence: 0.8,
+        },
+        FoundString {
+            text: "Test".to_string(),
+            original_text: None,
+            encoding: Encoding::Utf16Le,
+            offset: 0x200,
+            rva: Some(0x2000),
+            section: None,
+            length: 8,
+            tags: vec![],
+            score: 15,
+            section_weight: None,
+            semantic_boost: None,
+            noise_penalty: None,
+            source: StringSource::SectionData,
+            confidence: 0.9,
+        },
+    ];
+
+    let canonical = deduplicate(strings, None, true);
+    assert_eq!(canonical.len(), 1);
+    assert_eq!(canonical[0].occurrences[0].length, 8);
+    assert_eq!(canonical[0].occurrences[1].length, 8);
+
+    // Verify to_found_string() uses stored length, not text.len()
+    let found = canonical[0]
+        .to_found_string()
+        .expect("canonical string with occurrences should convert");
+    assert_eq!(found.length, 8); // Should be 8 bytes, not 4 (text.len())
+    assert_eq!(found.text.len(), 4); // But text is still 4 characters
+}


🧹 Nitpick | 🔵 Trivial

LGTM - length preservation test, minor comment clarification.

Good test for UTF-16 byte-length preservation through the dedup pipeline.

Line 249: The comment says "4 characters" but text.len() returns bytes. For ASCII "Test", these happen to be equal. Consider rewording to "4 bytes (ASCII)" for precision, or use text.chars().count() if you intend to count characters.

Clarify comment

- assert_eq!(found.text.len(), 4); // But text is still 4 characters + assert_eq!(found.text.len(), 4); // text.len() returns bytes; "Test" is 4 ASCII bytes

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

#[test]

fn test_length_preservation() {

// Test that length is preserved correctly for UTF-16 strings

let strings = vec![

FoundString {

text: "Test".to_string(),

original_text: None,

encoding: Encoding::Utf16Le,

offset: 0x100,

rva: Some(0x1000),

section: None,

length: 8, // 4 characters * 2 bytes = 8 bytes

tags: vec![],

score: 10,

section_weight: None,

semantic_boost: None,

noise_penalty: None,

source: StringSource::SectionData,

confidence: 0.8,

},

FoundString {

text: "Test".to_string(),

original_text: None,

encoding: Encoding::Utf16Le,

offset: 0x200,

rva: Some(0x2000),

section: None,

length: 8,

tags: vec![],

score: 15,

section_weight: None,

semantic_boost: None,

noise_penalty: None,

source: StringSource::SectionData,

confidence: 0.9,

},

];

let canonical = deduplicate(strings, None, true);

assert_eq!(canonical.len(), 1);

assert_eq!(canonical[0].occurrences[0].length, 8);

assert_eq!(canonical[0].occurrences[1].length, 8);

// Verify to_found_string() uses stored length, not text.len()

let found = canonical[0]

.to_found_string()

.expect("canonical string with occurrences should convert");

assert_eq!(found.length, 8); // Should be 8 bytes, not 4 (text.len())

assert_eq!(found.text.len(), 4); // But text is still 4 characters

}

#[test]

fn test_length_preservation() {

// Test that length is preserved correctly for UTF-16 strings

let strings = vec![

FoundString {

text: "Test".to_string(),

original_text: None,

encoding: Encoding::Utf16Le,

offset: 0x100,

rva: Some(0x1000),

section: None,

length: 8, // 4 characters * 2 bytes = 8 bytes

tags: vec![],

score: 10,

section_weight: None,

semantic_boost: None,

noise_penalty: None,

source: StringSource::SectionData,

confidence: 0.8,

},

FoundString {

text: "Test".to_string(),

original_text: None,

encoding: Encoding::Utf16Le,

offset: 0x200,

rva: Some(0x2000),

section: None,

length: 8,

tags: vec![],

score: 15,

section_weight: None,

semantic_boost: None,

noise_penalty: None,

source: StringSource::SectionData,

confidence: 0.9,

},

];

let canonical = deduplicate(strings, None, true);

assert_eq!(canonical.len(), 1);

assert_eq!(canonical[0].occurrences[0].length, 8);

assert_eq!(canonical[0].occurrences[1].length, 8);

// Verify to_found_string() uses stored length, not text.len()

let found = canonical[0]

.to_found_string()

.expect("canonical string with occurrences should convert");

assert_eq!(found.length, 8); // Should be 8 bytes, not 4 (text.len())

assert_eq!(found.text.len(), 4); // text.len() returns bytes; "Test" is 4 ASCII bytes

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/extraction/dedup/tests/scoring.rs` around lines 201 - 250, Update the inline comment in the test_length_preservation unit test to avoid ambiguity about .text.len(): either reword the comment on the assertion from "4 characters" to "4 bytes (ASCII)" to be precise about .text.len() returning byte length for ASCII, or change the assertion check to use text.chars().count() if you intend to assert character count; reference the test function name test_length_preservation and the calls to found.text.len()/.length to locate where to adjust the comment or the check.

coderabbitai · 2026-02-18T02:02:10Z

+pub fn extract_resources(data: &[u8]) -> Vec<ResourceMetadata> {
+    // Attempt to parse PE using pelite
+    let pe = match PeFile::from_bytes(data) {
+        Ok(pe) => pe,
+        Err(_) => {
+            // Graceful degradation: return empty vec on parse error
+            // This allows PE parsing to succeed even if resource extraction fails
+            return Vec::new();
+        }
+    };
+
+    // Get resource directory
+    let resources = match pe.resources() {
+        Ok(resources) => resources,
+        Err(_) => {
+            // No resource directory or error accessing it - not an error condition
+            return Vec::new();
+        }
+    };
+
+    // Enumerate all resources - handle errors gracefully
+    enumerate_resources(&resources).unwrap_or_default()


⚠️ Potential issue | 🟠 Major

Return errors instead of empty Vec for resource parsing failures.
extract_resources discards parse failures and returns an empty vector, which hides errors from callers. Please return Result<Vec, StringyError> and propagate context so callers can distinguish "no resources" from "parse failed".

As per coding guidelines, "Use comprehensive error handling with a StringyError enum across the codebase" and "Use StringyError variants from types.rs for error handling, never raw strings".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/extraction/pe_resources/mod.rs` around lines 149 - 170, extract_resources currently swallows PE parse and resource access errors and returns an empty Vec; change its signature to return Result<Vec<ResourceMetadata>, StringyError>, map PeFile::from_bytes errors and pe.resources() errors into appropriate StringyError variants from types.rs (e.g., a parse/pe variant and a resource-access variant) with contextual messages (do not use raw strings), and propagate enumerate_resources results via ? or map_err to a StringyError rather than unwrap_or_default; also update callers to handle the Result return so they can distinguish "no resources" from "parse failed".

coderabbitai · 2026-02-18T02:02:10Z

+/// Helper to get fixture path
+fn get_fixture_path(name: &str) -> std::path::PathBuf {
+    Path::new(env!("CARGO_MANIFEST_DIR"))
+        .join("tests")
+        .join("fixtures")
+        .join(name)
+}


⚠️ Potential issue | 🟡 Minor

Use pub(crate) for shared test helpers.
get_fixture_path is used across test submodules; mark it pub(crate) to align with the test helper visibility rule.

Proposed change

-fn get_fixture_path(name: &str) -> std::path::PathBuf { +pub(crate) fn get_fixture_path(name: &str) -> std::path::PathBuf {

As per coding guidelines, "Use pub(crate) visibility for internal test helpers".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

/// Helper to get fixture path

fn get_fixture_path(name: &str) -> std::path::PathBuf {

Path::new(env!("CARGO_MANIFEST_DIR"))

.join("tests")

.join("fixtures")

.join(name)

}

/// Helper to get fixture path

pub(crate) fn get_fixture_path(name: &str) -> std::path::PathBuf {

Path::new(env!("CARGO_MANIFEST_DIR"))

.join("tests")

.join("fixtures")

.join(name)

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/extraction/pe_resources/tests/mod.rs` around lines 14 - 20, The helper function get_fixture_path is used by multiple test submodules but is currently private; change its visibility to pub(crate) so other tests in the crate can access it. Locate the function named get_fixture_path in src/extraction/pe_resources/tests/mod.rs and add the pub(crate) visibility modifier to its declaration, keeping the implementation and signature otherwise unchanged so it remains a crate-scoped test helper.

Address 8 pre-existing design issues surfaced during bot review of PR #135: 1. Remove dual encoding fields: delete `encodings` from ExtractionConfig, keep only `enabled_encodings`. Eliminates confusing OR-checks throughout basic_extractor.rs. 2. Extract shared logic from extract/extract_canonical: factor out the duplicated section-sorting, filtering, symbol-inclusion, and semantic enrichment pipeline into a private `collect_all_strings()` helper. 3. Fix O(n^2) dedup in Auto byte order mode: replace `strings.iter_mut().find()` with HashMap-based O(1) lookup in utf16/mod.rs. 4. Remove unused `_preserve_all_occurrences` parameter from `deduplicate()` signature, ExtractionConfig struct, and all callsites. 5. Remove broken byte order consistency heuristic: the `check_byte_order_consistency()` function operated on decoded u16 values which are identical regardless of original byte order, making LE always score 1.0 and BE always 0.0. Redistribute the 0.1 weight to valid_unicode (0.35) and printable (0.45) heuristics. 6. Fix char_count semantics for surrogate pairs: change `char_count += 1` to `char_count += consumed_units` in utf16/extraction.rs so min/max length checks count UTF-16 code units, not scalar values. 7. Replace `Result<_, ()>` with `StringyError` in UTF-16 decode functions: use `StringyError::ParseError` for the 3 decode functions in utf16/extraction.rs, removing `#[allow(clippy::result_unit_err)]`. 8. Add noncharacter validation in surrogate-pair path: reject U+xFFFE and U+xFFFF in supplementary planes (planes 1-16) in utf16/validation.rs, matching the BMP noncharacter check. Signed-off-by: Kevin Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Copilot

Pull request overview

Copilot reviewed 44 out of 44 changed files in this pull request and generated 2 comments.

Copilot · 2026-02-18T02:13:48Z

+    /// Minimum length for ASCII strings (default: 4, same as min_length)
+    pub min_ascii_length: usize,
+    /// Minimum length for UTF-16 strings (default: 3, for future use)
+    pub min_wide_length: usize,
+    /// Which encodings to extract (default: ASCII, UTF-8)


min_wide_length is documented as a general "minimum length" but is currently used as Utf16ExtractionConfig::min_length (UTF-16 code units) in basic_extractor.rs. The current validate() logic also compares max_length (bytes) against min_wide_length directly, which is inconsistent if min_wide_length is in UTF-16 units. Please clarify the unit for min_wide_length in the doc comment and adjust validation/conversion accordingly (eg validate against max_length / 2 if it is UTF-16 code units, or convert it to code units at the use-site if you want to keep it byte-based).

Copilot · 2026-02-18T02:13:48Z

+        if self.max_length < self.min_wide_length {
+            return Err(StringyError::ConfigError(format!(
+                "max_length ({}) must be >= min_wide_length ({})",
+                self.max_length, self.min_wide_length
+            )));


validate() currently checks self.max_length < self.min_wide_length, but max_length is in bytes while min_wide_length is used as the UTF-16 min_length (code units). This can both allow invalid UTF-16 configs (if min_wide_length is units) and reject valid ones (if callers treat it as bytes). Align the units here (eg compare max_length / 2 against min_wide_length if min_wide_length is UTF-16 units).

Add .github/actionlint.yml config to suppress lint errors from release.yml, which is auto-generated by cargo-dist and cannot be hand-edited. Signed-off-by: Kevin Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

The cspell tool was configured but never installed in mise, so lint-spell always failed. The project's domain-heavy jargon (binary analysis terms, hex patterns) required 70+ exception words, making the maintenance cost outweigh the marginal value. Typos in identifiers are caught at compile time, and documentation quality is covered by CodeRabbit reviews. Signed-off-by: Kevin Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Copilot

Pull request overview

Copilot reviewed 46 out of 46 changed files in this pull request and generated 1 comment.

mergify · 2026-02-18T02:46:29Z

#135) ## Summary - **CI**: Use `just` recipes in CI workflows instead of raw cargo commands; fix cargo-dist invocation (use `dist plan` binary via mise instead of `cargo dist check` subcommand) - **Extraction refactor**: Split 6 oversized files (all >500 lines) in `src/extraction/` into focused module directories -- total lines unchanged, all non-test files now under 500 lines - **CLI**: Wire up the extraction pipeline in `main.rs` with format selection (`--format table|json|yara`) and minimum length filtering (`--min-length`) - **Design debt cleanup**: Resolve 8 pre-existing design issues surfaced during bot review ## Details ### CI changes - `ci.yml`: test, test-cross-platform, and coverage jobs now use `just test-ci`, `just build-release`, `just coverage` - `security.yml`: uses `just deny`, `just outdated`, `just dist-check` (fixes cargo-dist not found) - `docs.yml`: uses `just docs-build` for combined mdbook + rustdoc build - `justfile`: updated recipes with `--all-features`, added `outdated` recipe, fixed `dist-check` to use `dist plan` ### Extraction module splits | Original file | Lines | Split into | |---|---|---| | `extraction/mod.rs` | 1,576 | `mod.rs` (154) + `traits.rs` + `helpers.rs` + `basic_extractor.rs` + `tests.rs` | | `extraction/filters.rs` | 702 | `filters/mod.rs` + `filters/implementations.rs` + `filters/tests.rs` | | `extraction/ascii.rs` | 832 | `ascii/mod.rs` + `ascii/extraction.rs` + `ascii/tests.rs` | | `extraction/utf16.rs` | 1,273 | `utf16/mod.rs` + `utf16/config.rs` + `utf16/validation.rs` + `utf16/confidence.rs` + `utf16/extraction.rs` + `utf16/tests.rs` | | `extraction/dedup.rs` | 838 | `dedup/mod.rs` + `dedup/scoring.rs` + `dedup/tests.rs` | | `extraction/pe_resources.rs` | 1,449 | `pe_resources/mod.rs` + `pe_resources/detection.rs` + `pe_resources/version_info.rs` + `pe_resources/string_tables.rs` + `pe_resources/manifests.rs` + `pe_resources/tests.rs` | ### CLI pipeline - `main.rs`: reads binary, detects format, parses container, extracts strings, formats output - 4 new integration tests in `tests/integration_cli.rs` ### Design debt cleanup (8 items) 1. **Remove dual encoding fields**: Deleted `encodings` from `ExtractionConfig`, kept only `enabled_encodings`. Eliminated confusing OR-checks in `basic_extractor.rs`. 2. **Extract shared logic**: Factored duplicated section-sorting/filtering/symbol-inclusion pipeline from `extract()` and `extract_canonical()` into `collect_all_strings()` helper. 3. **Fix O(n^2) dedup in Auto mode**: Replaced `strings.iter_mut().find()` with HashMap-based O(1) lookup in `utf16/mod.rs`. 4. **Remove unused parameter**: Removed `_preserve_all_occurrences` from `deduplicate()` signature, `ExtractionConfig`, and all callsites. 5. **Remove broken byte order heuristic**: `check_byte_order_consistency()` operated on decoded u16 values (identical regardless of byte order), making LE always score 1.0 and BE always 0.0. Removed and redistributed weight. 6. **Fix char_count semantics**: Changed `char_count += 1` to `+= consumed_units` in UTF-16 extraction so min/max length checks count code units, not scalar values. 7. **Replace `Result<_, ()>`**: Changed 3 UTF-16 decode functions to return `StringyError::ParseError`, removed `#[allow(clippy::result_unit_err)]`. 8. **Add surrogate noncharacter validation**: Reject U+xFFFE/U+xFFFF in supplementary planes (planes 1-16), matching existing BMP noncharacter check. ## Test plan - [x] `just ci-check` passes locally (478 tests, zero clippy warnings) - [ ] CI passes on all platforms (ubuntu, macos, windows) - [x] CodeRabbit/Copilot review feedback addressed - [x] Pre-existing design debt resolved (8 items) Generated with [Claude Code](https://claude.com/claude-code) --------- Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

unclesp1d3r added 14 commits February 16, 2026 20:39

refactor(extraction): extract traits and config to traits.rs

0e22f4d

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

refactor(extraction): extract helper functions to helpers.rs

1996158

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

refactor(extraction): extract tests to tests.rs

9fd8ce3

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

test(cli): add failing integration test for binary file processing

bd13f29

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

feat(cli): wire up extraction pipeline in main.rs

5a1961f

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

test(cli): add integration tests for output formats and error handling

8bb9e8f

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Copilot AI review requested due to automatic review settings February 17, 2026 15:11

coderabbitai Bot added documentation Improvements or additions to documentation enhancement New feature or request labels Feb 17, 2026

Copilot started reviewing on behalf of unclesp1d3r February 17, 2026 15:17 View session

coderabbitai Bot reviewed Feb 17, 2026

View reviewed changes

Copilot AI reviewed Feb 17, 2026

View reviewed changes

unclesp1d3r added 2 commits February 17, 2026 10:23

Copilot AI review requested due to automatic review settings February 18, 2026 00:45

Copilot started reviewing on behalf of unclesp1d3r February 18, 2026 00:46 View session

Copilot AI reviewed Feb 18, 2026

View reviewed changes

fix(cli): parse min_length as usize to avoid u64 truncation

55518b1

Use usize directly for min_length CLI field instead of u64 with cast. config.validate() handles the >= 1 constraint. Signed-off-by: Keith Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Copilot AI review requested due to automatic review settings February 18, 2026 01:02

Copilot started reviewing on behalf of unclesp1d3r February 18, 2026 01:02 View session

coderabbitai Bot reviewed Feb 18, 2026

View reviewed changes

Copilot AI reviewed Feb 18, 2026

View reviewed changes

unclesp1d3r self-assigned this Feb 18, 2026

unclesp1d3r changed the title ~~refactor: split oversized extraction modules and wire up CLI pipeline~~ refactor(extraction): split oversized modules and wire up CLI pipeline Feb 18, 2026

coderabbitai Bot reviewed Feb 18, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings February 18, 2026 02:08

Copilot started reviewing on behalf of unclesp1d3r February 18, 2026 02:08 View session

Copilot AI reviewed Feb 18, 2026

View reviewed changes

unclesp1d3r added 2 commits February 17, 2026 21:19

Copilot AI review requested due to automatic review settings February 18, 2026 02:37

Copilot started reviewing on behalf of unclesp1d3r February 18, 2026 02:38 View session

Copilot AI reviewed Feb 18, 2026

View reviewed changes

Comment thread justfile

unclesp1d3r added the lgtm This PR has been approved by a maintainer label Feb 18, 2026

mergify Bot added the queued label Feb 18, 2026

mergify Bot added a commit that referenced this pull request Feb 18, 2026

Merge of #135

fd99943

mergify Bot mentioned this pull request Feb 18, 2026

merge queue: embarking main (bd84315) and #135 together #136

Closed

51 tasks

mergify Bot merged commit f976cf4 into main Feb 18, 2026
33 of 34 checks passed

mergify Bot deleted the fix_dist_issues branch February 18, 2026 02:50

mergify Bot removed the queued label Feb 18, 2026

This was referenced Mar 10, 2026

feat: Stringy v1.0 -- pipeline architecture, CLI hardening, and comprehensive testing #141

Merged

feat: create classifier trait and classification pipeline architecture #140

Closed

docs: Improve README, CLI UX, and documentation accuracy #147

Merged

Uh oh!

Conversation

unclesp1d3r commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Details

CI changes

Extraction module splits

CLI pipeline

Design debt cleanup (8 items)

Test plan

Uh oh!

coderabbitai Bot commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (2 warnings)

Uh oh!

mergify Bot commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merge Protections

🟢 Enforce conventional commit

🟢 CI must pass

🟢 Do not merge outdated PRs

Uh oh!

codecov Bot commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

mergify Bot commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🧪 CI Insights

🟢 All jobs passed!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

unclesp1d3r commented Feb 17, 2026 •

edited

Loading

coderabbitai Bot commented Feb 17, 2026 •

edited

Loading

mergify Bot commented Feb 17, 2026 •

edited

Loading

codecov Bot commented Feb 17, 2026 •

edited

Loading

mergify Bot commented Feb 17, 2026 •

edited

Loading

mergify Bot commented Feb 18, 2026 •

edited

Loading