refactor(extraction): split oversized modules and wire up CLI pipeline#135
Conversation
Centralize CI commands in justfile as single source of truth: - Replace raw cargo commands with just recipes in test, test-cross-platform, coverage, security, and docs workflows - Fix cargo-dist issue in security workflow by using `dist plan` via mise exec instead of `cargo dist check` (subcommand not found) - Add --all-features to test-ci, build-release, and coverage recipes - Add outdated recipe for cargo-outdated checks - Fix dist-check recipe to use `dist plan` (check subcommand does not exist in cargo-dist 0.30.3) Quality and MSRV jobs unchanged (use dtolnay Rust, not mise). Signed-off-by: Kevin Melton <kevin@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Move the ~400-line `impl StringExtractor for BasicExtractor` block from mod.rs into its own file, reducing mod.rs from 1038 to 630 lines. The new basic_extractor.rs (421 lines) contains the core extraction engine that orchestrates ASCII, UTF-8, and UTF-16 string extraction across binary sections. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Convert the 702-line filters.rs into a module directory with three focused files: - mod.rs (193 lines): FilterContext, NoiseFilter trait, CompositeNoiseFilter - implementations.rs (394 lines): all 6 filter structs and NoiseFilter impls - tests.rs (131 lines): all test code All public re-exports remain unchanged so the API is not affected. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Convert src/extraction/ascii.rs (832 lines) into a module directory with three focused files: - mod.rs (147 lines): AsciiExtractionConfig, is_printable_ascii(), re-exports - extraction.rs (316 lines): extract_ascii_strings(), extract_from_section() - tests.rs (378 lines): all unit tests All files are under the 500-line limit. All 473 tests pass. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Convert the 1,273-line src/extraction/utf16.rs into a module directory with six focused files, each under 500 lines: - mod.rs (122 lines): ByteOrder enum, extract_utf16_strings orchestrator, re-exports - config.rs (97 lines): Utf16ExtractionConfig struct and constructors - validation.rs (281 lines): UTF-16 sequence validation, printability checks, null patterns - confidence.rs (201 lines): Confidence scoring heuristics (ASCII ratio, printable ratio, etc.) - extraction.rs (380 lines): Core extraction functions and section-aware extraction - tests.rs (238 lines): All unit tests All public exports remain identical. No functional changes. Signed-off-by: Kyle Melton <kyle@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Convert the 838-line dedup.rs into a module directory with three files: - mod.rs (237 lines): CanonicalString, StringOccurrence structs, deduplicate(), merge_tags(), found_string_to_occurrence(), CanonicalString::to_found_string() - scoring.rs (56 lines): calculate_combined_score() - tests.rs (555 lines): All test code Public API and re-exports remain unchanged. Signed-off-by: Kevin Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Convert the largest file (1,449 lines) into a module directory with six focused files, each under 500 lines: - mod.rs (230): constants, entry points, orchestrator, decode_utf16le - detection.rs (208): detect_version_info, detect_string_tables, detect_manifests - version_info.rs (76): extract_version_info_strings - string_tables.rs (200): parse_string_table_block, extract_string_table_strings - manifests.rs (201): detect_manifest_encoding, decode_manifest, extract_manifest_strings - tests.rs (594): all test code All 473 tests pass. Public API unchanged. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Document the CI workflow patterns (dtolnay vs mise, just recipes as single source of truth) and Mergify behavior in AGENTS.md. Add the extraction refactor and CLI implementation plan for project history. Signed-off-by: Keith Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis pull request implements a comprehensive refactor of the extraction module architecture, restructuring flat extraction modules (ASCII, UTF-16, PE resources, deduplication) into modular subdirectories with trait-based abstractions. It introduces a BasicExtractor implementation, noise filtering framework, semantic enrichment helpers, and CLI wiring for binary analysis with multiple output formats (Table, JSON, YARA), complemented by CI/workflow automation updates and integration test coverage. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Rationale: This refactor spans 50+ files with heterogeneous changes: comprehensive module restructuring (ASCII/UTF-16/PE resources/dedup), trait-based architecture design, complex extraction pipelines with multi-encoding support and configurable filtering, new noise filtering framework with six sub-filters and scoring heuristics, PE resource modularization requiring Pelite integration understanding, and CLI wiring. The density of interconnected logic—section-aware metadata population, confidence computation across multiple heuristics, threshold-based deduplication with bonus scoring, surrogate-pair handling in UTF-16—requires careful verification of correctness, false-positive reduction in classification, and proper error handling in parsing/decoding paths. Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 6 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🟢 CI must passWonderful, this rule succeeded.All CI checks must pass. This protection prevents manual merges that bypass the merge queue.
🟢 Do not merge outdated PRsWonderful, this rule succeeded.Make sure PRs are within 10 commits of the base branch before merging
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
🧪 CI InsightsHere's what we observed from your CI run for 9c5da41. 🟢 All jobs passed!But CI Insights is watching 👀 |
There was a problem hiding this comment.
Actionable comments posted: 33
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/extraction/filters/implementations.rs (1)
297-302: 🧹 Nitpick | 🔵 TrivialRedundant cast -
totalis alreadyf32.Line 299 has
(total as usize) as f32buttotalis already declared asf32on line 278. This double-cast is unnecessary.Suggested fix
- let min_pattern_len_for_3_reps = ((total as usize) as f32 / 3.0).ceil() as usize; + let min_pattern_len_for_3_reps = (total / 3.0).ceil() as usize;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/extraction/filters/implementations.rs` around lines 297 - 302, The expression computing min_pattern_len_for_3_reps contains redundant casts of total; since total is already an f32, replace the double-cast `((total as usize) as f32 / 3.0).ceil() as usize` with a direct floating calculation like `(total / 3.0).ceil() as usize` in the block that uses variables total and max_pattern_len so min_pattern_len_for_3_reps is computed without needless casting.src/extraction/mod.rs (1)
141-151:⚠️ Potential issue | 🟠 MajorAddress re-export naming inconsistency for
extract_from_section.The re-export in
mod.rsaliases UTF-16'sextract_from_sectiontoextract_utf16_from_section(line 149) but leaves ASCII's version unaliased (line 141). However, test code imports directly from submodules (e.g.,test_utf16_extraction.rsimportsextract_from_sectionfrom the UTF-16 module), bypassing the top-level aliases entirely. This defeats the aliasing intent and violates the guideline requiring imports fromstringy::extraction, not nested paths.Ensure consistent re-export naming: either alias both functions (ASCII as
extract_ascii_from_section, UTF-16 asextract_utf16_from_section), or export both asextract_from_sectionwith qualified module paths. Update test imports to use top-level re-exports viastringy::extractionto align with project conventions.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/extraction/mod.rs` around lines 141 - 151, The top-level re-export naming is inconsistent: ascii::extract_from_section is exported as extract_from_section while utf16::extract_from_section is aliased to extract_utf16_from_section; update mod.rs to make these symmetric and adjust tests to import from stringy::extraction. Either (recommended) alias both exports to explicit names (extract_ascii_from_section and extract_utf16_from_section) by changing the ascii re-export to export extract_from_section as extract_ascii_from_section and keep the utf16 alias extract_utf16_from_section, or export both as extract_from_section within their modules and update tests to import via the top-level path (stringy::extraction::{extract_ascii_from_section, extract_utf16_from_section} or stringy::extraction::extract_from_section if you choose unambiguous names); ensure tests like test_utf16_extraction.rs import from stringy::extraction instead of nested module paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@AGENTS.md`:
- Around line 70-77: Update the first bullet under "CI Architecture" to state
that CI workflows use `just` recipes as the single source of truth except for
Quality/MSRV jobs; explicitly mention that Quality/MSRV jobs use
`dtolnay/rust-toolchain` and must not use `just` recipes because they rely on
`mise exec --` which conflicts, so readers immediately see the exception before
the separate bullet about Quality/MSRV.
In `@docs/plans/2026-02-16-extraction-refactor-and-cli.md`:
- Around line 59-63: Several fenced code blocks (including the one containing
the git add/git commit snippet) lack language specifiers; update those fenced
blocks to include an appropriate shell language tag (e.g., ```bash or ```sh) so
tooling and syntax highlighting work correctly, ensuring you add the specifier
to both opening fences and any other command/code blocks referenced in this
document.
In `@src/extraction/ascii/extraction.rs`:
- Around line 120-169: The end-of-buffer branch duplicates FoundString
construction; extract a helper function or closure (e.g., make_found_string)
that takes the taken bytes, start, len and returns a FoundString, then replace
the duplicated blocks in the max_length branch and the else branch (and any
other places in this loop) with strings.push(make_found_string(bytes, start,
len)); ensure the helper sets the same fields used now (text from
String::from_utf8, Encoding::Ascii, offset from current_string_start, length,
source StringSource::SectionData, confidence 1.0, etc.) and use
std::mem::take(&mut current_string_bytes) only once per invocation.
In `@src/extraction/ascii/tests.rs`:
- Around line 277-293: In test_extract_from_section_rva_calculation replace the
conditional guard so the test fails if no strings were extracted: after calling
create_test_section, AsciiExtractionConfig::default, and extract_from_section,
assert that strings is non-empty (e.g. assert!(!strings.is_empty()) or
assert_eq!(strings.len(), 1)) and then perform the existing assertions on
strings[0].offset and strings[0].rva; this removes the silent pass and ensures
extract_from_section actually produced the expected string.
In `@src/extraction/basic_extractor.rs`:
- Around line 19-125: The extract() and extract_canonical() implementations in
BasicExtractor duplicate section sorting, filtering, symbol inclusion and
semantic enrichment; extract that shared logic into a private helper fn (e.g.,
extract_all_strings(&self, data: &[u8], container_info: &ContainerInfo, config:
&ExtractionConfig) -> Result<Vec<FoundString>>) which performs the section sort
(same key logic), applies the Debug/code filters, calls
self.extract_from_section for each section, includes import/export symbols when
config.include_symbols is true, and calls apply_semantic_enrichment; then have
extract() call this helper and only run deduplication via
deduplicate(...)/to_found_string conversion when config.enable_deduplication,
and have extract_canonical() call the same helper but return the
canonical/deduped output as needed.
- Around line 277-278: The code is checking both config.encodings and
config.enabled_encodings (e.g., in the ascii_enabled and utf8_enabled
computations using Encoding::Ascii/Encoding::Utf8), indicating legacy dual-field
handling; consolidate to a single source of truth by standardizing all checks to
one field (pick config.encodings or rename to config.enabled_encodings
consistently), update occurrences where ascii_enabled and utf8_enabled are
computed to reference only the canonical field, and add a brief inline comment
like "// TODO: deprecate enabled_encodings" next to the kept field to signal
migration; ensure any helper functions or downstream uses (the variables
ascii_enabled, utf8_enabled) are updated to the single-field logic so behavior
is unchanged.
- Around line 261-264: The current overflow handling for computing end_offset
uses checked_add(...).unwrap_or(data.len()).min(data.len()) which is defensive
but can be clearer; change the computation in the end_offset assignment to use
saturating_add on section_offset (i.e.,
section_offset.saturating_add(section_size)) and then .min(data.len()) to
express explicit saturation semantics for end_offset (referencing end_offset,
section_offset, section_size, and data.len()).
In `@src/extraction/dedup/mod.rs`:
- Around line 212-236: to_found_string currently index-accesses
self.occurrences[0] which panics if CanonicalString.occurrences is empty; change
to use self.occurrences.first() and make the function return Option<FoundString>
(or Result) so it can return None when there are no occurrences, compute
max_confidence by iterating over occurrences (or return None early), and build
the FoundString from the first occurrence when present (preserving fields like
text, original_text, encoding, offset, rva, section, length, tags = merged_tags,
score = combined_score, source, confidence = max_confidence); update callers of
to_found_string accordingly.
- Around line 228-237: The code unnecessarily clones `fs` before calling
`found_string_to_occurrence(fs.clone())` inside the iterator consumer; since
`fs` is already moved by `into_iter()` you should pass `fs` directly to
`found_string_to_occurrence` to avoid an extra clone—locate the iterator that
calls `into_iter()` and replace `found_string_to_occurrence(fs.clone())` with
`found_string_to_occurrence(fs)` (the surrounding context includes identifiers
like found_string_to_occurrence, into_iter, fs, merged_tags, and
combined_score).
In `@src/extraction/dedup/tests.rs`:
- Around line 17-27: The UTF-16 byte-length computation in the match for
Encoding (where length is set from text.chars().count() * 2) undercounts
characters that need surrogate pairs; either update the calculation to use
text.encode_utf16().count() * 2 to count UTF-16 code units correctly, or keep
the current code but change the comment to explicitly state it assumes BMP-only
characters for these tests; modify the match arm and the inline comment near the
Encoding handling (the length variable and the text.* call) accordingly.
- Line 6: Add an inline justification comment for the
#[allow(clippy::too_many_arguments)] attribute on the test helper so it
documents why many parameters are necessary (e.g., "helper constructs test data
with full metadata; multiple args required for clarity and coverage"). Place the
comment immediately after the attribute on the same line or directly above it,
referencing the existing attribute #[allow(clippy::too_many_arguments)] so
reviewers can see the rationale for the exception.
In `@src/extraction/helpers.rs`:
- Around line 124-125: Add an inline justification comment for the
#[allow(clippy::collapsible_if)] attribute where it appears (the conditional
checking current_string_bytes.len() against min_length and max_length and the
similar block later) by appending a short rationale on the same line or
immediately after the attribute explaining why collapsing the ifs would break
the borrow/ownership intent (e.g., "allow because collapsing the ifs would
change borrow lifetimes and cause borrow-checker errors"); update both
occurrences to include this inline justification.
In `@src/extraction/pe_resources/manifests.rs`:
- Around line 68-98: The inline UTF-16BE decoding in decode_manifest should be
extracted into a shared helper (e.g., pub(crate) fn decode_utf16be(bytes: &[u8])
-> Result<String>) alongside the existing decode_utf16le in the parent
pe_resources module; implement the same logic currently in the match arm
(chunks_exact(2) -> u16::from_be_bytes -> String::from_utf16 -> trim nulls ->
map Err to StringyError::EncodingError) and then replace the UTF-16BE arm in
decode_manifest to call decode_utf16be(data) for consistency and reuse.
In `@src/extraction/pe_resources/tests.rs`:
- Around line 107-113: The test function
test_detect_version_info_empty_directory is empty and provides no coverage;
either remove it or implement it by creating a minimal fixture (an empty
RT_VERSION directory) and asserting the behavior of the code path (e.g., call
extract_resources(...) or the detect_version_info(...) helper used by
pe_resources and assert it returns no entries/does not panic). Update the test
to set up the empty directory, invoke extract_resources/detect_version_info, and
add an assertion on the returned result (empty list/Ok value) or delete the
empty test; apply the same fix to the other empty tests in tests.rs.
- Around line 1-594: The test file is getting large and should be split for
maintainability; extract resource-detection tests into detection_tests.rs and
string-extraction tests into extraction_tests.rs. Move tests that primarily
exercise extract_resources, VERSIONINFO/STRINGTABLE/MANIFEST detection and
fixture-based resource metadata checks into detection_tests.rs (any tests
referencing extract_resources, ResourceType, ResourceMetadata assertions), and
move Phase 2 string-focused tests (decode_utf16le, detect_manifest_encoding,
extract_version_info_strings, extract_string_table_strings,
extract_manifest_strings, extract_resource_strings and their fixture-based
variants) into extraction_tests.rs; keep or move the shared get_fixture_path
helper into a small tests/mod.rs or a shared utils file and update the tests
module declarations to pub use/declare those helpers so both new files can
access it. Finally, update the tests mod (or Cargo test configuration) to
include the two new files so cargo test still discovers all tests.
- Around line 99-105: The test computes `_has_version` from
`resources.iter().any(|r| matches!(r.resource_type, ResourceType::VersionInfo))`
but never asserts it; change this to an assertion that verifies absence (e.g.,
`assert!(!resources.iter().any(|r| matches!(r.resource_type,
ResourceType::VersionInfo)))`) or assign to `has_version` and
`assert_eq!(has_version, false)` so the test explicitly checks for no
VERSIONINFO resources; update the assertion in the test function that uses
`resources` to reflect this.
- Around line 363-370: The assertion is tautological because
`!resources.is_empty()` makes the `has_version_info`/`has_string_table` checks
redundant; change the check to explicitly detect other resource types instead.
Replace the `assert!(has_version_info || has_string_table ||
!resources.is_empty(), ...)` with a check that uses `resources.iter().any(|r|
!matches!(r.resource_type, ResourceType::VersionInfo |
ResourceType::StringTable))`, i.e. assert `has_version_info || has_string_table
|| resources.iter().any(|r| !matches!(r.resource_type, ResourceType::VersionInfo
| ResourceType::StringTable))` so you only pass when either known types are
present or some other resource type exists (referencing `has_version_info`,
`has_string_table`, `resources` and
`ResourceType::VersionInfo/ResourceType::StringTable`).
In `@src/extraction/pe_resources/version_info.rs`:
- Around line 21-35: In extract_version_info_strings, currently all Err branches
for PeFile::from_bytes, pe.resources(), and resources.version_info() silently
return an empty Vec; change these to log the error (at debug/trace) including
the returned Err value so parsing failures are visible while still returning an
empty Vec for the caller; specifically update the match arms around
PeFile::from_bytes, pe.resources(), and resources.version_info() to capture the
error (e.g., err) and call the project logger (or tracing::debug!) with context
like "failed to parse PE", "failed to get resources", or "failed to get
version_info" plus the err, keeping the final behavior of
extract_version_info_strings unchanged.
In `@src/extraction/tests.rs`:
- Around line 199-236: The ExtractionConfig API is redundant: remove the
obsolete encodings field and update all uses to rely solely on
enabled_encodings; update the ExtractionConfig struct and its Default impl to
drop encodings, change BasicExtractor's checks (e.g. the
ascii_enabled/utf8_enabled/utf16le_enabled/utf16be_enabled logic in
basic_extractor.rs) to reference only config.enabled_encodings.contains(...),
and update tests (e.g. test_basic_extractor_encoding_filtering in tests.rs) to
construct ExtractionConfig with enabled_encodings only. Ensure any
serialization/constructor code and pattern matches expecting encodings are
adjusted or migrated to enabled_encodings to avoid compilation errors.
In `@src/extraction/traits.rs`:
- Around line 43-58: The struct currently has two redundant fields, encodings
and enabled_encodings (both Vec<Encoding> with the same defaults); remove one to
eliminate ambiguity (prefer keeping the more clearly named enabled_encodings) or
consolidate them by renaming/merging encodings into enabled_encodings and update
all usages accordingly (search for references to encodings and
enabled_encodings, e.g., in parsing, defaults, and tests), preserve the original
default values and behavior when merging, and update the doc comment for the
remaining field to explain its purpose and remove the duplicate doc entry.
- Around line 119-156: ExtractionConfig::validate currently misses checks for
inverted/zero bounds and dedup settings; update validate() to ensure max_length
> 0, max_length >= min_length (or strictly > if you prefer to forbid equality),
and if dedup_threshold is Some(t) then t > 0 (reject zero). Specifically, inside
the validate method add checks referencing self.max_length (ensure > 0 and >=
self.min_length or > self.min_length) and self.dedup_threshold (use map or match
to assert t > 0 and return StringyError::ConfigError with a clear message on
failure) so invalid configs are rejected.
In `@src/extraction/utf16/confidence.rs`:
- Around line 90-94: The check for chars.is_empty() is redundant because the
function already returns for empty input earlier (lines ~68-70); remove the
if/else and simply return printable_count as f32 / chars.len() as f32 (using the
existing chars and printable_count identifiers) so the unreachable branch is
eliminated and the calculation is simplified.
In `@src/extraction/utf16/extraction.rs`:
- Around line 95-98: Add an inline justification comment for the
#[allow(clippy::result_unit_err)] on the decode_utf16le_bytes function: explain
that the unit error type is intentional because decode_utf16le returns no
actionable error information and callers only need success/failure, referencing
decode_utf16le and decode_utf16le_bytes to locate the rationale; keep the
comment brief and on the same line or immediately above the attribute.
In `@src/extraction/utf16/mod.rs`:
- Around line 86-98: The current add_with_dedup closure does O(n²) work by
scanning strings with strings.iter_mut().find(); replace it with a HashMap-based
dedup: create a HashMap keyed by (FoundString.offset, FoundString.encoding,
FoundString.text) mapping to the index in strings, update that map as you push
or replace entries, and when a duplicate is seen compare confidence to
strings[index].confidence and either replace strings[index] or skip; update or
insert the map entry accordingly (use the existing FoundString fields offset,
encoding, text, confidence and the add_with_dedup closure to locate where to
change the logic).
In `@src/extraction/utf16/tests.rs`:
- Around line 218-224: Update the test_utf16_odd_length_data test to assert a
concrete expected outcome instead of only ensuring no panic: call
extract_utf16_strings(data, &Utf16ExtractionConfig::default()) and assert that
the returned Vec either is empty or contains the partial decoded string "He"
(since input pairs decode to 'H','e' and the final 0x6C is an incomplete pair),
e.g. assert!(strings.is_empty() || strings.contains(&"He".to_string()));
reference the test function name, Utf16ExtractionConfig, and
extract_utf16_strings to locate and update the test.
- Around line 132-140: The test currently asserts be_string.is_some() and then
enters an if let Some(s) to re-check s.text and s.encoding redundantly; replace
the assert! + if-let pattern with a single unwrap (or use expect) on the result
of strings.iter().find(...) and then assert_eq! directly on the unwrapped value
to remove duplicated checks — locate the be_string variable created from
strings.iter().find(...), the Encoding::Utf16Be reference, and the subsequent
assert_eq! calls to update accordingly.
In `@src/extraction/utf16/validation.rs`:
- Around line 133-134: The #[allow(dead_code)] on is_valid_utf16_sequence lacks
the required justification; add a brief comment above the attribute explaining
why the function is intentionally kept (for example: "kept for future use",
"used by tests/benchmarks", or "exported for FFI but referenced conditionally")
so the attribute is justified and complies with guidelines; update the comment
to reference the function name is_valid_utf16_sequence to make the intent clear.
- Around line 116-120: The legacy wrapper is_printable_utf16le_char currently
passes None for next_code_unit so it cannot validate UTF-16 surrogate pairs
(high surrogates 0xD800–0xDBFF will always return false); update the function's
docstring to explicitly state this limitation and that callers needing
surrogate-pair validation should call is_printable_code_unit_or_pair (or use a
variant that accepts a next_code_unit) instead of relying on this
backward-compatibility wrapper. Mention the behavior for high and low surrogates
and the recommended alternative function name (is_printable_code_unit_or_pair)
in the docs.
- Around line 225-229: The branch checking chars.is_empty() at the end is
unreachable because an early return already handles the empty case; remove the
if/else and replace it with a single return of valid_count as f32 / chars.len()
as f32 (using the existing chars and valid_count variables) so the function
simply computes and returns the ratio without the redundant empty check.
In `@src/main.rs`:
- Around line 47-49: Change the short flag for the min_length CLI option to
avoid ambiguity: update the #[arg(...)] attribute on the min_length field
(currently #[arg(long, short = 'l', default_value_t = 4)]) to use a clearer
short flag such as 'n' or 'm' (e.g., short = 'n'), and update any related
usages, help text, README examples, and tests that reference the old -l flag to
the new short form to keep CLI docs and behavior consistent.
- Around line 74-75: OutputMetadata::new is being called with strings.len() for
both total_count and filtered_count, causing identical values; either document
this is intentional or pass a pre-filter total count instead. If there truly is
no post-extraction filtering, add a brief comment near the call (referencing
OutputMetadata::new, binary_name, output_format, and strings.len()) stating
total and filtered counts are intentionally identical; otherwise compute and use
the pre-filter total (capture the length of the raw/extracted collection before
any filtering into a variable like total_count) and pass that as the first
argument while passing strings.len() as filtered_count.
In `@tests/integration_cli.rs`:
- Around line 33-41: The test cli_invalid_file currently only checks the exit
status; update it to also validate stderr from
Command::new(env!("CARGO_BIN_EXE_stringy")) by converting output.stderr (bytes)
to a UTF-8 string and asserting it contains a helpful error message (e.g.,
non-empty or matches an expected substring such as "No such file" or "error");
modify the test to read let stderr = String::from_utf8_lossy(&output.stderr) and
add an assertion like assert!(!stderr.trim().is_empty(), ...) or
assert!(stderr.contains("expected substring"), ...) to ensure user-facing
diagnostics are present.
- Around line 43-51: The cli_min_length_flag test currently only asserts
success; update the test (cli_min_length_flag) to also read output.stdout (from
Command::new(...).output()), convert it to a UTF-8 string, split into lines (or
parse the expected output format), and assert that every returned string/item
has length >= 20 (or at least one string meets that criteria depending on
expected behavior). Use the existing Command invocation and output variable
names (output) and add assertions on String::from_utf8_lossy(&output.stdout) (or
equivalent) to verify the min-length flag actually filtered/affected results.
---
Outside diff comments:
In `@src/extraction/filters/implementations.rs`:
- Around line 297-302: The expression computing min_pattern_len_for_3_reps
contains redundant casts of total; since total is already an f32, replace the
double-cast `((total as usize) as f32 / 3.0).ceil() as usize` with a direct
floating calculation like `(total / 3.0).ceil() as usize` in the block that uses
variables total and max_pattern_len so min_pattern_len_for_3_reps is computed
without needless casting.
In `@src/extraction/mod.rs`:
- Around line 141-151: The top-level re-export naming is inconsistent:
ascii::extract_from_section is exported as extract_from_section while
utf16::extract_from_section is aliased to extract_utf16_from_section; update
mod.rs to make these symmetric and adjust tests to import from
stringy::extraction. Either (recommended) alias both exports to explicit names
(extract_ascii_from_section and extract_utf16_from_section) by changing the
ascii re-export to export extract_from_section as extract_ascii_from_section and
keep the utf16 alias extract_utf16_from_section, or export both as
extract_from_section within their modules and update tests to import via the
top-level path (stringy::extraction::{extract_ascii_from_section,
extract_utf16_from_section} or stringy::extraction::extract_from_section if you
choose unambiguous names); ensure tests like test_utf16_extraction.rs import
from stringy::extraction instead of nested module paths.
| ``` | ||
| git add src/extraction/traits.rs src/extraction/mod.rs | ||
| git commit -s -m "refactor(extraction): extract traits and config to traits.rs" | ||
| ``` | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Add language specifiers to fenced code blocks.
Multiple code blocks throughout the document lack language specifiers (e.g., lines 63, 103, 142, etc.). Adding bash or shell for command blocks improves syntax highlighting and tooling support.
-```
+```bash
git add src/extraction/traits.rs src/extraction/mod.rs
git commit -s -m "refactor(extraction): extract traits and config to traits.rs"
<details>
<summary>🧰 Tools</summary>
<details>
<summary>🪛 markdownlint-cli2 (0.20.0)</summary>
[warning] 63-63: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
---
[warning] 59-59: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
</details>
</details>
<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against the current code and only fix it if needed.
In @docs/plans/2026-02-16-extraction-refactor-and-cli.md around lines 59 - 63,
Several fenced code blocks (including the one containing the git add/git commit
snippet) lack language specifiers; update those fenced blocks to include an
appropriate shell language tag (e.g., bash or sh) so tooling and syntax
highlighting work correctly, ensuring you add the specifier to both opening
fences and any other command/code blocks referenced in this document.
</details>
<!-- fingerprinting:phantom:poseidon:ocelot -->
<!-- This is an auto-generated comment by CodeRabbit -->
| // Handle string at end of buffer | ||
| if let Some(start) = current_string_start { | ||
| let len = current_string_bytes.len(); | ||
| if len >= config.min_length { | ||
| // Check maximum length if configured | ||
| if let Some(max_len) = config.max_length { | ||
| if len > max_len { | ||
| // Skip this string | ||
| } else { | ||
| let bytes = std::mem::take(&mut current_string_bytes); | ||
| let text = String::from_utf8(bytes).expect("ASCII bytes should be valid UTF-8"); | ||
| strings.push(FoundString { | ||
| text, | ||
| original_text: None, | ||
| encoding: Encoding::Ascii, | ||
| offset: start as u64, | ||
| rva: None, | ||
| section: None, | ||
| length: len as u32, | ||
| tags: Vec::new(), | ||
| score: 0, | ||
| section_weight: None, | ||
| semantic_boost: None, | ||
| noise_penalty: None, | ||
| source: StringSource::SectionData, | ||
| confidence: 1.0, | ||
| }); | ||
| } | ||
| } else { | ||
| let bytes = std::mem::take(&mut current_string_bytes); | ||
| let text = String::from_utf8(bytes).expect("ASCII bytes should be valid UTF-8"); | ||
| strings.push(FoundString { | ||
| text, | ||
| original_text: None, | ||
| encoding: Encoding::Ascii, | ||
| offset: start as u64, | ||
| rva: None, | ||
| section: None, | ||
| length: len as u32, | ||
| tags: Vec::new(), | ||
| score: 0, | ||
| section_weight: None, | ||
| semantic_boost: None, | ||
| noise_penalty: None, | ||
| source: StringSource::SectionData, | ||
| confidence: 1.0, | ||
| }); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
End-of-buffer handling duplicates FoundString construction.
The FoundString creation is duplicated across the max_length check branches and the main loop. Consider extracting a helper closure or function to reduce duplication.
♻️ Suggested refactor to reduce duplication
+ // Helper to create FoundString from accumulated bytes
+ let make_found_string = |bytes: Vec<u8>, start: usize, len: usize| -> FoundString {
+ let text = String::from_utf8(bytes).expect("ASCII bytes should be valid UTF-8");
+ FoundString {
+ text,
+ original_text: None,
+ encoding: Encoding::Ascii,
+ offset: start as u64,
+ rva: None,
+ section: None,
+ length: len as u32,
+ tags: Vec::new(),
+ score: 0,
+ section_weight: None,
+ semantic_boost: None,
+ noise_penalty: None,
+ source: StringSource::SectionData,
+ confidence: 1.0,
+ }
+ };Then use strings.push(make_found_string(bytes, start, len)); in both locations.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/extraction/ascii/extraction.rs` around lines 120 - 169, The end-of-buffer
branch duplicates FoundString construction; extract a helper function or closure
(e.g., make_found_string) that takes the taken bytes, start, len and returns a
FoundString, then replace the duplicated blocks in the max_length branch and the
else branch (and any other places in this loop) with
strings.push(make_found_string(bytes, start, len)); ensure the helper sets the
same fields used now (text from String::from_utf8, Encoding::Ascii, offset from
current_string_start, length, source StringSource::SectionData, confidence 1.0,
etc.) and use std::mem::take(&mut current_string_bytes) only once per
invocation.
| #[test] | ||
| fn cli_min_length_flag() { | ||
| let output = Command::new(env!("CARGO_BIN_EXE_stringy")) | ||
| .args(["tests/fixtures/test_binary_elf", "-l", "20"]) | ||
| .output() | ||
| .expect("Failed to execute stringy"); | ||
|
|
||
| assert!(output.status.success(), "Exit code: {}", output.status); | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
min_length test doesn't verify the flag's effect.
The test only checks exit success but doesn't verify that output strings are actually >= 20 characters. Without this, the test wouldn't catch if the flag was silently ignored.
Suggested improvement
#[test]
fn cli_min_length_flag() {
let output = Command::new(env!("CARGO_BIN_EXE_stringy"))
.args(["tests/fixtures/test_binary_elf", "-l", "20"])
.output()
.expect("Failed to execute stringy");
assert!(output.status.success(), "Exit code: {}", output.status);
+
+ // Verify all output strings meet minimum length
+ let stdout = String::from_utf8_lossy(&output.stdout);
+ for line in stdout.lines().filter(|l| !l.is_empty()) {
+ // In table format, the string is typically the first column
+ // This is a basic sanity check - adjust based on actual output format
+ assert!(
+ line.len() >= 20 || line.starts_with("Offset") || line.contains("---"),
+ "Output line should be >= 20 chars or be a header: {}",
+ line
+ );
+ }
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #[test] | |
| fn cli_min_length_flag() { | |
| let output = Command::new(env!("CARGO_BIN_EXE_stringy")) | |
| .args(["tests/fixtures/test_binary_elf", "-l", "20"]) | |
| .output() | |
| .expect("Failed to execute stringy"); | |
| assert!(output.status.success(), "Exit code: {}", output.status); | |
| } | |
| #[test] | |
| fn cli_min_length_flag() { | |
| let output = Command::new(env!("CARGO_BIN_EXE_stringy")) | |
| .args(["tests/fixtures/test_binary_elf", "-l", "20"]) | |
| .output() | |
| .expect("Failed to execute stringy"); | |
| assert!(output.status.success(), "Exit code: {}", output.status); | |
| // Verify all output strings meet minimum length | |
| let stdout = String::from_utf8_lossy(&output.stdout); | |
| for line in stdout.lines().filter(|l| !l.is_empty()) { | |
| // In table format, the string is typically the first column | |
| // This is a basic sanity check - adjust based on actual output format | |
| assert!( | |
| line.len() >= 20 || line.starts_with("Offset") || line.contains("---"), | |
| "Output line should be >= 20 chars or be a header: {}", | |
| line | |
| ); | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/integration_cli.rs` around lines 43 - 51, The cli_min_length_flag test
currently only asserts success; update the test (cli_min_length_flag) to also
read output.stdout (from Command::new(...).output()), convert it to a UTF-8
string, split into lines (or parse the expected output format), and assert that
every returned string/item has length >= 20 (or at least one string meets that
criteria depending on expected behavior). Use the existing Command invocation
and output variable names (output) and add assertions on
String::from_utf8_lossy(&output.stdout) (or equivalent) to verify the min-length
flag actually filtered/affected results.
There was a problem hiding this comment.
Pull request overview
This PR refactors the extraction module to comply with the 500-line file size limit and wires up the CLI to use the full extraction pipeline. All 6 oversized extraction files (totaling ~6,670 lines) have been split into focused module directories without changing total line count. The CLI now reads binaries, detects format, parses containers, extracts strings, and formats output.
Changes:
- Split 6 oversized extraction module files into focused subdirectories (mod.rs + specialized files + tests.rs)
- Wired up main.rs to implement the full extraction pipeline with format selection (table/json/yara) and min-length filtering
- Updated CI workflows and justfile to use
justrecipes as single source of truth, fixed cargo-dist invocation
Reviewed changes
Copilot reviewed 38 out of 38 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| src/main.rs | Implements CLI pipeline: read binary → detect format → parse → extract → format output |
| tests/integration_cli.rs | New CLI integration tests for file handling, JSON output, error cases, and min-length flag |
| src/extraction/traits.rs | Extracted ExtractionConfig, StringExtractor trait, and BasicExtractor struct from mod.rs |
| src/extraction/helpers.rs | Extracted helper functions (semantic enrichment, printable checks, ASCII/UTF-8 extraction) |
| src/extraction/basic_extractor.rs | Extracted BasicExtractor StringExtractor impl (extract, extract_canonical, extract_from_section) |
| src/extraction/tests.rs | Extracted 570 lines of tests from mod.rs |
| src/extraction/filters/* | Split filters.rs into mod.rs + implementations.rs + tests.rs |
| src/extraction/ascii/* | Split ascii.rs into mod.rs + extraction.rs + tests.rs |
| src/extraction/utf16/* | Split utf16.rs into 6 files: mod.rs + config.rs + validation.rs + confidence.rs + extraction.rs + tests.rs |
| src/extraction/dedup/* | Split dedup.rs into mod.rs + scoring.rs + tests.rs |
| src/extraction/pe_resources/* | Split pe_resources.rs into 6 files: mod.rs + detection.rs + version_info.rs + string_tables.rs + manifests.rs + tests.rs |
| justfile | Added --all-features to build-release/test-ci/coverage, added outdated recipe, fixed dist-check to use dist plan |
| .github/workflows/*.yml | Updated CI workflows to use just recipes instead of raw cargo commands |
| AGENTS.md | Documented CI architecture: use just recipes in CI, cargo-dist invocation pattern |
| docs/plans/*.md | Implementation plan documenting the refactoring process |
| .unwrap_or_else(|| cli.input.display().to_string()); | ||
|
|
||
| let output_format = cli.format.to_output_format(); | ||
| let metadata = OutputMetadata::new(binary_name, output_format, strings.len(), strings.len()); |
There was a problem hiding this comment.
The OutputMetadata constructor is called with strings.len() for both total_strings and filtered_strings parameters. This is incorrect - all strings are deduplicated/filtered by this point, so there's no distinction between total and filtered. Either pass different values to show actual filtering stats, or if the distinction doesn't apply here, the API should be clarified.
- Fix panic in CanonicalString::to_found_string() on empty occurrences (now returns Option<FoundString>) - Add #[allow] justification comments per project rules (4 files) - Remove redundant empty checks in confidence.rs and validation.rs - Strengthen test assertions: replace conditional passes with asserts, fix unused result, fix tautological assertion, implement empty test - Add max_length validation to ExtractionConfig::validate() - Improve CLI tests: check stderr on error, verify min_length filtering - Clarify AGENTS.md CI Architecture wording Signed-off-by: Keith Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Fix UTF-16 length calculation in version_info.rs and string_tables.rs: use encode_utf16().count() * 2 instead of text.len() for source bytes - Fix missing trailing newline in CLI output for table/JSON formatters - Remove unnecessary clone in basic_extractor.rs canonical string path - Use encoding from map key in dedup instead of re-indexing found_strings - Implement empty test bodies for string_tables and manifests detection - Assert unused results in PE resource tests instead of discarding - Add max_length vs min_ascii_length/min_wide_length validation Signed-off-by: Keith Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
| let default_output = Command::new(env!("CARGO_BIN_EXE_stringy")) | ||
| .arg("tests/fixtures/test_binary_elf") | ||
| .output() | ||
| .expect("Failed to execute stringy"); |
There was a problem hiding this comment.
cli_min_length_flag compares line counts against default_output, but never asserts that the default invocation succeeded. If the default run fails and returns empty stdout, this assertion can give misleading results. Add an assert!(default_output.status.success(), ...) before comparing outputs.
| is_printable_code_unit_or_pair(code_unit, next_code_unit); | ||
|
|
||
| if is_printable { | ||
| char_count += 1; | ||
| i += consumed_units * 2; |
There was a problem hiding this comment.
char_count is incremented by 1 even when consumed_units is 2 for surrogate pairs. This makes the length checks (min_length/max_length) effectively count Unicode scalar values, but Utf16ExtractionConfig documents min/max as UTF-16 code units. Either increment by consumed_units (so counts match the config semantics) or update the config docs and related logic to consistently use scalar-count semantics.
| let config = ExtractionConfig { | ||
| min_length: cli.min_length, | ||
| min_ascii_length: cli.min_length, | ||
| ..ExtractionConfig::default() | ||
| }; |
There was a problem hiding this comment.
The CLI --min-length currently only updates min_length/min_ascii_length. If UTF-16 extraction is enabled via config, min_wide_length will remain at its default, so the flag will not consistently apply across encodings. Also consider calling config.validate() (or enforcing a non-zero constraint at the clap level) so invalid values fail fast with a clear error.
- Validate min_length at CLI level (clap range 1..) and call config.validate() before extraction - Set min_wide_length from CLI --min-length flag for consistent filtering across all encodings - Assert default CLI run succeeds before comparing output in tests Signed-off-by: Keith Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Use usize directly for min_length CLI field instead of u64 with cast. config.validate() handles the >= 1 constraint. Signed-off-by: Keith Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/main.rs`:
- Around line 47-49: Add a clap-level range validator to the min_length argument
so invalid values (like 0) are rejected at parse time rather than later in
config.validate(); update the struct field attribute for min_length (the
#[arg(...)] on the min_length: usize field) to include a value_parser using
clap::value_parser!(usize).range(1..) so clap fails fast with a clear message.
| /// Minimum string length in bytes (must be >= 1) | ||
| #[arg(long, short = 'l', default_value_t = 4)] | ||
| min_length: usize, |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider adding value_parser range constraint for better UX.
Currently min_length accepts any usize, including 0, and relies on config.validate() to reject invalid values. While this works, clap's range parser provides immediate feedback with a clearer error message:
#[arg(long, short = 'l', default_value_t = 4, value_parser = clap::value_parser!(usize).range(1..))]
min_length: usize,This fails fast at argument parsing rather than mid-pipeline. The current approach is functional but shifts validation deeper into the call stack.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/main.rs` around lines 47 - 49, Add a clap-level range validator to the
min_length argument so invalid values (like 0) are rejected at parse time rather
than later in config.validate(); update the struct field attribute for
min_length (the #[arg(...)] on the min_length: usize field) to include a
value_parser using clap::value_parser!(usize).range(1..) so clap fails fast with
a clear message.
| let config = ExtractionConfig { | ||
| min_length: cli.min_length, | ||
| min_ascii_length: cli.min_length, | ||
| min_wide_length: cli.min_length, | ||
| ..ExtractionConfig::default() |
There was a problem hiding this comment.
--min-length is documented as a byte length, but the CLI maps it directly into min_wide_length, which is consumed as a UTF-16 code-unit length by the UTF-16 extractor. This makes UTF-16 filtering stricter than intended (eg min-length=4 bytes should correspond to 2 UTF-16 code units). Consider either (a) keeping separate CLI flags for byte vs UTF-16 code units, or (b) converting cli.min_length to min_wide_length = max(1, (cli.min_length + 1) / 2) and documenting the behavior.
| pub fn deduplicate( | ||
| strings: Vec<FoundString>, | ||
| dedup_threshold: Option<usize>, | ||
| _preserve_all_occurrences: bool, | ||
| ) -> Vec<CanonicalString> { |
There was a problem hiding this comment.
The preserve_all_occurrences argument is currently unused (named _preserve_all_occurrences) but the docstring promises different behavior when it is false ("only store occurrence count instead of full metadata"). This is misleading for callers. Either implement the reduced-metadata mode, or remove the parameter and update the docs/API accordingly.
- Fix STRINGTABLE truncated entry bug: break instead of continue when byte_length exceeds remaining buffer, preventing misaligned reads - Fix stale doc paths after refactoring (ascii/mod.rs, pe_resources/mod.rs) - Rename test module internal_tests -> tests (project convention) - Rename misleading test (no_translation -> fallback_language_handling) - Replace no-op match with actual type assertion in metadata validation - Split pe_resources/tests.rs (632 lines) into tests/ directory module - Split dedup/tests.rs (560 lines) into tests/ directory module - Add test for STRINGTABLE truncated entry parsing behavior Signed-off-by: Keith Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/extraction/ascii/mod.rs`:
- Around line 39-108: The AsciiExtractionConfig currently allows invalid values
(min_length == 0 or max_length < min_length); add validation to prevent these
invariants by implementing either a checked constructor and/or a validate()
method: update AsciiExtractionConfig::new (and/or implement TryFrom/try_new) to
return a Result<AsciiExtractionConfig, StringyError> that rejects min_length < 1
and rejects Some(max) where max < min_length, and update Default (or keep it) to
produce valid defaults; ensure any callers that construct AsciiExtractionConfig
(including use of Default or new) call the fallible constructor or validate() so
invalid configs are rejected upstream.
In `@src/extraction/dedup/tests/basic.rs`:
- Around line 102-112: The test relies on insertion order of occurrences; before
asserting positions, sort the occurrences vector by offset to make the test
deterministic: after obtaining canonical = deduplicate(strings, None, true) and
let occ = &canonical[0].occurrences, sort occ in-place by the offset field
(e.g., using sort_by_key or sort_unstable_by_key on occ.offset) and then perform
the existing assertions on occ[0] and occ[1]; this keeps the change localized to
the test (basic.rs) and references the canonical variable and deduplicate call.
In `@src/extraction/dedup/tests/scoring.rs`:
- Around line 201-250: Update the inline comment in the test_length_preservation
unit test to avoid ambiguity about .text.len(): either reword the comment on the
assertion from "4 characters" to "4 bytes (ASCII)" to be precise about
.text.len() returning byte length for ASCII, or change the assertion check to
use text.chars().count() if you intend to assert character count; reference the
test function name test_length_preservation and the calls to
found.text.len()/.length to locate where to adjust the comment or the check.
- Around line 31-35: The test uses magic numbers for scoring; define and export
module-level constants (e.g., OCCURRENCE_BONUS = 5, CROSS_SECTION_BONUS = 10,
MULTI_SOURCE_BONUS = 15) in the scoring module where calculate_combined_score()
is implemented, then import those constants into the test and replace the
hardcoded 5/10/15 in the deduplicate() test assertion (which checks
canonical[0].combined_score) with the named constants so the test reads base +
OCCURRENCE_BONUS + (confidence calculation) and stays in sync if weights change.
In `@src/extraction/pe_resources/mod.rs`:
- Around line 149-170: extract_resources currently swallows PE parse and
resource access errors and returns an empty Vec; change its signature to return
Result<Vec<ResourceMetadata>, StringyError>, map PeFile::from_bytes errors and
pe.resources() errors into appropriate StringyError variants from types.rs
(e.g., a parse/pe variant and a resource-access variant) with contextual
messages (do not use raw strings), and propagate enumerate_resources results via
? or map_err to a StringyError rather than unwrap_or_default; also update
callers to handle the Result return so they can distinguish "no resources" from
"parse failed".
In `@src/extraction/pe_resources/tests/mod.rs`:
- Around line 14-20: The helper function get_fixture_path is used by multiple
test submodules but is currently private; change its visibility to pub(crate) so
other tests in the crate can access it. Locate the function named
get_fixture_path in src/extraction/pe_resources/tests/mod.rs and add the
pub(crate) visibility modifier to its declaration, keeping the implementation
and signature otherwise unchanged so it remains a crate-scoped test helper.
---
Duplicate comments:
In `@src/extraction/pe_resources/string_tables.rs`:
- Around line 81-100: Change this function to return Result<Vec<FoundString>,
StringyError> instead of Vec so parse failures are not swallowed; replace each
early return like from PeFile::from_bytes, pe.resources(), resources.root(), and
root.get_dir(Name::Id(RT_STRING)) with propagation of a StringyError variant
(use the project's StringyError enum from types.rs) that includes context about
which resource parse step failed; ensure downstream callers of this function are
updated to handle the Result and preserve error context rather than mapping
failures to an empty Vec.
- Around line 68-80: The public function extract_string_table_strings currently
lacks a rustdoc example; either add a short usage example showing calling
extract_string_table_strings(data: &[u8]) and what it returns (Vec<FoundString>)
or, if this is internal-only, reduce its visibility to pub(crate) by changing
the signature from pub fn extract_string_table_strings(...) to pub(crate) fn
extract_string_table_strings(...); ensure any added example references the
FoundString type and a minimal input byte slice to satisfy the exported-API
documentation requirement.
- Around line 149-163: The FoundString construction in string_tables.rs leaves
provenance blank (offset=0, rva=None, section_weight=None); update the code that
builds FoundString so it derives the file offset and RVA from the pelite
resource entry or the containing section headers (use the resource entry's data
pointer/base and section header virtual_address to compute offset and rva),
populate original_text if available, set section to ".rsrc" and set
section_weight to 9.0 for resource strings, and keep source as
StringSource::ResourceString so downstream section-aware scoring can use these
fields (update the FoundString fields offset, rva, section_weight accordingly).
| /// Configuration for ASCII string extraction | ||
| /// | ||
| /// Controls minimum and maximum string length filtering. This structure serves as the | ||
| /// foundation for future configuration expansion, including encoding preferences and | ||
| /// tag filters as mentioned in the issue. | ||
| /// | ||
| /// # Default Values | ||
| /// | ||
| /// - `min_length`: 4 (standard minimum to reduce noise) | ||
| /// - `max_length`: None (no upper limit by default) | ||
| /// | ||
| /// # Examples | ||
| /// | ||
| /// ```rust | ||
| /// use stringy::extraction::ascii::AsciiExtractionConfig; | ||
| /// | ||
| /// // Use default configuration | ||
| /// let config = AsciiExtractionConfig::default(); | ||
| /// | ||
| /// // Custom minimum length | ||
| /// let config = AsciiExtractionConfig::new(8); | ||
| /// | ||
| /// // Custom minimum and maximum length | ||
| /// let mut config = AsciiExtractionConfig::default(); | ||
| /// config.max_length = Some(256); | ||
| /// ``` | ||
| #[derive(Debug, Clone)] | ||
| pub struct AsciiExtractionConfig { | ||
| /// Minimum string length in bytes (default: 4) | ||
| pub min_length: usize, | ||
| /// Maximum string length in bytes (default: None, no limit) | ||
| pub max_length: Option<usize>, | ||
| } | ||
|
|
||
| impl Default for AsciiExtractionConfig { | ||
| fn default() -> Self { | ||
| Self { | ||
| min_length: 4, | ||
| max_length: None, | ||
| } | ||
| } | ||
| } | ||
|
|
||
| impl AsciiExtractionConfig { | ||
| /// Create a new AsciiExtractionConfig with custom minimum length | ||
| /// | ||
| /// # Arguments | ||
| /// | ||
| /// * `min_length` - Minimum string length in bytes | ||
| /// | ||
| /// # Returns | ||
| /// | ||
| /// New AsciiExtractionConfig with specified minimum length and default max_length (None) | ||
| /// | ||
| /// # Example | ||
| /// | ||
| /// ```rust | ||
| /// use stringy::extraction::ascii::AsciiExtractionConfig; | ||
| /// | ||
| /// let config = AsciiExtractionConfig::new(8); | ||
| /// assert_eq!(config.min_length, 8); | ||
| /// assert_eq!(config.max_length, None); | ||
| /// ``` | ||
| pub fn new(min_length: usize) -> Self { | ||
| Self { | ||
| min_length, | ||
| max_length: None, | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Validate AsciiExtractionConfig invariants.
min_length can be 0 and max_length can be set below min_length, which can cause excessive output and false positives. Consider a validated constructor (or a validate() method) that returns a StringyError when min_length < 1 or max_length < min_length, or enforce these checks at the CLI boundary.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/extraction/ascii/mod.rs` around lines 39 - 108, The AsciiExtractionConfig
currently allows invalid values (min_length == 0 or max_length < min_length);
add validation to prevent these invariants by implementing either a checked
constructor and/or a validate() method: update AsciiExtractionConfig::new
(and/or implement TryFrom/try_new) to return a Result<AsciiExtractionConfig,
StringyError> that rejects min_length < 1 and rejects Some(max) where max <
min_length, and update Default (or keep it) to produce valid defaults; ensure
any callers that construct AsciiExtractionConfig (including use of Default or
new) call the fallible constructor or validate() so invalid configs are rejected
upstream.
| #[test] | ||
| fn test_length_preservation() { | ||
| // Test that length is preserved correctly for UTF-16 strings | ||
| let strings = vec![ | ||
| FoundString { | ||
| text: "Test".to_string(), | ||
| original_text: None, | ||
| encoding: Encoding::Utf16Le, | ||
| offset: 0x100, | ||
| rva: Some(0x1000), | ||
| section: None, | ||
| length: 8, // 4 characters * 2 bytes = 8 bytes | ||
| tags: vec![], | ||
| score: 10, | ||
| section_weight: None, | ||
| semantic_boost: None, | ||
| noise_penalty: None, | ||
| source: StringSource::SectionData, | ||
| confidence: 0.8, | ||
| }, | ||
| FoundString { | ||
| text: "Test".to_string(), | ||
| original_text: None, | ||
| encoding: Encoding::Utf16Le, | ||
| offset: 0x200, | ||
| rva: Some(0x2000), | ||
| section: None, | ||
| length: 8, | ||
| tags: vec![], | ||
| score: 15, | ||
| section_weight: None, | ||
| semantic_boost: None, | ||
| noise_penalty: None, | ||
| source: StringSource::SectionData, | ||
| confidence: 0.9, | ||
| }, | ||
| ]; | ||
|
|
||
| let canonical = deduplicate(strings, None, true); | ||
| assert_eq!(canonical.len(), 1); | ||
| assert_eq!(canonical[0].occurrences[0].length, 8); | ||
| assert_eq!(canonical[0].occurrences[1].length, 8); | ||
|
|
||
| // Verify to_found_string() uses stored length, not text.len() | ||
| let found = canonical[0] | ||
| .to_found_string() | ||
| .expect("canonical string with occurrences should convert"); | ||
| assert_eq!(found.length, 8); // Should be 8 bytes, not 4 (text.len()) | ||
| assert_eq!(found.text.len(), 4); // But text is still 4 characters | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
LGTM - length preservation test, minor comment clarification.
Good test for UTF-16 byte-length preservation through the dedup pipeline.
Line 249: The comment says "4 characters" but text.len() returns bytes. For ASCII "Test", these happen to be equal. Consider rewording to "4 bytes (ASCII)" for precision, or use text.chars().count() if you intend to count characters.
Clarify comment
- assert_eq!(found.text.len(), 4); // But text is still 4 characters
+ assert_eq!(found.text.len(), 4); // text.len() returns bytes; "Test" is 4 ASCII bytes📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #[test] | |
| fn test_length_preservation() { | |
| // Test that length is preserved correctly for UTF-16 strings | |
| let strings = vec![ | |
| FoundString { | |
| text: "Test".to_string(), | |
| original_text: None, | |
| encoding: Encoding::Utf16Le, | |
| offset: 0x100, | |
| rva: Some(0x1000), | |
| section: None, | |
| length: 8, // 4 characters * 2 bytes = 8 bytes | |
| tags: vec![], | |
| score: 10, | |
| section_weight: None, | |
| semantic_boost: None, | |
| noise_penalty: None, | |
| source: StringSource::SectionData, | |
| confidence: 0.8, | |
| }, | |
| FoundString { | |
| text: "Test".to_string(), | |
| original_text: None, | |
| encoding: Encoding::Utf16Le, | |
| offset: 0x200, | |
| rva: Some(0x2000), | |
| section: None, | |
| length: 8, | |
| tags: vec![], | |
| score: 15, | |
| section_weight: None, | |
| semantic_boost: None, | |
| noise_penalty: None, | |
| source: StringSource::SectionData, | |
| confidence: 0.9, | |
| }, | |
| ]; | |
| let canonical = deduplicate(strings, None, true); | |
| assert_eq!(canonical.len(), 1); | |
| assert_eq!(canonical[0].occurrences[0].length, 8); | |
| assert_eq!(canonical[0].occurrences[1].length, 8); | |
| // Verify to_found_string() uses stored length, not text.len() | |
| let found = canonical[0] | |
| .to_found_string() | |
| .expect("canonical string with occurrences should convert"); | |
| assert_eq!(found.length, 8); // Should be 8 bytes, not 4 (text.len()) | |
| assert_eq!(found.text.len(), 4); // But text is still 4 characters | |
| } | |
| #[test] | |
| fn test_length_preservation() { | |
| // Test that length is preserved correctly for UTF-16 strings | |
| let strings = vec![ | |
| FoundString { | |
| text: "Test".to_string(), | |
| original_text: None, | |
| encoding: Encoding::Utf16Le, | |
| offset: 0x100, | |
| rva: Some(0x1000), | |
| section: None, | |
| length: 8, // 4 characters * 2 bytes = 8 bytes | |
| tags: vec![], | |
| score: 10, | |
| section_weight: None, | |
| semantic_boost: None, | |
| noise_penalty: None, | |
| source: StringSource::SectionData, | |
| confidence: 0.8, | |
| }, | |
| FoundString { | |
| text: "Test".to_string(), | |
| original_text: None, | |
| encoding: Encoding::Utf16Le, | |
| offset: 0x200, | |
| rva: Some(0x2000), | |
| section: None, | |
| length: 8, | |
| tags: vec![], | |
| score: 15, | |
| section_weight: None, | |
| semantic_boost: None, | |
| noise_penalty: None, | |
| source: StringSource::SectionData, | |
| confidence: 0.9, | |
| }, | |
| ]; | |
| let canonical = deduplicate(strings, None, true); | |
| assert_eq!(canonical.len(), 1); | |
| assert_eq!(canonical[0].occurrences[0].length, 8); | |
| assert_eq!(canonical[0].occurrences[1].length, 8); | |
| // Verify to_found_string() uses stored length, not text.len() | |
| let found = canonical[0] | |
| .to_found_string() | |
| .expect("canonical string with occurrences should convert"); | |
| assert_eq!(found.length, 8); // Should be 8 bytes, not 4 (text.len()) | |
| assert_eq!(found.text.len(), 4); // text.len() returns bytes; "Test" is 4 ASCII bytes | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/extraction/dedup/tests/scoring.rs` around lines 201 - 250, Update the
inline comment in the test_length_preservation unit test to avoid ambiguity
about .text.len(): either reword the comment on the assertion from "4
characters" to "4 bytes (ASCII)" to be precise about .text.len() returning byte
length for ASCII, or change the assertion check to use text.chars().count() if
you intend to assert character count; reference the test function name
test_length_preservation and the calls to found.text.len()/.length to locate
where to adjust the comment or the check.
| pub fn extract_resources(data: &[u8]) -> Vec<ResourceMetadata> { | ||
| // Attempt to parse PE using pelite | ||
| let pe = match PeFile::from_bytes(data) { | ||
| Ok(pe) => pe, | ||
| Err(_) => { | ||
| // Graceful degradation: return empty vec on parse error | ||
| // This allows PE parsing to succeed even if resource extraction fails | ||
| return Vec::new(); | ||
| } | ||
| }; | ||
|
|
||
| // Get resource directory | ||
| let resources = match pe.resources() { | ||
| Ok(resources) => resources, | ||
| Err(_) => { | ||
| // No resource directory or error accessing it - not an error condition | ||
| return Vec::new(); | ||
| } | ||
| }; | ||
|
|
||
| // Enumerate all resources - handle errors gracefully | ||
| enumerate_resources(&resources).unwrap_or_default() |
There was a problem hiding this comment.
Return errors instead of empty Vec for resource parsing failures.
extract_resources discards parse failures and returns an empty vector, which hides errors from callers. Please return Result<Vec, StringyError> and propagate context so callers can distinguish "no resources" from "parse failed".
As per coding guidelines, "Use comprehensive error handling with a StringyError enum across the codebase" and "Use StringyError variants from types.rs for error handling, never raw strings".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/extraction/pe_resources/mod.rs` around lines 149 - 170, extract_resources
currently swallows PE parse and resource access errors and returns an empty Vec;
change its signature to return Result<Vec<ResourceMetadata>, StringyError>, map
PeFile::from_bytes errors and pe.resources() errors into appropriate
StringyError variants from types.rs (e.g., a parse/pe variant and a
resource-access variant) with contextual messages (do not use raw strings), and
propagate enumerate_resources results via ? or map_err to a StringyError rather
than unwrap_or_default; also update callers to handle the Result return so they
can distinguish "no resources" from "parse failed".
| /// Helper to get fixture path | ||
| fn get_fixture_path(name: &str) -> std::path::PathBuf { | ||
| Path::new(env!("CARGO_MANIFEST_DIR")) | ||
| .join("tests") | ||
| .join("fixtures") | ||
| .join(name) | ||
| } |
There was a problem hiding this comment.
Use pub(crate) for shared test helpers.
get_fixture_path is used across test submodules; mark it pub(crate) to align with the test helper visibility rule.
Proposed change
-fn get_fixture_path(name: &str) -> std::path::PathBuf {
+pub(crate) fn get_fixture_path(name: &str) -> std::path::PathBuf {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| /// Helper to get fixture path | |
| fn get_fixture_path(name: &str) -> std::path::PathBuf { | |
| Path::new(env!("CARGO_MANIFEST_DIR")) | |
| .join("tests") | |
| .join("fixtures") | |
| .join(name) | |
| } | |
| /// Helper to get fixture path | |
| pub(crate) fn get_fixture_path(name: &str) -> std::path::PathBuf { | |
| Path::new(env!("CARGO_MANIFEST_DIR")) | |
| .join("tests") | |
| .join("fixtures") | |
| .join(name) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/extraction/pe_resources/tests/mod.rs` around lines 14 - 20, The helper
function get_fixture_path is used by multiple test submodules but is currently
private; change its visibility to pub(crate) so other tests in the crate can
access it. Locate the function named get_fixture_path in
src/extraction/pe_resources/tests/mod.rs and add the pub(crate) visibility
modifier to its declaration, keeping the implementation and signature otherwise
unchanged so it remains a crate-scoped test helper.
Address 8 pre-existing design issues surfaced during bot review of PR #135: 1. Remove dual encoding fields: delete `encodings` from ExtractionConfig, keep only `enabled_encodings`. Eliminates confusing OR-checks throughout basic_extractor.rs. 2. Extract shared logic from extract/extract_canonical: factor out the duplicated section-sorting, filtering, symbol-inclusion, and semantic enrichment pipeline into a private `collect_all_strings()` helper. 3. Fix O(n^2) dedup in Auto byte order mode: replace `strings.iter_mut().find()` with HashMap-based O(1) lookup in utf16/mod.rs. 4. Remove unused `_preserve_all_occurrences` parameter from `deduplicate()` signature, ExtractionConfig struct, and all callsites. 5. Remove broken byte order consistency heuristic: the `check_byte_order_consistency()` function operated on decoded u16 values which are identical regardless of original byte order, making LE always score 1.0 and BE always 0.0. Redistribute the 0.1 weight to valid_unicode (0.35) and printable (0.45) heuristics. 6. Fix char_count semantics for surrogate pairs: change `char_count += 1` to `char_count += consumed_units` in utf16/extraction.rs so min/max length checks count UTF-16 code units, not scalar values. 7. Replace `Result<_, ()>` with `StringyError` in UTF-16 decode functions: use `StringyError::ParseError` for the 3 decode functions in utf16/extraction.rs, removing `#[allow(clippy::result_unit_err)]`. 8. Add noncharacter validation in surrogate-pair path: reject U+xFFFE and U+xFFFF in supplementary planes (planes 1-16) in utf16/validation.rs, matching the BMP noncharacter check. Signed-off-by: Kevin Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
| /// Minimum length for ASCII strings (default: 4, same as min_length) | ||
| pub min_ascii_length: usize, | ||
| /// Minimum length for UTF-16 strings (default: 3, for future use) | ||
| pub min_wide_length: usize, | ||
| /// Which encodings to extract (default: ASCII, UTF-8) |
There was a problem hiding this comment.
min_wide_length is documented as a general "minimum length" but is currently used as Utf16ExtractionConfig::min_length (UTF-16 code units) in basic_extractor.rs. The current validate() logic also compares max_length (bytes) against min_wide_length directly, which is inconsistent if min_wide_length is in UTF-16 units. Please clarify the unit for min_wide_length in the doc comment and adjust validation/conversion accordingly (eg validate against max_length / 2 if it is UTF-16 code units, or convert it to code units at the use-site if you want to keep it byte-based).
| if self.max_length < self.min_wide_length { | ||
| return Err(StringyError::ConfigError(format!( | ||
| "max_length ({}) must be >= min_wide_length ({})", | ||
| self.max_length, self.min_wide_length | ||
| ))); |
There was a problem hiding this comment.
validate() currently checks self.max_length < self.min_wide_length, but max_length is in bytes while min_wide_length is used as the UTF-16 min_length (code units). This can both allow invalid UTF-16 configs (if min_wide_length is units) and reject valid ones (if callers treat it as bytes). Align the units here (eg compare max_length / 2 against min_wide_length if min_wide_length is UTF-16 units).
Add .github/actionlint.yml config to suppress lint errors from release.yml, which is auto-generated by cargo-dist and cannot be hand-edited. Signed-off-by: Kevin Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
The cspell tool was configured but never installed in mise, so lint-spell always failed. The project's domain-heavy jargon (binary analysis terms, hex patterns) required 70+ exception words, making the maintenance cost outweigh the marginal value. Typos in identifiers are caught at compile time, and documentation quality is covered by CodeRabbit reviews. Signed-off-by: Kevin Melton <kmelton@evilbitlabs.io> Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Merge Queue StatusRule:
This pull request spent 4 minutes 52 seconds in the queue, including 4 minutes 42 seconds running CI. Required conditions to merge
|
#135) ## Summary - **CI**: Use `just` recipes in CI workflows instead of raw cargo commands; fix cargo-dist invocation (use `dist plan` binary via mise instead of `cargo dist check` subcommand) - **Extraction refactor**: Split 6 oversized files (all >500 lines) in `src/extraction/` into focused module directories -- total lines unchanged, all non-test files now under 500 lines - **CLI**: Wire up the extraction pipeline in `main.rs` with format selection (`--format table|json|yara`) and minimum length filtering (`--min-length`) - **Design debt cleanup**: Resolve 8 pre-existing design issues surfaced during bot review ## Details ### CI changes - `ci.yml`: test, test-cross-platform, and coverage jobs now use `just test-ci`, `just build-release`, `just coverage` - `security.yml`: uses `just deny`, `just outdated`, `just dist-check` (fixes cargo-dist not found) - `docs.yml`: uses `just docs-build` for combined mdbook + rustdoc build - `justfile`: updated recipes with `--all-features`, added `outdated` recipe, fixed `dist-check` to use `dist plan` ### Extraction module splits | Original file | Lines | Split into | |---|---|---| | `extraction/mod.rs` | 1,576 | `mod.rs` (154) + `traits.rs` + `helpers.rs` + `basic_extractor.rs` + `tests.rs` | | `extraction/filters.rs` | 702 | `filters/mod.rs` + `filters/implementations.rs` + `filters/tests.rs` | | `extraction/ascii.rs` | 832 | `ascii/mod.rs` + `ascii/extraction.rs` + `ascii/tests.rs` | | `extraction/utf16.rs` | 1,273 | `utf16/mod.rs` + `utf16/config.rs` + `utf16/validation.rs` + `utf16/confidence.rs` + `utf16/extraction.rs` + `utf16/tests.rs` | | `extraction/dedup.rs` | 838 | `dedup/mod.rs` + `dedup/scoring.rs` + `dedup/tests.rs` | | `extraction/pe_resources.rs` | 1,449 | `pe_resources/mod.rs` + `pe_resources/detection.rs` + `pe_resources/version_info.rs` + `pe_resources/string_tables.rs` + `pe_resources/manifests.rs` + `pe_resources/tests.rs` | ### CLI pipeline - `main.rs`: reads binary, detects format, parses container, extracts strings, formats output - 4 new integration tests in `tests/integration_cli.rs` ### Design debt cleanup (8 items) 1. **Remove dual encoding fields**: Deleted `encodings` from `ExtractionConfig`, kept only `enabled_encodings`. Eliminated confusing OR-checks in `basic_extractor.rs`. 2. **Extract shared logic**: Factored duplicated section-sorting/filtering/symbol-inclusion pipeline from `extract()` and `extract_canonical()` into `collect_all_strings()` helper. 3. **Fix O(n^2) dedup in Auto mode**: Replaced `strings.iter_mut().find()` with HashMap-based O(1) lookup in `utf16/mod.rs`. 4. **Remove unused parameter**: Removed `_preserve_all_occurrences` from `deduplicate()` signature, `ExtractionConfig`, and all callsites. 5. **Remove broken byte order heuristic**: `check_byte_order_consistency()` operated on decoded u16 values (identical regardless of byte order), making LE always score 1.0 and BE always 0.0. Removed and redistributed weight. 6. **Fix char_count semantics**: Changed `char_count += 1` to `+= consumed_units` in UTF-16 extraction so min/max length checks count code units, not scalar values. 7. **Replace `Result<_, ()>`**: Changed 3 UTF-16 decode functions to return `StringyError::ParseError`, removed `#[allow(clippy::result_unit_err)]`. 8. **Add surrogate noncharacter validation**: Reject U+xFFFE/U+xFFFF in supplementary planes (planes 1-16), matching existing BMP noncharacter check. ## Test plan - [x] `just ci-check` passes locally (478 tests, zero clippy warnings) - [ ] CI passes on all platforms (ubuntu, macos, windows) - [x] CodeRabbit/Copilot review feedback addressed - [x] Pre-existing design debt resolved (8 items) Generated with [Claude Code](https://claude.com/claude-code) --------- Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Summary
justrecipes in CI workflows instead of raw cargo commands; fix cargo-dist invocation (usedist planbinary via mise instead ofcargo dist checksubcommand)src/extraction/into focused module directories -- total lines unchanged, all non-test files now under 500 linesmain.rswith format selection (--format table|json|yara) and minimum length filtering (--min-length)Details
CI changes
ci.yml: test, test-cross-platform, and coverage jobs now usejust test-ci,just build-release,just coveragesecurity.yml: usesjust deny,just outdated,just dist-check(fixes cargo-dist not found)docs.yml: usesjust docs-buildfor combined mdbook + rustdoc buildjustfile: updated recipes with--all-features, addedoutdatedrecipe, fixeddist-checkto usedist planExtraction module splits
extraction/mod.rsmod.rs(154) +traits.rs+helpers.rs+basic_extractor.rs+tests.rsextraction/filters.rsfilters/mod.rs+filters/implementations.rs+filters/tests.rsextraction/ascii.rsascii/mod.rs+ascii/extraction.rs+ascii/tests.rsextraction/utf16.rsutf16/mod.rs+utf16/config.rs+utf16/validation.rs+utf16/confidence.rs+utf16/extraction.rs+utf16/tests.rsextraction/dedup.rsdedup/mod.rs+dedup/scoring.rs+dedup/tests.rsextraction/pe_resources.rspe_resources/mod.rs+pe_resources/detection.rs+pe_resources/version_info.rs+pe_resources/string_tables.rs+pe_resources/manifests.rs+pe_resources/tests.rsCLI pipeline
main.rs: reads binary, detects format, parses container, extracts strings, formats outputtests/integration_cli.rsDesign debt cleanup (8 items)
encodingsfromExtractionConfig, kept onlyenabled_encodings. Eliminated confusing OR-checks inbasic_extractor.rs.extract()andextract_canonical()intocollect_all_strings()helper.strings.iter_mut().find()with HashMap-based O(1) lookup inutf16/mod.rs._preserve_all_occurrencesfromdeduplicate()signature,ExtractionConfig, and all callsites.check_byte_order_consistency()operated on decoded u16 values (identical regardless of byte order), making LE always score 1.0 and BE always 0.0. Removed and redistributed weight.char_count += 1to+= consumed_unitsin UTF-16 extraction so min/max length checks count code units, not scalar values.Result<_, ()>: Changed 3 UTF-16 decode functions to returnStringyError::ParseError, removed#[allow(clippy::result_unit_err)].Test plan
just ci-checkpasses locally (478 tests, zero clippy warnings)Generated with Claude Code