Add String Extraction Framework with Noise Filtering and Benchmarks #108
…tring extraction - Added the BasicExtractor struct to implement a sequential ASCII/UTF-8 string extraction algorithm from binary data. - Introduced the ExtractionConfig struct to allow customization of extraction parameters, including minimum and maximum string lengths, encoding selection, and section filtering. - Updated documentation to include usage examples for both BasicExtractor and ExtractionConfig. - Added integration tests to validate the functionality of string extraction from various binary formats, including ELF and PE. This enhancement improves the library's ability to extract meaningful strings from binary files, facilitating better analysis and understanding of their content. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
…entation - Introduced a new module for ASCII string extraction, providing foundational functionality for extracting contiguous printable ASCII sequences from binary data. - Implemented `extract_ascii_strings` and `extract_from_section` functions, along with an `ExtractionConfig` struct for customizable extraction parameters. - Updated the main extraction module documentation to include details about the new ASCII extraction capabilities and usage examples. - Enhanced the existing documentation to reflect the addition of ASCII extraction in the library's structure and functionality. This addition improves the library's ability to handle ASCII string extraction, paving the way for future enhancements in UTF-8 and other encoding types. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
…g and update related documentation - Renamed `ExtractionConfig` to `AsciiExtractionConfig` to better reflect its purpose in ASCII string extraction. - Updated all references in the codebase and documentation to use the new name, ensuring consistency. - Enhanced documentation to clarify usage examples and configuration details for ASCII extraction. - Improved function signatures and comments to align with the new naming convention. This refactor improves code clarity and maintains a consistent naming scheme across the extraction module, facilitating better understanding and usage of the ASCII extraction functionality. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
…g and new benchmarks - Added a new `entropy` dependency to support advanced noise filtering capabilities in ASCII string extraction. - Introduced a new benchmark for ASCII extraction, `ascii_extraction`, to evaluate performance under various conditions. - Updated the `FoundString` struct to include a `confidence` field, representing the likelihood of a string being legitimate versus noise. - Enhanced the ASCII extraction functions to compute confidence scores using a composite noise filter, allowing for better filtering of false positives. - Updated documentation to reflect the new noise filtering features and provided examples for configuring extraction parameters. - Added unit and integration tests to validate the new noise filtering functionality and ensure robust performance. This enhancement significantly improves the accuracy and reliability of ASCII string extraction, making it more effective for analyzing binary data. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Caution: Review failed. The pull request is closed.
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
Summary by CodeRabbit
Release Notes
Walkthrough
Adds a complete ASCII/UTF-8 string extraction framework with a BasicExtractor and ExtractionConfig, a multi-filter noise-filtering system with weighted filters and per-string confidence, AsciiExtraction utilities and benchmarks, docs updates, and extensive unit/integration tests; integrates confidence into existing extraction paths and re-exports new APIs at the crate root.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Caller as User Code
participant BE as BasicExtractor
participant AS as ASCII Scanner
participant CNF as CompositeNoiseFilter
participant SF as Sub-filter
participant FS as FoundString
Caller->>BE: extract(data, ExtractionConfig)
activate BE
BE->>AS: scan section bytes (AsciiExtractionConfig)
AS-->>BE: candidates (Vec<FoundString>)
alt noise_filtering_enabled
BE->>CNF: CompositeNoiseFilter::new(NoiseFilterConfig)
loop each candidate
BE->>CNF: calculate_confidence(text, FilterContext)
loop enabled sub-filters
CNF->>SF: calculate_confidence(text, context)
SF-->>CNF: score
end
CNF-->>BE: normalized confidence
BE->>FS: set confidence, apply threshold
end
else filtering_disabled
loop each candidate
BE->>FS: set confidence = 1.0
end
end
BE-->>Caller: filtered Vec<FoundString>
deactivate BE
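The flow in the diagram (each enabled sub-filter scores a candidate, the composite normalizes the result, and a threshold is applied) can be sketched as a weighted average. This is a minimal illustration, not the crate's implementation: the `NoiseFilter` trait, the two toy filters, `with_filter`, `filter_candidates`, and the weights are all assumptions made for the example.

```rust
/// Hypothetical sub-filter interface: scores a candidate in [0.0, 1.0].
trait NoiseFilter {
    fn calculate_confidence(&self, text: &str) -> f32;
}

/// Toy filter (assumption): penalizes strings shorter than four bytes.
struct LengthFilter;

impl NoiseFilter for LengthFilter {
    fn calculate_confidence(&self, text: &str) -> f32 {
        if text.len() >= 4 { 1.0 } else { 0.25 }
    }
}

/// Toy filter (assumption): rejects strings made of one repeated character.
struct RepetitionFilter;

impl NoiseFilter for RepetitionFilter {
    fn calculate_confidence(&self, text: &str) -> f32 {
        let first = match text.chars().next() {
            Some(c) => c,
            None => return 1.0,
        };
        let repeated = text.chars().count() > 1 && text.chars().all(|c| c == first);
        if repeated { 0.0 } else { 1.0 }
    }
}

/// Combines sub-filter scores as a weighted average.
struct CompositeFilter {
    filters: Vec<(Box<dyn NoiseFilter>, f32)>,
}

impl CompositeFilter {
    fn new() -> Self {
        Self { filters: Vec::new() }
    }

    fn with_filter(mut self, filter: Box<dyn NoiseFilter>, weight: f32) -> Self {
        self.filters.push((filter, weight));
        self
    }

    /// Normalized confidence: sum(score * weight) / sum(weight).
    /// With no filters enabled the candidate keeps full confidence.
    fn calculate_confidence(&self, text: &str) -> f32 {
        let total_weight: f32 = self.filters.iter().map(|(_, w)| *w).sum();
        if total_weight <= 0.0 {
            return 1.0;
        }
        let weighted: f32 = self
            .filters
            .iter()
            .map(|(f, w)| f.calculate_confidence(text) * *w)
            .sum();
        (weighted / total_weight).clamp(0.0, 1.0)
    }
}

/// Scores every candidate and keeps those at or above `threshold`.
fn filter_candidates<'a>(
    composite: &CompositeFilter,
    candidates: &[&'a str],
    threshold: f32,
) -> Vec<(&'a str, f32)> {
    candidates
        .iter()
        .map(|s| (*s, composite.calculate_confidence(s)))
        .filter(|(_, c)| *c >= threshold)
        .collect()
}
```

Dividing by the total weight keeps the result in the 0.0 to 1.0 range regardless of how many sub-filters are enabled, which is one plausible reading of "normalized confidence" in the diagram.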
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
📒 Files selected for processing (17)
Warning: Review ran into problems
🔥 Problems
Errors were encountered while retrieving linked issues.
Errors (1)
- Bumped `entropy` to version 0.4.2 for improved functionality. - Updated `pelite` to version 0.10.0 for better compatibility. - Upgraded `serde_json` to version 1.0.145 to incorporate the latest features and fixes. - Updated `insta` to version 1.43.2 and `tempfile` to version 3.23.0 for enhanced testing capabilities. These updates ensure the project utilizes the latest versions of dependencies, improving overall stability and performance. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Updated `actions/upload-artifact` from version 4 to 5 across multiple workflows for improved artifact management. - Updated `actions/download-artifact` from version 5 to 6 to leverage new features and enhancements. - Updated `actions/attest-build-provenance` from version 2 to 3 for better build provenance tracking. - Adjusted the `targets` list formatting in `dist-workspace.toml` for improved readability. These updates ensure the workflows are utilizing the latest versions of the actions, enhancing overall CI/CD performance and reliability. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
@coderabbitai fix the PR title

Tip: For best results, initiate chat on the files or code changes.
🧩 Analysis chain
@unclesp1d3r I'll update the PR title to be more descriptive and properly formatted.
🏁 Script executed:
Length of output: 128
@unclesp1d3r Done! I've updated the PR title from:
Old:
New:
The new title is properly capitalized, removes the issue number prefix and redundant PR reference, and clearly describes the key features introduced in this PR.

Pull Request Overview
This pull request introduces a comprehensive string extraction and noise filtering framework for StringyMcStringFace. The implementation provides ASCII string extraction as a foundational encoding type, along with multi-layered heuristic noise filtering to reduce false positives. The changes also include extensive benchmarking infrastructure and detailed documentation.
Key Changes:
- Added ASCII string extraction module with configurable min/max length filtering
- Implemented 6-layer noise filtering system (entropy, character distribution, linguistic patterns, length, repetition, context-aware)
- Added `confidence` field (0.0-1.0) to `FoundString` to track noise filtering results
- Introduced `StringExtractor` trait and `BasicExtractor` implementation for framework-based extraction
- Added comprehensive benchmark suite for performance validation
- Expanded documentation with filtering architecture details and usage examples
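The entropy layer in the list above relies on Shannon entropy. The PR takes it from the `entropy` crate; the sketch below computes it by hand with the standard library only so that it is self-contained. The `entropy_confidence` mapping, its bounds, and the 0.3 penalty value are assumptions for illustration, not the crate's actual thresholds.

```rust
/// Shannon entropy of a byte slice in bits per byte (0.0 for empty input).
fn shannon_entropy(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    // Count how often each byte value occurs.
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let len = data.len() as f64;
    // H = -sum(p * log2(p)) over byte values that actually occur.
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / len;
            -p * p.log2()
        })
        .sum()
}

/// Hypothetical mapping from entropy to confidence: strings whose entropy
/// falls outside [min_bits, max_bits] are treated as likely noise.
fn entropy_confidence(text: &str, min_bits: f64, max_bits: f64) -> f32 {
    let h = shannon_entropy(text.as_bytes());
    if h < min_bits || h > max_bits { 0.3 } else { 1.0 }
}
```

Very low entropy tends to indicate padding or repeated bytes, while very high entropy tends to indicate compressed or random data, so a band-pass check of this kind is a common heuristic.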
Reviewed Changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `src/extraction/ascii.rs` | New ASCII extraction module with byte-level scanning |
| `src/extraction/filters.rs` | Multi-layered noise filtering implementation |
| `src/extraction/config.rs` | Configuration structures for extraction and filtering |
| `src/extraction/mod.rs` | Core extraction framework with StringExtractor trait |
| `src/types.rs` | Added confidence field to FoundString |
| `tests/test_noise_filters.rs` | Comprehensive unit tests for noise filters |
| `tests/test_ascii_extraction.rs` | Unit tests for ASCII extraction |
| `tests/test_ascii_integration.rs` | Integration tests with noise filtering |
| `tests/integration_extraction.rs` | Integration tests for extraction framework |
| `benches/ascii_extraction.rs` | Benchmark suite for extraction and filtering |
| `docs/src/string-extraction.md` | Expanded documentation for extraction and filtering |
| `Cargo.toml` | Added entropy dependency and benchmark registration |

    pub fn is_printable_ascii(byte: u8) -> bool {
        (0x20..=0x7E).contains(&byte)
[nitpick] The `is_printable_ascii` function in the ASCII module excludes whitespace control characters (tab, newline, carriage return) from the printable range, only accepting 0x20-0x7E. However, the documentation mentions that this differs from `is_printable_text_byte` in `extraction::mod`, which includes these characters. This inconsistency may cause confusion when strings contain whitespace.
Consider whether ASCII extraction should include common whitespace characters like tab (\t, 0x09), newline (\n, 0x0A), and carriage return (\r, 0x0D) to better handle formatted text and multi-line strings. The current strict definition may filter out legitimate strings that contain these characters.
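To make the trade-off concrete, the sketch below runs the same scanner with the strict predicate quoted above and with a relaxed one that also accepts tab, newline, and carriage return. `is_printable_or_whitespace` and `extract_runs` are hypothetical names for this illustration and are not the crate's API.

```rust
/// Strict predicate quoted in the review: space (0x20) through tilde (0x7E).
fn is_printable_ascii(byte: u8) -> bool {
    (0x20..=0x7E).contains(&byte)
}

/// Hypothetical relaxed predicate: also accepts tab, newline, carriage return.
fn is_printable_or_whitespace(byte: u8) -> bool {
    is_printable_ascii(byte) || matches!(byte, b'\t' | b'\n' | b'\r')
}

/// Collects (offset, text) for every run of accepted bytes that is at least
/// `min_length` bytes long.
fn extract_runs(data: &[u8], min_length: usize, accept: fn(u8) -> bool) -> Vec<(usize, String)> {
    let mut found = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &b) in data.iter().enumerate() {
        match (accept(b), start) {
            // A run begins at the first accepted byte.
            (true, None) => start = Some(i),
            // A rejected byte ends the current run.
            (false, Some(s)) => {
                if i - s >= min_length {
                    found.push((s, String::from_utf8_lossy(&data[s..i]).into_owned()));
                }
                start = None;
            }
            _ => {}
        }
    }
    // Flush a run that extends to the end of the buffer.
    if let Some(s) = start {
        if data.len() - s >= min_length {
            found.push((s, String::from_utf8_lossy(&data[s..]).into_owned()));
        }
    }
    found
}
```

On the input `b"\x00key\tvalue\x00ok\x00"` with a minimum length of 4, the strict predicate splits at the tab and keeps only `value`, while the relaxed predicate keeps `key\tvalue` as one string.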
    // Check for repeated substrings (optimized to avoid O(n^3))
    // Cap pattern_len to a small bound (8-16) to avoid excessive computation
    let max_pattern_len = (total as usize / 3).min(16).min(chars.len());

    if total >= 6.0 && max_pattern_len > 0 {
        // Early exit optimization: if we can't possibly get 3 repetitions, skip
        let min_pattern_len_for_3_reps = ((total as usize) as f32 / 3.0).ceil() as usize;
        if min_pattern_len_for_3_reps > max_pattern_len {
            return 1.0; // Can't have 3 repetitions, so no issue
        }

        // Check patterns starting from length 1 up to max_pattern_len
        for pattern_len in 1..=max_pattern_len {
            // Early exit: if pattern_len is too large to repeat 3 times, skip
            if pattern_len * 3 > chars.len() {
                break;
            }

            // Use slice comparison instead of constructing String
            let pattern_slice = &chars[0..pattern_len];
            let mut count = 1; // First occurrence
            let mut pos = pattern_len;

            // Check for repetitions
            while pos + pattern_len <= chars.len() && count < 3 {
                let candidate_slice = &chars[pos..pos + pattern_len];
                // Compare slices directly (char comparison)
                if pattern_slice == candidate_slice {
                    count += 1;
                    pos += pattern_len;
                } else {
                    break; // Pattern broken, try next pattern length
                }
            }

            if count >= 3 {
                return 0.2; // Low confidence (repetitive pattern)
            }
        }

[nitpick] The repetition filter uses a nested loop with pattern matching that could be inefficient for very long strings. While there's a `max_pattern_len` cap at 16 characters, the algorithm still has O(n²) complexity in the worst case when checking multiple pattern lengths. For strings approaching the `max_length` threshold (200 characters default), this could cause performance degradation.
Consider adding early exit optimizations or reducing the max pattern length further if benchmarks show this is a bottleneck.
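One way to keep the work bounded is to compare only the first three chunks for each candidate pattern length. The sketch below is a standalone illustration under that assumption: `has_repeated_prefix`, `repetition_confidence`, and the two constants are names chosen here, and the sketch omits the excerpt's up-front early exit, relying only on the per-length bound.

```rust
/// Assumed cap on the pattern length, mirroring the 16 in the excerpt.
const MAX_PATTERN_LEN: usize = 16;
/// Number of back-to-back repetitions that counts as "repetitive".
const MIN_REPS: usize = 3;

/// Returns true if `chars` starts with some pattern of length
/// 1..=MAX_PATTERN_LEN repeated at least MIN_REPS times in a row.
fn has_repeated_prefix(chars: &[char]) -> bool {
    for pattern_len in 1..=MAX_PATTERN_LEN {
        // Longer patterns cannot fit MIN_REPS times, so stop here.
        if pattern_len * MIN_REPS > chars.len() {
            break;
        }
        let pattern = &chars[..pattern_len];
        // Only the first MIN_REPS chunks are inspected, so the cost per
        // pattern length does not grow with the length of the string.
        let repeats = chars
            .chunks_exact(pattern_len)
            .take(MIN_REPS)
            .all(|chunk| chunk == pattern);
        if repeats {
            return true;
        }
    }
    false
}

/// Maps the check to a score; 0.2 and the minimum of six characters follow
/// the values visible in the quoted excerpt.
fn repetition_confidence(text: &str) -> f32 {
    let chars: Vec<char> = text.chars().collect();
    if chars.len() >= 6 && has_repeated_prefix(&chars) { 0.2 } else { 1.0 }
}
```

In this sketch each pattern length costs at most three slice comparisons of up to 16 characters, so the total work is bounded by a constant once the string has been collected into a `Vec<char>`.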
    pub encodings: Vec<Encoding>,
    /// Whether to scan executable sections (default: true)
    pub scan_code_sections: bool,
    /// Whether to include debug sections (default: false)
    pub include_debug: bool,
    /// Section types to prioritize (default: StringData, ReadOnlyData, Resources)
    pub section_priority: Vec<SectionType>,
    /// Whether to include import/export names (default: true)
    pub include_symbols: bool,
    /// Minimum length for ASCII strings (default: 4, same as min_length)
    pub min_ascii_length: usize,
    /// Minimum length for UTF-16 strings (default: 3, for future use)
    pub min_wide_length: usize,
    /// Which encodings to extract (default: ASCII, UTF-8)
    pub enabled_encodings: Vec<Encoding>,

The `ExtractionConfig` struct has both `encodings` and `enabled_encodings` fields (lines 139 and 153), which appear to serve the same purpose. This duplication creates confusion about which field should be used and requires checking both in the code (see lines 467-468 and 493-494 in `mod.rs`, where both are checked with `||`).
Recommend consolidating these into a single field, likely `enabled_encodings`, to simplify the API and reduce confusion.
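A consolidated configuration might look like the sketch below. It is a reduced illustration rather than the crate's actual struct: the `Encoding` variant names and the `is_enabled` helper are assumptions, and only the defaults stated in the quoted doc comments (minimum length 4, ASCII and UTF-8 enabled) are taken from the source.

```rust
/// Hypothetical encoding enum; variant names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Encoding {
    Ascii,
    Utf8,
    Utf16Le,
}

/// Reduced config with a single source of truth for enabled encodings.
#[derive(Debug, Clone)]
struct ExtractionConfig {
    /// Minimum string length (default: 4).
    min_length: usize,
    /// Which encodings to extract (default: ASCII, UTF-8).
    enabled_encodings: Vec<Encoding>,
}

impl Default for ExtractionConfig {
    fn default() -> Self {
        Self {
            min_length: 4,
            enabled_encodings: vec![Encoding::Ascii, Encoding::Utf8],
        }
    }
}

impl ExtractionConfig {
    /// Single lookup that replaces checking two fields with `||`.
    fn is_enabled(&self, encoding: Encoding) -> bool {
        self.enabled_encodings.contains(&encoding)
    }
}
```

Routing every check through one helper means call sites cannot disagree about which field is authoritative.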
…108)

* feat(extraction): Introduce BasicExtractor and ExtractionConfig for string extraction
  - Added the BasicExtractor struct to implement a sequential ASCII/UTF-8 string extraction algorithm from binary data.
  - Introduced the ExtractionConfig struct to allow customization of extraction parameters, including minimum and maximum string lengths, encoding selection, and section filtering.
  - Updated documentation to include usage examples for both BasicExtractor and ExtractionConfig.
  - Added integration tests to validate the functionality of string extraction from various binary formats, including ELF and PE.
  This enhancement improves the library's ability to extract meaningful strings from binary files, facilitating better analysis and understanding of their content.
  Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* feat(extraction): Add ASCII string extraction module and update documentation
  - Introduced a new module for ASCII string extraction, providing foundational functionality for extracting contiguous printable ASCII sequences from binary data.
  - Implemented `extract_ascii_strings` and `extract_from_section` functions, along with an `ExtractionConfig` struct for customizable extraction parameters.
  - Updated the main extraction module documentation to include details about the new ASCII extraction capabilities and usage examples.
  - Enhanced the existing documentation to reflect the addition of ASCII extraction in the library's structure and functionality.
  This addition improves the library's ability to handle ASCII string extraction, paving the way for future enhancements in UTF-8 and other encoding types.
  Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* refactor(extraction): Rename ExtractionConfig to AsciiExtractionConfig and update related documentation
  - Renamed `ExtractionConfig` to `AsciiExtractionConfig` to better reflect its purpose in ASCII string extraction.
  - Updated all references in the codebase and documentation to use the new name, ensuring consistency.
  - Enhanced documentation to clarify usage examples and configuration details for ASCII extraction.
  - Improved function signatures and comments to align with the new naming convention.
  This refactor improves code clarity and maintains a consistent naming scheme across the extraction module, facilitating better understanding and usage of the ASCII extraction functionality.
  Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* feat(extraction): Enhance ASCII string extraction with noise filtering and new benchmarks
  - Added a new `entropy` dependency to support advanced noise filtering capabilities in ASCII string extraction.
  - Introduced a new benchmark for ASCII extraction, `ascii_extraction`, to evaluate performance under various conditions.
  - Updated the `FoundString` struct to include a `confidence` field, representing the likelihood of a string being legitimate versus noise.
  - Enhanced the ASCII extraction functions to compute confidence scores using a composite noise filter, allowing for better filtering of false positives.
  - Updated documentation to reflect the new noise filtering features and provided examples for configuring extraction parameters.
  - Added unit and integration tests to validate the new noise filtering functionality and ensure robust performance.
  This enhancement significantly improves the accuracy and reliability of ASCII string extraction, making it more effective for analyzing binary data.
  Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* chore(deps): Update dependencies in Cargo.toml
  - Bumped `entropy` to version 0.4.2 for improved functionality.
  - Updated `pelite` to version 0.10.0 for better compatibility.
  - Upgraded `serde_json` to version 1.0.145 to incorporate the latest features and fixes.
  - Updated `insta` to version 1.43.2 and `tempfile` to version 3.23.0 for enhanced testing capabilities.
  These updates ensure the project utilizes the latest versions of dependencies, improving overall stability and performance.
  Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* chore(ci): Update GitHub Actions dependencies and configuration
  - Updated `actions/upload-artifact` from version 4 to 5 across multiple workflows for improved artifact management.
  - Updated `actions/download-artifact` from version 5 to 6 to leverage new features and enhancements.
  - Updated `actions/attest-build-provenance` from version 2 to 3 for better build provenance tracking.
  - Adjusted the `targets` list formatting in `dist-workspace.toml` for improved readability.
  These updates ensure the workflows are utilizing the latest versions of the actions, enhancing overall CI/CD performance and reliability.
  Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

---------

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
This pull request introduces a comprehensive benchmarking suite for ASCII string extraction and noise filtering, adds the `entropy` crate as a dependency, and significantly expands the documentation to detail the filtering architecture, configuration, and usage. The most important changes are grouped below:

Benchmarking and Performance Evaluation
- `benches/ascii_extraction.rs` measures the performance of ASCII string extraction and all individual noise filters, including entropy, character distribution, linguistic, length, repetition, context-aware, and composite filtering. It also benchmarks large binary extraction scenarios.
- `ascii_extraction` is registered in `Cargo.toml` for Criterion-based performance testing.

Dependency Updates
- The `entropy` crate (version 0.4) is added to `Cargo.toml` to enable Shannon entropy calculations in both filtering logic and benchmarks.

Documentation Improvements: Extraction & Filtering
- `docs/src/string-extraction.md` is updated to clarify algorithm details, provide code examples, and explain configuration options for minimum/maximum length. A new section covers UTF-8 extraction.

These changes provide robust performance testing, enable advanced filtering, and make the documentation much more actionable for users and developers.