Summary
Implement intelligent heuristics to distinguish legitimate ASCII strings from binary noise, padding, and non-textual data during string extraction from binary files.
Problem Context
When extracting ASCII strings from binary files, the analyzer currently captures every byte sequence whose bytes all fall within the ASCII range, whether or not the sequence is actual text. This leads to several quality issues:
- Binary Noise: Random byte sequences that happen to fall within ASCII ranges but carry no semantic meaning
- Padding Data: Repetitive null bytes, spaces, or other padding characters used for alignment
- Table Data: Structured binary data (e.g., lookup tables, numerical arrays) that may contain ASCII-range values but aren't human-readable strings
- Low Entropy Content: Sequences like repeated characters or patterns that don't represent meaningful text
Without proper filtering, these false positives pollute the extraction results, making it difficult for users to identify genuinely useful strings and significantly reducing the tool's practical value.
Proposed Solution
Implement a multi-layered heuristic filtering system in Rust:
1. Character Distribution Analysis
- Calculate character frequency distributions
- Flag strings with abnormal character ratios (e.g., >80% punctuation, all same character)
- Apply entropy calculations to identify low-information content
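As a rough illustration of the distribution checks above, the sketch below computes Shannon entropy by hand with the standard library instead of relying on a crate. The function names and both thresholds (the 0.8 punctuation ratio comes from the example above, the 1.0 bit entropy floor is an assumption) are hypothetical and would need tuning against real binaries.

```rust
use std::collections::HashMap;

/// Shannon entropy of the byte distribution, in bits per byte.
/// Returns 0.0 for empty input.
pub fn shannon_entropy(bytes: &[u8]) -> f64 {
    if bytes.is_empty() {
        return 0.0;
    }
    // Count how often each byte value occurs.
    let mut counts: HashMap<u8, usize> = HashMap::new();
    for &b in bytes {
        *counts.entry(b).or_insert(0) += 1;
    }
    let len = bytes.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / len;
            -p * p.log2()
        })
        .sum()
}

/// Fraction of bytes that are ASCII punctuation (0.0 for empty input).
pub fn punctuation_ratio(bytes: &[u8]) -> f64 {
    if bytes.is_empty() {
        return 0.0;
    }
    let n = bytes.iter().filter(|b| b.is_ascii_punctuation()).count();
    n as f64 / bytes.len() as f64
}

/// Hypothetical helper: flags candidates that are mostly punctuation
/// or carry very little information (e.g. one repeated character).
pub fn looks_like_noise(bytes: &[u8]) -> bool {
    const MAX_PUNCT_RATIO: f64 = 0.8; // assumed threshold
    const MIN_ENTROPY_BITS: f64 = 1.0; // assumed threshold
    punctuation_ratio(bytes) > MAX_PUNCT_RATIO || shannon_entropy(bytes) < MIN_ENTROPY_BITS
}
```

With these assumed thresholds, a run such as "AAAAAAAA" is flagged because its entropy is zero, while an ordinary message like "Hello, world!" passes both checks.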
2. Linguistic Patterns
- Check for reasonable vowel-to-consonant ratios (for English-like strings)
- Identify common word patterns and character bigrams/trigrams
- Validate against common string patterns (paths, URLs, error messages)
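The linguistic checks above could start as simply as the sketch below. It uses the vowel share of all letters as a stand-in for a vowel-to-consonant ratio, and a few prefix tests as a stand-in for real pattern validation. The function names, the 0.2 to 0.6 range, and the pattern list are illustrative assumptions, not measured values.

```rust
/// Share of ASCII letters that are vowels, or None if the string has no letters.
pub fn vowel_ratio(s: &str) -> Option<f64> {
    let mut letters = 0usize;
    let mut vowels = 0usize;
    for c in s.chars().filter(|c| c.is_ascii_alphabetic()) {
        letters += 1;
        if matches!(c.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u') {
            vowels += 1;
        }
    }
    if letters == 0 {
        None
    } else {
        Some(vowels as f64 / letters as f64)
    }
}

/// Hypothetical check: English-like text tends to have a moderate vowel share.
/// The bounds are assumptions chosen for illustration.
pub fn has_plausible_vowel_ratio(s: &str) -> bool {
    const MIN: f64 = 0.2;
    const MAX: f64 = 0.6;
    matches!(vowel_ratio(s), Some(r) if (MIN..=MAX).contains(&r))
}

/// Hypothetical check for a few well-known string shapes:
/// URLs, Unix-style absolute paths, and Windows drive paths.
pub fn matches_known_pattern(s: &str) -> bool {
    const URL_PREFIXES: [&str; 3] = ["http://", "https://", "ftp://"];
    URL_PREFIXES.iter().any(|p| s.starts_with(p))
        || s.starts_with('/')
        || s.contains(":\\")
}
```

A pattern match would normally override a poor vowel ratio, since paths and URLs often contain abbreviations with few vowels.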
3. Context-Aware Filtering
- Leverage binary section information (.text, .data, .rodata, etc.)
- Apply different heuristics based on section type
- Consider string location and surrounding bytes
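One possible shape for section-aware filtering is sketched below. It assumes the section name has already been obtained from a binary parser and deliberately makes no parser calls. The SectionKind enum, the function names, and the minimum lengths are hypothetical; the idea is only that code sections get a stricter threshold than read-only data.

```rust
/// Hypothetical coarse classification of a binary section.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SectionKind {
    Code,
    ReadOnlyData,
    WritableData,
    Other,
}

/// Maps a few common ELF, PE, and Mach-O section names to a kind.
pub fn classify_section(name: &str) -> SectionKind {
    match name {
        ".text" | "__text" => SectionKind::Code,
        ".rodata" | ".rdata" | "__cstring" => SectionKind::ReadOnlyData,
        ".data" | "__data" => SectionKind::WritableData,
        _ => SectionKind::Other,
    }
}

/// Minimum candidate length per section kind (assumed values).
/// Code sections are the most likely to produce accidental ASCII runs,
/// so they get the strictest threshold.
pub fn min_length_for(kind: SectionKind) -> usize {
    match kind {
        SectionKind::ReadOnlyData => 4,
        SectionKind::WritableData => 5,
        SectionKind::Other => 6,
        SectionKind::Code => 8,
    }
}

/// Returns true if the candidate is long enough for the section it came from.
pub fn passes_section_filter(section_name: &str, candidate: &[u8]) -> bool {
    candidate.len() >= min_length_for(classify_section(section_name))
}
```

Under these assumed values, the five-byte candidate "usage" is kept when found in .rodata but dropped when found in .text.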
4. Configurable Thresholds
- Allow users to adjust sensitivity levels
- Provide options to include/exclude specific pattern types
- Support allowlist/denylist patterns
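The configuration surface described above might look something like the following sketch. Every field name and default value here is an assumption made for illustration, and the list patterns are treated as plain substrings to keep the example free of external crates.

```rust
/// Hypothetical user-facing filter configuration.
#[derive(Debug, Clone)]
pub struct FilterConfig {
    /// Minimum confidence score (0.0 to 1.0) a string needs to be kept.
    pub min_confidence: f64,
    /// Toggle for the character distribution and entropy checks.
    pub check_entropy: bool,
    /// Toggle for the linguistic pattern checks.
    pub check_linguistic: bool,
    /// Substrings that force a string to be kept.
    pub allowlist: Vec<String>,
    /// Substrings that force a string to be dropped.
    pub denylist: Vec<String>,
}

impl Default for FilterConfig {
    fn default() -> Self {
        FilterConfig {
            min_confidence: 0.5, // assumed default sensitivity
            check_entropy: true,
            check_linguistic: true,
            allowlist: Vec::new(),
            denylist: Vec::new(),
        }
    }
}

impl FilterConfig {
    /// Returns Some(true) if the string is allowlisted, Some(false) if it is
    /// denylisted, and None if neither list applies. The allowlist wins when
    /// both lists match.
    pub fn list_override(&self, s: &str) -> Option<bool> {
        if self.allowlist.iter().any(|p| s.contains(p.as_str())) {
            return Some(true);
        }
        if self.denylist.iter().any(|p| s.contains(p.as_str())) {
            return Some(false);
        }
        None
    }
}
```

Checking the lists before running any heuristic keeps user intent authoritative: a None result means the string should fall through to the normal checks.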
5. Scoring System
- Assign confidence scores to extracted strings
- Allow filtering by minimum confidence threshold
- Provide transparency into why strings were included/excluded
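A minimal sketch of how scoring and transparency could fit together is shown below: each failed check subtracts a penalty and records a human-readable reason. The Verdict type, the function names, the specific checks, and the penalty weights are all hypothetical placeholders for whatever heuristics are eventually chosen.

```rust
/// Hypothetical result of scoring one candidate string.
#[derive(Debug, Clone, PartialEq)]
pub struct Verdict {
    /// Confidence that the candidate is a real string, in 0.0..=1.0.
    pub score: f64,
    /// Why the score was reduced; empty when no check fired.
    pub reasons: Vec<&'static str>,
}

/// Scores a candidate by starting at 1.0 and subtracting assumed penalties.
pub fn score_candidate(s: &str) -> Verdict {
    let bytes = s.as_bytes();
    let mut score = 1.0_f64;
    let mut reasons = Vec::new();

    if bytes.len() < 4 {
        score -= 0.3;
        reasons.push("shorter than 4 bytes");
    }
    if !bytes.is_empty() && bytes.iter().all(|&b| b == bytes[0]) {
        score -= 0.6;
        reasons.push("single repeated character");
    }
    // Integer form of "punctuation ratio > 0.8" to avoid float comparison.
    let punct = bytes.iter().filter(|b| b.is_ascii_punctuation()).count();
    if !bytes.is_empty() && punct * 5 > bytes.len() * 4 {
        score -= 0.4;
        reasons.push("more than 80% punctuation");
    }

    Verdict {
        score: score.clamp(0.0, 1.0),
        reasons,
    }
}

/// Keeps a candidate only if it meets the user's minimum confidence.
pub fn keep(verdict: &Verdict, min_confidence: f64) -> bool {
    verdict.score >= min_confidence
}
```

Returning the reasons alongside the score lets the tool explain an exclusion, for example by printing them in a verbose mode.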
Acceptance Criteria
Technical Notes
- Consider using the entropy crate for information theory calculations
- Section context can be obtained from ELF/PE/Mach-O parsers (e.g., the goblin crate)
- Performance-critical code should be profiled and optimized
- Heuristics should be modular to allow easy addition/removal
Requirements
1.4, 2.1
Dependencies
Task-ID
stringy-analyzer/ascii-noise-filtering