Implement ASCII String Noise Filtering with Intelligent Heuristics #10

@unclesp1d3r

Summary

Implement intelligent heuristics to distinguish legitimate ASCII strings from binary noise, padding, and non-textual data during string extraction from binary files.

Problem Context

When extracting ASCII strings from binary files, the analyzer currently captures every byte sequence whose bytes fall within the ASCII range, whether or not the sequence is actual text. This leads to several quality issues:

  • Binary Noise: Random byte sequences that happen to fall within ASCII ranges but carry no semantic meaning
  • Padding Data: Repetitive null bytes, spaces, or other padding characters used for alignment
  • Table Data: Structured binary data (e.g., lookup tables, numerical arrays) that may contain ASCII-range values but aren't human-readable strings
  • Low Entropy Content: Sequences like repeated characters or patterns that don't represent meaningful text

Without proper filtering, these false positives pollute the extraction results, making it difficult for users to identify genuinely useful strings and significantly reducing the tool's practical value.

Proposed Solution

Implement a multi-layered heuristic filtering system in Rust:

1. Character Distribution Analysis

  • Calculate character frequency distributions
  • Flag strings with abnormal character ratios (e.g., >80% punctuation, all same character)
  • Apply entropy calculations to identify low-information content
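
The checks above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names and the `MIN_ENTROPY` cutoff are hypothetical, and only the 80% punctuation figure comes from this issue. It uses only the standard library.

```rust
use std::collections::HashMap;

/// Hypothetical cutoff in bits per byte; a real value would need tuning.
const MIN_ENTROPY: f64 = 1.0;
/// Punctuation ratio taken from the ">80% punctuation" example in this issue.
const MAX_PUNCTUATION_RATIO: f64 = 0.8;

/// Shannon entropy of `s` in bits per byte (0.0 for an empty string).
pub fn shannon_entropy(s: &str) -> f64 {
    if s.is_empty() {
        return 0.0;
    }
    // Build the byte frequency distribution.
    let mut counts: HashMap<u8, usize> = HashMap::new();
    for &b in s.as_bytes() {
        *counts.entry(b).or_insert(0) += 1;
    }
    let len = s.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / len;
            -p * p.log2()
        })
        .sum()
}

/// Fraction of bytes that are ASCII punctuation (0.0 for an empty string).
pub fn punctuation_ratio(s: &str) -> f64 {
    if s.is_empty() {
        return 0.0;
    }
    let n = s.bytes().filter(|b| b.is_ascii_punctuation()).count();
    n as f64 / s.len() as f64
}

/// True when the string is non-empty and every byte is identical.
pub fn is_single_char_run(s: &str) -> bool {
    let mut bytes = s.bytes();
    match bytes.next() {
        Some(first) => bytes.all(|b| b == first),
        None => false,
    }
}

/// Combines the three checks into one noise flag.
pub fn looks_like_noise(s: &str) -> bool {
    is_single_char_run(s)
        || punctuation_ratio(s) > MAX_PUNCTUATION_RATIO
        || shannon_entropy(s) < MIN_ENTROPY
}
```

Very short strings naturally have low entropy, so an entropy cutoff like this would probably need to be combined with a minimum length or scaled by string length.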

2. Linguistic Patterns

  • Check for reasonable vowel-to-consonant ratios (for English-like strings)
  • Identify common word patterns and character bigrams/trigrams
  • Validate against common string patterns (paths, URLs, error messages)
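
As a rough sketch of the vowel-ratio and pattern checks listed above (bigram/trigram analysis is omitted for brevity): the names `vowel_ratio`, `looks_english_like`, `KnownPattern`, and `classify_pattern` are hypothetical, and the 0.2 to 0.6 vowel band is an assumed placeholder, not a measured value.

```rust
/// Ratio of vowels to ASCII letters, or `None` when the string has no letters.
pub fn vowel_ratio(s: &str) -> Option<f64> {
    let letters = s.bytes().filter(|b| b.is_ascii_alphabetic()).count();
    if letters == 0 {
        return None;
    }
    let vowels = s
        .bytes()
        .filter(|b| matches!(b.to_ascii_lowercase(), b'a' | b'e' | b'i' | b'o' | b'u'))
        .count();
    Some(vowels as f64 / letters as f64)
}

/// True when the vowel ratio falls inside an assumed "English-like" band.
pub fn looks_english_like(s: &str) -> bool {
    matches!(vowel_ratio(s), Some(r) if (0.2..=0.6).contains(&r))
}

/// A few common string shapes worth keeping even if they look unusual.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KnownPattern {
    Url,
    UnixPath,
    WindowsPath,
}

/// Very loose prefix-based classification; a real implementation would be stricter.
pub fn classify_pattern(s: &str) -> Option<KnownPattern> {
    if s.starts_with("http://") || s.starts_with("https://") {
        return Some(KnownPattern::Url);
    }
    let b = s.as_bytes();
    // Drive letter, colon, backslash: e.g. C:\Windows
    if b.len() >= 3 && b[0].is_ascii_alphabetic() && b[1] == b':' && b[2] == b'\\' {
        return Some(KnownPattern::WindowsPath);
    }
    // Leading slash followed by at least one alphanumeric byte.
    if s.starts_with('/') && s.bytes().skip(1).any(|c| c.is_ascii_alphanumeric()) {
        return Some(KnownPattern::UnixPath);
    }
    None
}
```

A vowel-ratio check only makes sense for English-like text, so identifiers, hex strings, and abbreviations would likely need to be exempted through the pattern checks or the scoring weights.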

3. Context-Aware Filtering

  • Leverage binary section information (.text, .data, .rodata, etc.)
  • Apply different heuristics based on section type
  • Consider string location and surrounding bytes
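
One way section context might feed into filtering is to map section names to a coarse kind and require more confidence in sections that rarely hold text. The sketch below assumes the section name is already available as a string (for example from a binary parser); `SectionKind`, `section_kind`, `min_confidence_for`, and all threshold values are hypothetical.

```rust
/// Coarse classification of a binary section.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SectionKind {
    Code,
    ReadOnlyData,
    WritableData,
    Other,
}

/// Maps a few well-known ELF, PE, and Mach-O section names to a kind.
pub fn section_kind(name: &str) -> SectionKind {
    match name {
        ".text" | "__text" => SectionKind::Code,
        ".rodata" | ".rdata" | "__cstring" => SectionKind::ReadOnlyData,
        ".data" | "__data" => SectionKind::WritableData,
        _ => SectionKind::Other,
    }
}

/// Minimum confidence a string must reach to be kept, per section kind.
/// The numbers are illustrative placeholders, not tuned values.
pub fn min_confidence_for(kind: SectionKind) -> f64 {
    match kind {
        // Strings found in executable code are more likely to be noise.
        SectionKind::Code => 0.8,
        // Read-only data is where string literals usually live.
        SectionKind::ReadOnlyData => 0.3,
        SectionKind::WritableData => 0.5,
        SectionKind::Other => 0.6,
    }
}
```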

4. Configurable Thresholds

  • Allow users to adjust sensitivity levels
  • Provide options to include/exclude specific pattern types
  • Support allowlist/denylist patterns
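
A possible shape for the configuration, shown only as a sketch: `Sensitivity`, `FilterConfig`, `ListDecision`, and every default value here are hypothetical. Patterns are treated as plain substrings to keep the example dependency-free, and the allowlist is assumed to take precedence over the denylist.

```rust
/// User-facing sensitivity presets.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Sensitivity {
    Low,
    Medium,
    High,
}

/// Tunable filter parameters.
#[derive(Debug, Clone)]
pub struct FilterConfig {
    pub min_confidence: f64,
    pub max_punctuation_ratio: f64,
    pub allowlist: Vec<String>,
    pub denylist: Vec<String>,
}

/// Outcome of checking a string against the allowlist and denylist.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ListDecision {
    Allow,
    Deny,
    Undecided,
}

impl FilterConfig {
    /// Builds a config from a preset; higher sensitivity filters more aggressively.
    pub fn with_sensitivity(level: Sensitivity) -> Self {
        let min_confidence = match level {
            Sensitivity::Low => 0.3,
            Sensitivity::Medium => 0.5,
            Sensitivity::High => 0.7,
        };
        Self {
            min_confidence,
            max_punctuation_ratio: 0.8,
            allowlist: Vec::new(),
            denylist: Vec::new(),
        }
    }

    /// Allowlist matches win over denylist matches; no match leaves the
    /// decision to the heuristics.
    pub fn list_decision(&self, s: &str) -> ListDecision {
        if self.allowlist.iter().any(|p| s.contains(p.as_str())) {
            ListDecision::Allow
        } else if self.denylist.iter().any(|p| s.contains(p.as_str())) {
            ListDecision::Deny
        } else {
            ListDecision::Undecided
        }
    }
}

impl Default for FilterConfig {
    fn default() -> Self {
        Self::with_sensitivity(Sensitivity::Medium)
    }
}
```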

5. Scoring System

  • Assign confidence scores to extracted strings
  • Allow filtering by minimum confidence threshold
  • Provide transparency into why strings were included/excluded
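
To make the scoring idea concrete, here is a minimal sketch in which each failed check subtracts a penalty and records a reason, so the final verdict explains itself. The `Verdict` type, the `score_string` function, and all penalty weights are hypothetical.

```rust
/// A confidence score in [0.0, 1.0] plus the reasons that lowered it.
#[derive(Debug, Clone, PartialEq)]
pub struct Verdict {
    pub score: f64,
    pub reasons: Vec<&'static str>,
}

impl Verdict {
    /// True when the score reaches the caller's minimum confidence.
    pub fn passes(&self, min_confidence: f64) -> bool {
        self.score >= min_confidence
    }
}

/// Scores a candidate string, starting at 1.0 and subtracting penalties.
pub fn score_string(s: &str) -> Verdict {
    let bytes = s.as_bytes();
    if bytes.is_empty() {
        return Verdict { score: 0.0, reasons: vec!["empty string"] };
    }

    let mut score = 1.0_f64;
    let mut reasons = Vec::new();

    if bytes.iter().all(|&b| b == bytes[0]) {
        score -= 0.6;
        reasons.push("single repeated character");
    }

    let punct = bytes.iter().filter(|b| b.is_ascii_punctuation()).count();
    if punct as f64 / bytes.len() as f64 > 0.8 {
        score -= 0.5;
        reasons.push("mostly punctuation");
    }

    if !bytes.iter().any(|b| b.is_ascii_alphabetic()) {
        score -= 0.3;
        reasons.push("no letters");
    }

    Verdict { score: score.clamp(0.0, 1.0), reasons }
}
```

Keeping the reasons alongside the score means a verbose or debug output mode could show why a given string was dropped.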

Acceptance Criteria

  • Implement character distribution analysis functions
  • Implement linguistic pattern detection
  • Integrate section context into filtering decisions
  • Add configurable threshold parameters
  • Implement confidence scoring system
  • Reduce false positive rate by >70% on test binaries
  • Maintain >95% true positive retention
  • Add comprehensive unit tests for each heuristic
  • Document heuristic algorithms and tuning parameters
  • Benchmark performance impact (<10% overhead target)
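
The 70% and 95% targets could be checked against a labeled test corpus along these lines. This sketch assumes the false positive target means a relative reduction in false positive count compared with unfiltered extraction, which the issue does not state explicitly; `Counts` and the function names are hypothetical.

```rust
/// Labeled extraction results for one run over the test binaries.
#[derive(Debug, Clone, Copy)]
pub struct Counts {
    pub true_positives: usize,
    pub false_positives: usize,
}

/// Fraction of baseline false positives removed by filtering.
/// Returns `None` when the baseline has no false positives.
pub fn false_positive_reduction(baseline: Counts, filtered: Counts) -> Option<f64> {
    if baseline.false_positives == 0 {
        return None;
    }
    let removed = baseline
        .false_positives
        .saturating_sub(filtered.false_positives);
    Some(removed as f64 / baseline.false_positives as f64)
}

/// Fraction of baseline true positives still present after filtering.
/// Returns `None` when the baseline has no true positives.
pub fn true_positive_retention(baseline: Counts, filtered: Counts) -> Option<f64> {
    if baseline.true_positives == 0 {
        return None;
    }
    Some(filtered.true_positives as f64 / baseline.true_positives as f64)
}

/// Checks both acceptance targets: >70% reduction and >95% retention.
pub fn meets_targets(baseline: Counts, filtered: Counts) -> bool {
    let fp_ok = matches!(false_positive_reduction(baseline, filtered), Some(r) if r > 0.70);
    let tp_ok = matches!(true_positive_retention(baseline, filtered), Some(r) if r > 0.95);
    fp_ok && tp_ok
}
```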

Technical Notes

  • Consider using the entropy crate for information theory calculations
  • Section context can be obtained from ELF/PE/Mach-O parsers (e.g., goblin crate)
  • Performance-critical code should be profiled and optimized
  • Heuristics should be modular to allow easy addition/removal
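
Modularity could be achieved with a small trait that each heuristic implements, so heuristics can be added or removed without touching the others. The `Heuristic` trait, the two example heuristics, and the `Pipeline` type below are all hypothetical, as are the penalty values.

```rust
/// A single pluggable check. Returns a penalty in [0.0, 1.0];
/// 0.0 means the heuristic has no objection to the string.
pub trait Heuristic {
    fn name(&self) -> &'static str;
    fn penalty(&self, s: &str) -> f64;
}

/// Penalizes strings made of one repeated byte.
pub struct RepeatedChar;

impl Heuristic for RepeatedChar {
    fn name(&self) -> &'static str {
        "repeated-char"
    }
    fn penalty(&self, s: &str) -> f64 {
        let b = s.as_bytes();
        if !b.is_empty() && b.iter().all(|&x| x == b[0]) { 0.6 } else { 0.0 }
    }
}

/// Penalizes strings that contain no ASCII letters.
pub struct NoLetters;

impl Heuristic for NoLetters {
    fn name(&self) -> &'static str {
        "no-letters"
    }
    fn penalty(&self, s: &str) -> f64 {
        if s.bytes().any(|b| b.is_ascii_alphabetic()) { 0.0 } else { 0.3 }
    }
}

/// An ordered collection of heuristics applied to every candidate string.
#[derive(Default)]
pub struct Pipeline {
    heuristics: Vec<Box<dyn Heuristic>>,
}

impl Pipeline {
    pub fn new() -> Self {
        Self::default()
    }

    /// Registers another heuristic (builder style).
    pub fn with(mut self, h: impl Heuristic + 'static) -> Self {
        self.heuristics.push(Box::new(h));
        self
    }

    /// Confidence score: 1.0 minus the summed penalties, clamped to [0.0, 1.0].
    pub fn score(&self, s: &str) -> f64 {
        let total: f64 = self.heuristics.iter().map(|h| h.penalty(s)).sum();
        (1.0 - total).clamp(0.0, 1.0)
    }
}
```

With this shape, a pipeline might be built as `Pipeline::new().with(RepeatedChar).with(NoLetters)`, and disabling a heuristic is just a matter of not registering it.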

Requirements

1.4, 2.1

Dependencies

Task-ID

stringy-analyzer/ascii-noise-filtering
