Implement ASCII String Noise Filtering with Intelligent Heuristics

## Summary

Implement intelligent heuristics to distinguish legitimate ASCII strings from binary noise, padding, and non-textual data during string extraction from binary files.

## Problem Context

When extracting ASCII strings from binary files, the analyzer currently captures every byte sequence that falls within the ASCII range. This leads to several quality issues:

- **Binary Noise**: Random byte sequences that happen to fall within ASCII ranges but carry no semantic meaning
- **Padding Data**: Repetitive null bytes, spaces, or other padding characters used for alignment
- **Table Data**: Structured binary data (e.g., lookup tables, numerical arrays) that may contain ASCII-range values but aren't human-readable strings
- **Low Entropy Content**: Sequences like repeated characters or patterns that don't represent meaningful text

Without filtering, these false positives pollute the extraction results, making genuinely useful strings hard to find and reducing the tool's practical value.

## Proposed Solution

Implement a multi-layered heuristic filtering system in Rust:

### 1. **Character Distribution Analysis**
- Calculate character frequency distributions
- Flag strings with abnormal character ratios (e.g., >80% punctuation, all same character)
- Apply entropy calculations to identify low-information content
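
A minimal sketch of these checks, assuming candidates arrive as byte slices. The function names and the 1.0-bit entropy floor are illustrative assumptions, not values taken from the project. The 0.8 punctuation limit mirrors the example above.

```rust
/// Shannon entropy in bits per byte; returns 0.0 for empty input.
pub fn shannon_entropy(s: &[u8]) -> f64 {
    if s.is_empty() {
        return 0.0;
    }
    // Byte frequency table.
    let mut counts = [0usize; 256];
    for &b in s {
        counts[b as usize] += 1;
    }
    let len = s.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / len;
            -p * p.log2()
        })
        .sum()
}

/// Fraction of bytes that are ASCII punctuation.
pub fn punctuation_ratio(s: &[u8]) -> f64 {
    if s.is_empty() {
        return 0.0;
    }
    let n = s.iter().filter(|b| b.is_ascii_punctuation()).count();
    n as f64 / s.len() as f64
}

/// True when every byte equals the first one (e.g. "AAAAAAAA").
pub fn is_single_char_run(s: &[u8]) -> bool {
    match s.first() {
        Some(&first) => s.iter().all(|&b| b == first),
        None => false,
    }
}

/// Combines the three checks. Both thresholds are assumed starting points.
pub fn looks_like_noise(s: &[u8]) -> bool {
    is_single_char_run(s) || punctuation_ratio(s) > 0.8 || shannon_entropy(s) < 1.0
}
```

With these assumed thresholds, `looks_like_noise(b"AAAAAAAA")` is true and `looks_like_noise(b"Hello, world!")` is false. Very short strings have inherently low entropy, so the entropy floor would likely need to scale with length in practice.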

### 2. **Linguistic Patterns**
- Check for reasonable vowel-to-consonant ratios (for English-like strings)
- Identify common word patterns and character bigrams/trigrams
- Validate against common string patterns (paths, URLs, error messages)
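
The linguistic checks could look roughly like the sketch below. The bigram list is a small illustrative sample of frequent English bigrams, and the path/URL prefixes are assumed examples. None of these names come from the existing codebase.

```rust
/// Share of ASCII letters that are vowels, or None when there are no letters.
pub fn vowel_ratio(s: &str) -> Option<f64> {
    let mut vowels = 0usize;
    let mut letters = 0usize;
    for c in s.chars().filter(|c| c.is_ascii_alphabetic()) {
        letters += 1;
        if matches!(c.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u') {
            vowels += 1;
        }
    }
    if letters == 0 {
        None
    } else {
        Some(vowels as f64 / letters as f64)
    }
}

/// Illustrative sample of frequent English bigrams (assumption, not exhaustive).
const COMMON_BIGRAMS: [&str; 10] = [
    "th", "he", "in", "er", "an", "re", "on", "at", "en", "nd",
];

/// Counts overlapping two-byte windows that match a common bigram.
pub fn common_bigram_hits(s: &str) -> usize {
    let lower = s.to_ascii_lowercase();
    lower
        .as_bytes()
        .windows(2)
        .filter(|w| COMMON_BIGRAMS.iter().any(|b| b.as_bytes() == *w))
        .count()
}

/// Cheap prefix checks for URLs, Unix paths, and Windows drive paths.
pub fn looks_like_known_pattern(s: &str) -> bool {
    s.starts_with("http://")
        || s.starts_with("https://")
        || s.starts_with('/')
        || s.contains(":\\")
}
```

For example, `vowel_ratio("hello")` yields `Some(0.4)`, while a consonant-only string such as `"xkqzj"` yields `Some(0.0)` and matches no bigram. Identifiers, hex digests, and non-English text will fail these checks legitimately, so they are better used as score inputs than as hard rejections.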

### 3. **Context-Aware Filtering**
- Leverage binary section information (.text, .data, .rodata, etc.)
- Apply different heuristics based on section type
- Consider string location and surrounding bytes
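
One possible shape for section-dependent rules is sketched below. It assumes the section name has already been read by whichever binary parser the project uses. The `SectionKind` enum and the per-section minimum lengths are hypothetical values chosen for illustration.

```rust
/// Coarse section categories (hypothetical; not an existing project type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SectionKind {
    Code,
    ReadOnlyData,
    WritableData,
    Other,
}

/// Maps a few well-known ELF, PE, and Mach-O section names to a category.
pub fn classify_section(name: &str) -> SectionKind {
    match name {
        ".text" | "__text" => SectionKind::Code,
        ".rodata" | ".rdata" | "__cstring" | "__const" => SectionKind::ReadOnlyData,
        ".data" | "__data" => SectionKind::WritableData,
        _ => SectionKind::Other,
    }
}

/// Stricter minimum length where noise is more likely (assumed numbers).
pub fn min_length_for(kind: SectionKind) -> usize {
    match kind {
        SectionKind::Code => 8,
        SectionKind::ReadOnlyData => 4,
        SectionKind::WritableData => 5,
        SectionKind::Other => 6,
    }
}

/// Applies the section-specific minimum length to one candidate.
pub fn passes_section_filter(section_name: &str, candidate: &[u8]) -> bool {
    candidate.len() >= min_length_for(classify_section(section_name))
}
```

Under these assumed limits, a four-byte candidate is kept in `.rodata` but dropped in `.text`, where short ASCII-range runs are more often instruction bytes than text.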

### 4. **Configurable Thresholds**
- Allow users to adjust sensitivity levels
- Provide options to include/exclude specific pattern types
- Support allowlist/denylist patterns
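
As a sketch of how the tunable parameters might be grouped, the struct below bundles thresholds with allowlist and denylist entries. Field names and default values are assumptions, and the list entries are treated as plain substrings rather than regular expressions to keep the example dependency-free.

```rust
/// Hypothetical configuration for the noise filter.
#[derive(Debug, Clone, PartialEq)]
pub struct FilterConfig {
    pub min_confidence: f64,
    pub max_punctuation_ratio: f64,
    pub min_entropy_bits: f64,
    /// Substrings that always keep a candidate.
    pub allowlist: Vec<String>,
    /// Substrings that always drop a candidate.
    pub denylist: Vec<String>,
}

impl Default for FilterConfig {
    fn default() -> Self {
        FilterConfig {
            min_confidence: 0.5,
            max_punctuation_ratio: 0.8,
            min_entropy_bits: 1.0,
            allowlist: Vec::new(),
            denylist: Vec::new(),
        }
    }
}

/// Outcome of the list check, evaluated before any heuristic runs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    ForceKeep,
    ForceDrop,
    Undecided,
}

impl FilterConfig {
    /// The allowlist takes precedence over the denylist in this sketch.
    pub fn list_verdict(&self, s: &str) -> Verdict {
        if self.allowlist.iter().any(|p| s.contains(p.as_str())) {
            Verdict::ForceKeep
        } else if self.denylist.iter().any(|p| s.contains(p.as_str())) {
            Verdict::ForceDrop
        } else {
            Verdict::Undecided
        }
    }
}
```

Giving the allowlist precedence is a design choice made here for predictability; the opposite ordering is equally defensible and should be documented whichever way it is decided.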

### 5. **Scoring System**
- Assign confidence scores to extracted strings
- Allow filtering by minimum confidence threshold
- Provide transparency into why strings were included/excluded
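
A scoring pass with recorded reasons might be sketched as follows. The starting score, the adjustments, and the reason strings are all illustrative assumptions. A real implementation would derive them from the heuristics described in the earlier sections.

```rust
/// A candidate string with its confidence and the reasons behind it.
#[derive(Debug, Clone, PartialEq)]
pub struct Scored {
    pub text: String,
    pub confidence: f64,
    pub reasons: Vec<&'static str>,
}

/// Starts from a neutral 0.5 and applies small, assumed adjustments.
pub fn score(text: &str) -> Scored {
    let mut confidence: f64 = 0.5;
    let mut reasons = Vec::new();
    let bytes = text.as_bytes();

    if !bytes.is_empty() && bytes.iter().all(|&b| b == bytes[0]) {
        confidence -= 0.4;
        reasons.push("single repeated character");
    }
    // At least 70% letters plus a space suggests natural-language text.
    let letters = text.chars().filter(|c| c.is_ascii_alphabetic()).count();
    if text.contains(' ') && letters * 10 >= text.len() * 7 {
        confidence += 0.3;
        reasons.push("space-separated words");
    }
    if text.len() >= 8 {
        confidence += 0.1;
        reasons.push("length >= 8");
    }

    Scored {
        text: text.to_string(),
        confidence: confidence.clamp(0.0, 1.0),
        reasons,
    }
}

/// Keeps only candidates at or above the minimum confidence.
pub fn filter_by_confidence(items: Vec<Scored>, min: f64) -> Vec<Scored> {
    items.into_iter().filter(|s| s.confidence >= min).collect()
}
```

Here `score("AAAAAAAA")` ends below 0.5 with the reason `"single repeated character"`, while `score("connection refused")` ends above 0.8. Surfacing the `reasons` vector in verbose output is one way to meet the transparency goal.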

## Acceptance Criteria

- [ ] Implement character distribution analysis functions
- [ ] Implement linguistic pattern detection
- [ ] Integrate section context into filtering decisions
- [ ] Add configurable threshold parameters
- [ ] Implement confidence scoring system
- [ ] Reduce false positive rate by >70% on test binaries
- [ ] Maintain >95% true positive retention
- [ ] Add comprehensive unit tests for each heuristic
- [ ] Document heuristic algorithms and tuning parameters
- [ ] Benchmark performance impact (<10% overhead target)
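
The two quantitative criteria could be checked with helpers like the ones below. This sketch assumes a labeled test corpus and interprets "false positive rate reduction" as the relative drop in noise strings emitted before versus after filtering. That interpretation and the function names are assumptions.

```rust
/// Relative reduction in emitted noise strings; None if there was no noise.
pub fn false_positive_reduction(noise_before: usize, noise_after: usize) -> Option<f64> {
    if noise_before == 0 {
        return None;
    }
    Some(1.0 - noise_after as f64 / noise_before as f64)
}

/// Fraction of genuine strings that survive filtering; None if there were none.
pub fn true_positive_retention(real_before: usize, real_after: usize) -> Option<f64> {
    if real_before == 0 {
        return None;
    }
    Some(real_after as f64 / real_before as f64)
}

/// Checks the >70% reduction and >95% retention targets from the criteria.
pub fn meets_targets(fp_reduction: f64, tp_retention: f64) -> bool {
    fp_reduction > 0.70 && tp_retention > 0.95
}
```

For instance, dropping from 100 to 25 noise strings is a 0.75 reduction, and keeping 192 of 200 genuine strings is 0.96 retention, which together satisfy both targets.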

## Technical Notes

- Consider using the `entropy` crate for information theory calculations
- Section context can be obtained from ELF/PE/Mach-O parsers (e.g., `goblin` crate)
- Performance-critical code should be profiled and optimized
- Heuristics should be modular to allow easy addition/removal
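
One way to keep heuristics modular is a small trait plus a pipeline of boxed implementations, sketched below. The `Heuristic` trait, the two example heuristics, and their score adjustments are hypothetical and only illustrate how checks could be added or removed independently.

```rust
/// A pluggable check that returns a score adjustment for one candidate.
pub trait Heuristic {
    fn name(&self) -> &'static str;
    /// Negative values penalize the candidate; 0.0 means no opinion.
    fn evaluate(&self, candidate: &[u8]) -> f64;
}

/// Penalizes candidates of two or more bytes made of one repeated byte.
pub struct RepeatedByte;

impl Heuristic for RepeatedByte {
    fn name(&self) -> &'static str {
        "repeated-byte"
    }
    fn evaluate(&self, candidate: &[u8]) -> f64 {
        match candidate.split_first() {
            Some((first, rest)) if !rest.is_empty() && rest.iter().all(|b| b == first) => -1.0,
            _ => 0.0,
        }
    }
}

/// Penalizes candidates shorter than the wrapped length.
pub struct MinLength(pub usize);

impl Heuristic for MinLength {
    fn name(&self) -> &'static str {
        "min-length"
    }
    fn evaluate(&self, candidate: &[u8]) -> f64 {
        if candidate.len() < self.0 { -0.5 } else { 0.0 }
    }
}

/// Ordered collection of heuristics whose adjustments are summed.
pub struct Pipeline {
    heuristics: Vec<Box<dyn Heuristic>>,
}

impl Pipeline {
    pub fn new() -> Self {
        Pipeline { heuristics: Vec::new() }
    }
    /// Builder-style registration of a heuristic.
    pub fn with(mut self, h: Box<dyn Heuristic>) -> Self {
        self.heuristics.push(h);
        self
    }
    /// Removes every heuristic registered under `name`.
    pub fn remove(&mut self, name: &str) {
        self.heuristics.retain(|h| h.name() != name);
    }
    pub fn total(&self, candidate: &[u8]) -> f64 {
        self.heuristics.iter().map(|h| h.evaluate(candidate)).sum()
    }
}
```

A pipeline built with `Pipeline::new().with(Box::new(RepeatedByte)).with(Box::new(MinLength(4)))` penalizes `b"AAAA"`, and calling `remove("repeated-byte")` drops that penalty without touching the other heuristic. Dynamic dispatch keeps the example short; if profiling shows it matters, an enum or generics could replace the boxed trait objects.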

## Requirements

1.4, 2.1

## Dependencies

- **Blocked by**: Basic ASCII String Extraction (#9)

## Task-ID

`stringy-analyzer/ascii-noise-filtering`

