Implement ASCII String Extractor with Configurable Length Filtering #9

@unclesp1d3r

Summary

Implement the foundational ASCII string extraction module that scans binary data for printable ASCII character sequences (0x20-0x7E) and returns them as FoundString objects with configurable minimum length filtering.

Context

ASCII extraction is the foundational encoding type for StringyMcStringFace's string extraction pipeline. Unlike the traditional strings command, which extracts every printable sequence without regard to file structure, this implementation will:

  • Be section-aware (integrate with SectionInfo from the container parsing)
  • Return properly structured FoundString objects with metadata (offset, RVA, section name, encoding type)
  • Support configurable minimum length to reduce noise from random byte sequences
  • Serve as the reference implementation for future encodings (UTF-8, UTF-16LE, UTF-16BE)

The ASCII extractor will be the first concrete implementation of the string extraction framework and will be used by all binary formats (ELF, PE, Mach-O).

Requirements

  • Requirement 2.1: Implement basic string extraction for ASCII encoding
  • Must scan byte sequences for contiguous printable ASCII characters (0x20-0x7E)
  • Must support configurable minimum length threshold (default: 4 characters)
  • Must return FoundString objects as defined in src/types.rs
  • Must properly populate metadata fields: offset, section name, encoding, length, source
  • Must handle section boundaries correctly (don't span strings across sections)

Proposed Solution

File Structure

Create src/extraction/ascii.rs with the following components:

Core Functions

  1. extract_ascii_strings(data: &[u8], config: &ExtractionConfig) -> Vec<FoundString>

    • Main extraction function
    • Scans byte slice for printable ASCII runs
    • Filters by minimum length
    • Returns vector of FoundString objects
  2. is_printable_ascii(byte: u8) -> bool

    • Helper to check if byte is in printable range (0x20-0x7E)
    • Inline for performance
  3. extract_from_section(section: &SectionInfo, data: &[u8], config: &ExtractionConfig) -> Vec<FoundString>

    • Section-aware extraction wrapper
    • Calculates correct offsets and RVAs
    • Populates section metadata
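
A minimal sketch of the is_printable_ascii helper described above, assuming a plain range check is acceptable. It is an illustration, not the final implementation:

```rust
/// Returns true for bytes in the printable ASCII range 0x20..=0x7E
/// (space through tilde), as required by this issue.
#[inline]
pub fn is_printable_ascii(byte: u8) -> bool {
    (0x20..=0x7E).contains(&byte)
}
```

Note that the standard library's u8::is_ascii_graphic covers 0x21..=0x7E and therefore rejects the space character, so it is not a drop-in replacement for this check.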

Configuration Structure

pub struct ExtractionConfig {
    pub min_length: usize,
    pub max_length: Option<usize>,
    // Future: encoding preferences, tag filters, etc.
}
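
One possible way to encode the default minimum length of 4 is a Default impl. This is a sketch: the struct is repeated only to keep the example self-contained, and defaulting max_length to None is an assumption, since this issue does not specify it.

```rust
pub struct ExtractionConfig {
    pub min_length: usize,
    pub max_length: Option<usize>,
}

impl Default for ExtractionConfig {
    fn default() -> Self {
        Self {
            // Default from the requirements: 4 characters.
            min_length: 4,
            // Assumption: no upper bound unless the caller sets one.
            max_length: None,
        }
    }
}
```

Callers could then override a single field with struct update syntax, for example ExtractionConfig { min_length: 8, ..Default::default() }.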

Algorithm

  1. Iterate through byte slice
  2. Track current string start position and length
  3. When encountering non-printable byte:
    • If accumulated length >= min_length, create FoundString
    • Reset accumulator
  4. Handle end-of-buffer edge case
  5. Calculate each string's file offset (file offset of the buffer + the string's start position within the buffer)
  6. Set encoding to Encoding::Ascii
  7. Set source to StringSource::SectionData
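
The steps above could be sketched roughly as follows. FoundString, Encoding and StringSource here are hypothetical stand-ins for the real definitions in src/types.rs, and their field names are assumptions. Offsets are relative to the start of the buffer, and max_length is left unapplied because this issue does not define its semantics.

```rust
// Hypothetical stand-ins for the types in src/types.rs (field names assumed).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Encoding { Ascii }

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum StringSource { SectionData }

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FoundString {
    pub text: String,
    pub offset: u64, // relative to the start of `data` in this sketch
    pub length: usize,
    pub encoding: Encoding,
    pub source: StringSource,
}

pub struct ExtractionConfig {
    pub min_length: usize,
    pub max_length: Option<usize>, // not applied in this sketch
}

pub fn extract_ascii_strings(data: &[u8], config: &ExtractionConfig) -> Vec<FoundString> {
    let mut found = Vec::new();
    let mut start: Option<usize> = None;

    // Iterate one index past the end so a run that reaches the end of the
    // buffer is flushed by the same branch as any other run.
    for i in 0..=data.len() {
        let printable = i < data.len() && matches!(data[i], 0x20..=0x7E);
        match (start, printable) {
            // A printable byte outside a run starts a new run.
            (None, true) => start = Some(i),
            // A non-printable byte (or end of buffer) closes the current run.
            (Some(s), false) => {
                let len = i - s;
                if len >= config.min_length {
                    found.push(FoundString {
                        // Printable ASCII is valid UTF-8, so nothing is replaced here.
                        text: String::from_utf8_lossy(&data[s..i]).into_owned(),
                        offset: s as u64,
                        length: len,
                        encoding: Encoding::Ascii,
                        source: StringSource::SectionData,
                    });
                }
                start = None;
            }
            _ => {}
        }
    }
    found
}
```

Treating the end of the buffer as one more non-printable position avoids a separate flush step after the loop, which is where off-by-one mistakes tend to appear.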

Edge Cases to Handle

  • Empty sections or zero-length data
  • Strings at section boundaries
  • Very long continuous runs (potential padding or data tables)
  • Null bytes (0x00) between printable sequences (each one ends the current run)
  • Sections smaller than minimum length
  • Buffer boundaries
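
As an illustration of the section-related edge cases, here is a sketch of a section-aware wrapper. SectionInfo and SectionString are hypothetical, reduced stand-ins for the real types, and the signature is simplified (it takes the whole file and a bare min_length instead of ExtractionConfig). Scanning only the section's own byte range means a string cannot span two sections, and clamping the range to the file length covers empty or malformed sections.

```rust
// Hypothetical stand-in for the project's SectionInfo (fields assumed).
pub struct SectionInfo {
    pub name: String,
    pub file_offset: usize,
    pub size: usize,
    pub rva: u64,
}

// Hypothetical, reduced result record used only in this sketch.
#[derive(Debug, PartialEq, Eq)]
pub struct SectionString {
    pub section: String,
    pub file_offset: usize,
    pub rva: u64,
    pub text: String,
}

pub fn extract_from_section(
    section: &SectionInfo,
    file: &[u8],
    min_length: usize,
) -> Vec<SectionString> {
    // Clamp the section range to the file so a bad header cannot cause a panic.
    let start = section.file_offset.min(file.len());
    let end = start.saturating_add(section.size).min(file.len());
    let bytes = &file[start..end];

    let mut out = Vec::new();
    let mut pos = 0usize; // position of the current run within the section
    // Split on every non-printable byte; each separator is exactly one byte.
    for run in bytes.split(|b| !(0x20..=0x7E).contains(b)) {
        if run.len() >= min_length {
            out.push(SectionString {
                section: section.name.clone(),
                file_offset: start + pos,
                rva: section.rva + pos as u64,
                text: String::from_utf8_lossy(run).into_owned(),
            });
        }
        pos += run.len() + 1; // skip the run plus its one-byte separator
    }
    out
}
```

With two adjacent sections whose bytes are all printable, this yields one string per section, not a single merged string, and a section shorter than min_length yields nothing.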

Acceptance Criteria

  • src/extraction/ascii.rs created with extraction logic
  • Configurable minimum length parameter (default: 4)
  • Correctly identifies printable ASCII range (0x20-0x7E)
  • Returns FoundString objects with all required fields populated
  • Unit tests covering:
    • Basic extraction with default minimum length
    • Custom minimum length filtering
    • Edge case: empty input
    • Edge case: no strings found
    • Edge case: string at buffer start
    • Edge case: string at buffer end
    • Edge case: single character (below minimum)
    • Edge case: exact minimum length string
    • Offset calculation correctness
    • Section metadata population
  • Documentation with examples
  • Integrated into src/extraction/mod.rs

Implementation Notes

  • Start with simple implementation; optimize later if profiling shows bottlenecks
  • Consider using SIMD or vectorization in future iterations for performance
  • ASCII extraction should be the reference for implementing UTF-8, UTF-16LE, UTF-16BE
  • Do not implement semantic tagging yet (that's a separate issue)
  • Do not implement scoring yet (that's a separate issue)

Dependencies

Related Issues

Definition of Done

  • Code passes cargo test
  • Code passes cargo clippy with no warnings
  • Unit test coverage >= 80%
  • Module properly exported in extraction/mod.rs
  • Inline documentation for public API
  • Ready for integration with container parsers

Task-ID: stringy-analyzer/basic-ascii-string-extraction
