Summary
Implement the foundational ASCII string extraction module that scans binary data for printable ASCII character sequences (0x20-0x7E) and returns them as FoundString objects with configurable minimum length filtering.
Context
ASCII extraction is the foundational encoding type for StringyMcStringFace's string extraction pipeline. Unlike the traditional strings command, which blindly extracts all printable sequences, this implementation will:
- Be section-aware (integrate with SectionInfo from the container parsing)
- Return properly structured FoundString objects with metadata (offset, RVA, section name, encoding type)
- Support configurable minimum length to reduce noise from random byte sequences
- Serve as the reference implementation for future encodings (UTF-8, UTF-16LE, UTF-16BE)
The ASCII extractor will be the first concrete implementation of the string extraction framework and will be used by all binary formats (ELF, PE, Mach-O).
Requirements
- Requirement 2.1: Implement basic string extraction for ASCII encoding
- Must scan byte sequences for contiguous printable ASCII characters (0x20-0x7E)
- Must support configurable minimum length threshold (default: 4 characters)
- Must return FoundString objects as defined in src/types.rs
- Must properly populate metadata fields: offset, section name, encoding, length, source
- Must handle section boundaries correctly (don't span strings across sections)
Proposed Solution
File Structure
Create src/extraction/ascii.rs with the following components:
Core Functions
- extract_ascii_strings(data: &[u8], config: &ExtractionConfig) -> Vec<FoundString>
  - Main extraction function
  - Scans byte slice for printable ASCII runs
  - Filters by minimum length
  - Returns vector of FoundString objects
- is_printable_ascii(byte: u8) -> bool
  - Helper to check if byte is in printable range (0x20-0x7E)
  - Inline for performance
- extract_from_section(section: &SectionInfo, data: &[u8], config: &ExtractionConfig) -> Vec<FoundString>
  - Section-aware extraction wrapper
  - Calculates correct offsets and RVAs
  - Populates section metadata
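As an illustration of the helper described above, a minimal sketch might look like the following. It uses a range match rather than the standard library's u8::is_ascii_graphic, because that method covers 0x21-0x7E and would therefore reject the space character (0x20), which this issue counts as printable.

```rust
/// Returns true for bytes in the printable ASCII range 0x20..=0x7E
/// (space through tilde). Control bytes and DEL (0x7F) are rejected.
#[inline]
pub fn is_printable_ascii(byte: u8) -> bool {
    matches!(byte, 0x20..=0x7E)
}
```

Tabs and newlines fall outside this range, so they end a run just like any other non-printable byte.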
Configuration Structure
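pub struct ExtractionConfig {
    /// Minimum number of characters a run needs in order to be reported.
    pub min_length: usize,
    /// Optional upper bound on the length of a reported string.
    pub max_length: Option<usize>,
    // Future: encoding preferences, tag filters, etc.
}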
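One possible way to encode the default minimum length of 4 from the requirements is a Default implementation. This is only a sketch: the derives and the choice of None as the default for max_length are assumptions, not something this issue specifies.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ExtractionConfig {
    pub min_length: usize,
    pub max_length: Option<usize>,
}

impl Default for ExtractionConfig {
    fn default() -> Self {
        Self {
            // Default threshold taken from the Requirements section.
            min_length: 4,
            // Assumption: no upper bound unless the caller sets one.
            max_length: None,
        }
    }
}
```

Callers could then override a single field with struct update syntax, for example ExtractionConfig { min_length: 6, ..Default::default() }.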
Algorithm
- Iterate through byte slice
- Track current string start position and length
- When encountering non-printable byte:
  - If accumulated length >= min_length, create FoundString
  - Reset accumulator
- Handle end-of-buffer edge case
- Calculate offsets (file offset + buffer start)
- Set encoding to Encoding::Ascii
- Set source to StringSource::SectionData
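The steps above could be sketched roughly as follows. FoundString, Encoding, and StringSource here are reduced, hypothetical stand-ins, since the real definitions live in src/types.rs and are not shown in this issue. Dropping runs longer than max_length (rather than truncating or splitting them) is also an assumption, and offsets are relative to the start of the buffer.

```rust
// Hypothetical stand-ins for the types in src/types.rs; the real
// definitions likely carry more fields (RVA, section name, ...).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding { Ascii }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StringSource { SectionData }

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FoundString {
    pub text: String,
    pub offset: u64,
    pub length: usize,
    pub encoding: Encoding,
    pub source: StringSource,
}

pub struct ExtractionConfig {
    pub min_length: usize,
    pub max_length: Option<usize>,
}

fn is_printable_ascii(byte: u8) -> bool {
    matches!(byte, 0x20..=0x7E)
}

pub fn extract_ascii_strings(data: &[u8], config: &ExtractionConfig) -> Vec<FoundString> {
    let mut found = Vec::new();
    let mut start: Option<usize> = None;

    // Iterate one index past the end so a run that reaches the end of the
    // buffer is flushed by the same branch as any other run.
    for i in 0..=data.len() {
        let printable = i < data.len() && is_printable_ascii(data[i]);
        match (printable, start) {
            (true, None) => start = Some(i),
            (false, Some(s)) => {
                let len = i - s;
                // Assumption: runs above max_length are dropped entirely.
                let within_max = match config.max_length {
                    Some(max) => len <= max,
                    None => true,
                };
                if len >= config.min_length && within_max {
                    found.push(FoundString {
                        // Every byte is in 0x20..=0x7E, so this conversion is exact.
                        text: String::from_utf8_lossy(&data[s..i]).into_owned(),
                        offset: s as u64,
                        length: len,
                        encoding: Encoding::Ascii,
                        source: StringSource::SectionData,
                    });
                }
                start = None;
            }
            _ => {}
        }
    }
    found
}
```

With min_length set to 4, the input b"\x00abc\x00hello\x01world!!" would yield "hello" at offset 5 and "world!!" at offset 11, while the three-byte run "abc" is filtered out.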
Edge Cases to Handle
- Empty sections or zero-length data
- Strings at section boundaries
- Very long continuous runs (potential padding or data tables)
- Null terminators within printable sequences
- Sections smaller than minimum length
- Buffer boundaries
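Several of these cases (empty sections, strings at section boundaries, sections shorter than the minimum length) could be handled by clamping each section to its own byte range before scanning. The sketch below assumes that data is the whole file buffer and that SectionInfo exposes a name, a file offset, a size, and an optional RVA. Those field names, and the reduced FoundString, are hypothetical stand-ins for the real types.

```rust
// Hypothetical stand-in for SectionInfo; field names are assumptions.
pub struct SectionInfo {
    pub name: String,
    pub offset: u64,      // file offset of the section's first byte
    pub size: u64,        // number of bytes the section occupies in the file
    pub rva: Option<u64>, // address of the section start when mapped, if any
}

// Hypothetical, reduced stand-in for FoundString.
#[derive(Debug, PartialEq, Eq)]
pub struct FoundString {
    pub text: String,
    pub offset: u64,
    pub rva: Option<u64>,
    pub section: Option<String>,
}

pub struct ExtractionConfig {
    pub min_length: usize,
    pub max_length: Option<usize>,
}

/// Clamp the section's byte range to the file buffer, so a truncated or
/// out-of-range section yields a shorter (possibly empty) slice.
fn section_bytes<'a>(section: &SectionInfo, file: &'a [u8]) -> &'a [u8] {
    let len = file.len() as u64;
    let start = section.offset.min(len);
    let end = section.offset.saturating_add(section.size).min(len);
    &file[start as usize..end as usize]
}

pub fn extract_from_section(
    section: &SectionInfo,
    data: &[u8],
    config: &ExtractionConfig,
) -> Vec<FoundString> {
    // Only this section's bytes are scanned, so no run can span two sections.
    let bytes = section_bytes(section, data);
    let mut found = Vec::new();
    let mut run_start = 0usize;

    for i in 0..=bytes.len() {
        if i < bytes.len() && matches!(bytes[i], 0x20..=0x7E) {
            continue; // still inside a printable run
        }
        // Here the run is bytes[run_start..i]; it may be empty.
        let len = i - run_start;
        let within_max = match config.max_length {
            Some(max) => len <= max,
            None => true,
        };
        if len > 0 && len >= config.min_length && within_max {
            let rel = run_start as u64; // offset relative to the section start
            found.push(FoundString {
                text: String::from_utf8_lossy(&bytes[run_start..i]).into_owned(),
                offset: section.offset + rel,
                rva: section.rva.map(|base| base + rel),
                section: Some(section.name.clone()),
            });
        }
        run_start = i + 1;
    }
    found
}
```

Under these assumptions, a printable run that straddles two adjacent sections is reported as two separate strings, one per section, each with an offset and RVA derived from its own section.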
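Acceptance Criteria
- src/extraction/ascii.rs created with extraction logic
- FoundString objects returned with all required fields populated
- Module exported in src/extraction/mod.rs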
Implementation Notes
- Start with simple implementation; optimize later if profiling shows bottlenecks
- Consider using SIMD or vectorization in future iterations for performance
- ASCII extraction should be the reference for implementing UTF-8, UTF-16LE, UTF-16BE
- Do not implement semantic tagging yet (that's a separate issue)
- Do not implement scoring yet (that's a separate issue)
Dependencies
Related Issues
Definition of Done
- Code passes cargo test
- Code passes cargo clippy with no warnings
- Unit test coverage >= 80%
- Module properly exported in extraction/mod.rs
- Inline documentation for public API
- Ready for integration with container parsers
Task-ID: stringy-analyzer/basic-ascii-string-extraction