Background
StringyMcStringFace aims to be a smarter alternative to the strings command by leveraging format-specific knowledge to extract and prioritize meaningful strings from binaries. The container parsing layer (ELF, PE, Mach-O) is already implemented and provides structured metadata via ContainerInfo. Now we need the core extraction framework that will scan binary data and produce FoundString instances.
Current State
- ✅ Container parsers implemented (src/container/)
- ✅ Type definitions complete (FoundString, Encoding, SectionType, etc.)
- ❌ String extraction logic missing (src/extraction/mod.rs is empty)
- ❌ No extraction configuration framework
- ❌ No trait abstraction for different extraction strategies
Objectives
Create a flexible, trait-based extraction framework in src/extraction/mod.rs that:
- Defines the StringExtractor trait - An interface for different extraction strategies (basic ASCII, Unicode, context-aware, etc.)
- Implements ExtractionConfig - Configurable parameters for extraction behavior
- Provides a default extractor - A basic implementation to get started
Proposed Solution
1. StringExtractor Trait
    /// Trait for implementing string extraction strategies
    pub trait StringExtractor {
        /// Extract strings from binary data with metadata context
        fn extract(
            &self,
            data: &[u8],
            container_info: &ContainerInfo,
            config: &ExtractionConfig,
        ) -> Result<Vec<FoundString>>;

        /// Extract strings from a specific section
        fn extract_from_section(
            &self,
            data: &[u8],
            section: &SectionInfo,
            config: &ExtractionConfig,
        ) -> Result<Vec<FoundString>>;
    }
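One way the two trait methods might relate is for `extract` to walk the sections listed in the container metadata and hand each byte range to `extract_from_section`. The sketch below illustrates that idea under loud assumptions: `FoundString`, `SectionInfo`, `ContainerInfo` and `ExtractionConfig` are simplified stand-ins with hypothetical fields (`offset`, `size`, `sections`), the error type is a plain `String` rather than the crate's own `Result` alias, and `NulSeparated` is a toy strategy invented for illustration.

```rust
// Hypothetical, simplified stand-ins for the crate's real types.
#[derive(Debug, Clone, PartialEq)]
pub struct FoundString {
    pub offset: usize, // absolute file offset of the first byte
    pub text: String,
}

pub struct SectionInfo {
    pub offset: usize, // assumed field: file offset of the section
    pub size: usize,   // assumed field: section length in bytes
}

pub struct ContainerInfo {
    pub sections: Vec<SectionInfo>, // assumed field
}

pub struct ExtractionConfig {
    pub min_length: usize,
}

pub trait StringExtractor {
    /// Extract strings from one section's byte range.
    fn extract_from_section(
        &self,
        data: &[u8],
        section: &SectionInfo,
        config: &ExtractionConfig,
    ) -> Result<Vec<FoundString>, String>;

    /// Default implementation: visit each section in order, stop at the first error.
    fn extract(
        &self,
        data: &[u8],
        container_info: &ContainerInfo,
        config: &ExtractionConfig,
    ) -> Result<Vec<FoundString>, String> {
        let mut found = Vec::new();
        for section in &container_info.sections {
            found.extend(self.extract_from_section(data, section, config)?);
        }
        Ok(found)
    }
}

/// Toy strategy (not part of the proposal): NUL-separated runs of printable ASCII.
pub struct NulSeparated;

impl StringExtractor for NulSeparated {
    fn extract_from_section(
        &self,
        data: &[u8],
        section: &SectionInfo,
        config: &ExtractionConfig,
    ) -> Result<Vec<FoundString>, String> {
        // Reject ranges that overflow or fall outside the file instead of panicking.
        let end = section
            .offset
            .checked_add(section.size)
            .ok_or("section range overflows")?;
        let bytes = data
            .get(section.offset..end)
            .ok_or("section lies outside the file")?;

        let mut found = Vec::new();
        let mut start = 0; // offset of the current piece within the section
        for piece in bytes.split(|&b| b == 0) {
            let printable = piece.iter().all(|b| b.is_ascii_graphic() || *b == b' ');
            if printable && piece.len() >= config.min_length {
                found.push(FoundString {
                    offset: section.offset + start,
                    text: String::from_utf8_lossy(piece).into_owned(),
                });
            }
            start += piece.len() + 1; // skip the piece and its NUL separator
        }
        Ok(found)
    }
}
```

With a default `extract` like this, a new strategy only has to implement the per-section method, and a malformed section table surfaces as an `Err` rather than a panic. Whether the real trait should carry a default method at all is a design choice left open here.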
2. ExtractionConfig Structure
    /// Configuration parameters for string extraction
    #[derive(Debug, Clone)]
    pub struct ExtractionConfig {
        /// Minimum string length to consider
        pub min_length: usize,
        /// Maximum string length to extract
        pub max_length: usize,
        /// Encodings to search for
        pub encodings: Vec<Encoding>,
        /// Whether to extract from executable sections
        pub scan_code_sections: bool,
        /// Whether to include strings from debug sections
        pub include_debug: bool,
        /// Section types to prioritize
        pub section_priority: Vec<SectionType>,
        /// Whether to include import/export names
        pub include_symbols: bool,
    }

    impl Default for ExtractionConfig {
        fn default() -> Self {
            Self {
                min_length: 4,
                max_length: 4096,
                encodings: vec![Encoding::Ascii, Encoding::Utf8],
                scan_code_sections: true,
                include_debug: false,
                section_priority: vec![
                    SectionType::StringData,
                    SectionType::ReadOnlyData,
                    SectionType::Resources,
                ],
                include_symbols: true,
            }
        }
    }
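Because `ExtractionConfig` implements `Default`, callers could override only the fields they care about using struct update syntax. The following sketch repeats the proposed struct so it compiles on its own; the two enums are cut down to the variants named in this issue, and `strict_config` and `is_valid` are hypothetical helpers added for illustration, not part of the proposal.

```rust
// Cut-down stand-ins: only the variants mentioned in this issue.
#[derive(Debug, Clone, PartialEq)]
pub enum Encoding {
    Ascii,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
pub enum SectionType {
    StringData,
    ReadOnlyData,
    Resources,
}

#[derive(Debug, Clone)]
pub struct ExtractionConfig {
    pub min_length: usize,
    pub max_length: usize,
    pub encodings: Vec<Encoding>,
    pub scan_code_sections: bool,
    pub include_debug: bool,
    pub section_priority: Vec<SectionType>,
    pub include_symbols: bool,
}

impl Default for ExtractionConfig {
    fn default() -> Self {
        Self {
            min_length: 4,
            max_length: 4096,
            encodings: vec![Encoding::Ascii, Encoding::Utf8],
            scan_code_sections: true,
            include_debug: false,
            section_priority: vec![
                SectionType::StringData,
                SectionType::ReadOnlyData,
                SectionType::Resources,
            ],
            include_symbols: true,
        }
    }
}

impl ExtractionConfig {
    /// Hypothetical sanity check: the length bounds are ordered and non-zero,
    /// and at least one encoding is enabled.
    pub fn is_valid(&self) -> bool {
        self.min_length > 0
            && self.min_length <= self.max_length
            && !self.encodings.is_empty()
    }
}

/// Hypothetical preset: longer strings only, code sections skipped.
/// Every field not listed falls back to the default.
pub fn strict_config() -> ExtractionConfig {
    ExtractionConfig {
        min_length: 8,
        scan_code_sections: false,
        ..ExtractionConfig::default()
    }
}
```

Keeping all fields public makes this kind of partial override cheap; a builder would be an alternative if validation needs to run at construction time.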
3. Basic Implementation
Create a BasicExtractor that implements StringExtractor with straightforward sequential scanning for ASCII and UTF-8 strings.
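A minimal sketch of what "straightforward sequential scanning" could look like for the ASCII case: a single pass that tracks the start of the current printable run and emits it when the run ends. The function name `scan_ascii`, the `(offset, &str)` return shape, the choice of which bytes count as printable, and the decision to truncate (rather than skip) runs longer than `max_length` are all assumptions made for illustration.

```rust
/// Returns `(offset, text)` for every run of printable ASCII that is at least
/// `min_length` bytes long. Runs longer than `max_length` are truncated
/// (an assumption; skipping them would be an equally valid reading).
/// The returned `&str` values borrow from `data`, so nothing is copied.
pub fn scan_ascii(data: &[u8], min_length: usize, max_length: usize) -> Vec<(usize, &str)> {
    let mut found = Vec::new();
    let mut start: Option<usize> = None; // start offset of the run being read

    // Iterate one position past the end so a run touching the end is flushed.
    for i in 0..=data.len() {
        let printable = i < data.len()
            && (data[i] == b' ' || data[i] == b'\t' || data[i].is_ascii_graphic());

        match (start, printable) {
            // A printable byte outside a run opens a new run.
            (None, true) => start = Some(i),
            // A non-printable byte (or end of input) closes the current run.
            (Some(s), false) => {
                let len = i - s;
                if len >= min_length {
                    let end = s + len.min(max_length);
                    // Printable ASCII is always valid UTF-8, so this cannot fail.
                    if let Ok(text) = std::str::from_utf8(&data[s..end]) {
                        found.push((s, text));
                    }
                }
                start = None;
            }
            _ => {}
        }
    }
    found
}
```

A `BasicExtractor` could wrap a function like this and convert each pair into a `FoundString`, filling in the encoding and source fields from context.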
Architecture Integration
The extraction framework fits into the pipeline as follows:
    Binary Data → Container Parser → ContainerInfo
                          ↓
          StringExtractor + ExtractionConfig
                          ↓
                  Vec<FoundString>
                          ↓
               Classification (future)
                          ↓
                  Output Formatting
Implementation Details
File Structure
- Main trait and config: src/extraction/mod.rs
- Basic extractor: src/extraction/basic.rs (or inline in mod.rs initially)
- Tests: Unit tests in each module, integration tests in tests/
Key Considerations
- Use the existing FoundString type (already defined in src/types.rs)
- Leverage ContainerInfo to make format-aware decisions
- Ensure zero-copy where possible (use slices and indices)
- Handle encoding detection robustly (invalid sequences should not panic)
- Set appropriate StringSource values based on where strings are found
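The zero-copy and no-panic points can be illustrated together. The sketch below, which uses only the standard library, walks a buffer with `std::str::from_utf8` and uses `Utf8Error::valid_up_to` and `Utf8Error::error_len` to step over invalid bytes, returning borrowed slices with their offsets. The function name `valid_utf8_chunks` is hypothetical, and a real extractor would still need to filter the chunks for printable characters and minimum length.

```rust
/// Splits `data` into maximal valid UTF-8 chunks, skipping invalid bytes.
/// Each result is `(offset, text)` where `text` borrows from `data`.
pub fn valid_utf8_chunks(data: &[u8]) -> Vec<(usize, &str)> {
    let mut chunks = Vec::new();
    let mut pos = 0;

    while pos < data.len() {
        match std::str::from_utf8(&data[pos..]) {
            // The whole remainder is valid: emit it and finish.
            Ok(text) => {
                chunks.push((pos, text));
                break;
            }
            Err(err) => {
                let valid = err.valid_up_to();
                if valid > 0 {
                    // This prefix was just validated, so the second call succeeds.
                    if let Ok(text) = std::str::from_utf8(&data[pos..pos + valid]) {
                        chunks.push((pos, text));
                    }
                }
                match err.error_len() {
                    // Skip the invalid bytes and keep scanning.
                    Some(bad) => pos += valid + bad,
                    // `None` means the input ended mid-sequence: nothing more to read.
                    None => break,
                }
            }
        }
    }
    chunks
}
```

Each call to `from_utf8` stops at the first invalid byte, so the loop stays linear in the input size, and no byte is ever copied or reinterpreted without validation.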
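Acceptance Criteria
- StringExtractor trait defined with documented methods
- ExtractionConfig struct with sensible defaults
- … (BasicExtractor)
- … ContainerInfo architecture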
Related
Notes
- The issue originally mentioned a RawString struct, but FoundString already exists in src/types.rs and should be used instead
- Future extractors can implement format-specific logic (e.g., PE resource strings, Mach-O LC_ID_DYLIB)
- This is the foundation for the entire extraction pipeline, so it needs to be solid and extensible