Core String Extraction Framework: StringExtractor Trait and Configuration #8

@unclesp1d3r

Background

StringyMcStringFace aims to be a smarter alternative to the strings command by leveraging format-specific knowledge to extract and prioritize meaningful strings from binaries. The container parsing layer (ELF, PE, Mach-O) is already implemented and provides structured metadata via ContainerInfo. Now we need the core extraction framework that will scan binary data and produce FoundString instances.

Current State

  • ✅ Container parsers implemented (src/container/)
  • ✅ Type definitions complete (FoundString, Encoding, SectionType, etc.)
  • ❌ String extraction logic missing (src/extraction/mod.rs is empty)
  • ❌ No extraction configuration framework
  • ❌ No trait abstraction for different extraction strategies

Objectives

Create a flexible, trait-based extraction framework in src/extraction/mod.rs that:

  1. Defines the StringExtractor trait - An interface for different extraction strategies (basic ASCII, Unicode, context-aware, etc.)
  2. Implements ExtractionConfig - Configurable parameters for extraction behavior
  3. Provides a default extractor - A basic implementation to get started

Proposed Solution

1. StringExtractor Trait

/// Trait for implementing string extraction strategies
pub trait StringExtractor {
    /// Extract strings from binary data with metadata context
    fn extract(
        &self,
        data: &[u8],
        container_info: &ContainerInfo,
        config: &ExtractionConfig,
    ) -> Result<Vec<FoundString>>;
    
    /// Extract strings from a specific section
    fn extract_from_section(
        &self,
        data: &[u8],
        section: &SectionInfo,
        config: &ExtractionConfig,
    ) -> Result<Vec<FoundString>>;
}
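As a rough illustration of how the two methods might relate, the sketch below has `extract` walk the sections and delegate to `extract_from_section`. Every type here is a simplified stand-in: the fields `sections`, `name`, `offset`, and `size`, the `String` error type, and the reduced `FoundString` are assumptions made for the example, not the project's real definitions. It also assumes `data` is the whole file and that section offsets are file offsets.

```rust
// Simplified stand-ins so the sketch compiles on its own; the project's
// real ContainerInfo, SectionInfo and FoundString will differ.
pub type Result<T> = std::result::Result<T, String>;

pub struct SectionInfo {
    pub name: String,
    pub offset: usize, // assumed: file offset of the section
    pub size: usize,   // assumed: size of the section in bytes
}

pub struct ContainerInfo {
    pub sections: Vec<SectionInfo>,
}

pub struct ExtractionConfig {
    pub min_length: usize,
}

#[derive(Debug, PartialEq)]
pub struct FoundString {
    pub text: String,
    pub offset: usize,
    pub section: String,
}

pub trait StringExtractor {
    fn extract(
        &self,
        data: &[u8],
        container_info: &ContainerInfo,
        config: &ExtractionConfig,
    ) -> Result<Vec<FoundString>>;

    fn extract_from_section(
        &self,
        data: &[u8],
        section: &SectionInfo,
        config: &ExtractionConfig,
    ) -> Result<Vec<FoundString>>;
}

pub struct BasicExtractor;

impl StringExtractor for BasicExtractor {
    fn extract(
        &self,
        data: &[u8],
        container_info: &ContainerInfo,
        config: &ExtractionConfig,
    ) -> Result<Vec<FoundString>> {
        // Delegate to the per-section method and concatenate the results.
        let mut all = Vec::new();
        for section in &container_info.sections {
            all.extend(self.extract_from_section(data, section, config)?);
        }
        Ok(all)
    }

    fn extract_from_section(
        &self,
        data: &[u8],
        section: &SectionInfo,
        config: &ExtractionConfig,
    ) -> Result<Vec<FoundString>> {
        // Reject sections that overflow or run past the end of the input.
        let end = section
            .offset
            .checked_add(section.size)
            .filter(|&e| e <= data.len())
            .ok_or_else(|| format!("section {} exceeds the input", section.name))?;
        let bytes = &data[section.offset..end];

        let mut found = Vec::new();
        let mut pos = section.offset;
        // Split on every byte that is not printable ASCII.
        for run in bytes.split(|b| !(0x20..=0x7e).contains(b)) {
            if run.len() >= config.min_length.max(1) {
                found.push(FoundString {
                    text: String::from_utf8_lossy(run).into_owned(),
                    offset: pos,
                    section: section.name.clone(),
                });
            }
            pos += run.len() + 1; // +1 skips the separator byte
        }
        Ok(found)
    }
}
```

Returning an error for an out-of-range section, rather than slicing directly, keeps a malformed header from causing a panic.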

2. ExtractionConfig Structure

/// Configuration parameters for string extraction
#[derive(Debug, Clone)]
pub struct ExtractionConfig {
    /// Minimum string length to consider
    pub min_length: usize,
    
    /// Maximum string length to extract
    pub max_length: usize,
    
    /// Encodings to search for
    pub encodings: Vec<Encoding>,
    
    /// Whether to extract from executable sections
    pub scan_code_sections: bool,
    
    /// Whether to include strings from debug sections
    pub include_debug: bool,
    
    /// Section types to prioritize
    pub section_priority: Vec<SectionType>,
    
    /// Whether to include import/export names
    pub include_symbols: bool,
}

impl Default for ExtractionConfig {
    fn default() -> Self {
        Self {
            min_length: 4,
            max_length: 4096,
            encodings: vec![Encoding::Ascii, Encoding::Utf8],
            scan_code_sections: true,
            include_debug: false,
            section_priority: vec![
                SectionType::StringData,
                SectionType::ReadOnlyData,
                SectionType::Resources,
            ],
            include_symbols: true,
        }
    }
}
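Callers would presumably override only the fields they care about and take the rest from `Default`. The sketch below shows that pattern with struct update syntax on a trimmed stand-in for the config (only four of the proposed fields). The `validate` method and the `debug_friendly_config` function are hypothetical names added for illustration; they are one possible shape for the "configuration validation" item in the acceptance criteria, not part of the proposal.

```rust
// Trimmed stand-ins so the example compiles on its own; the real
// Encoding and ExtractionConfig have more variants and fields.
#[derive(Debug, Clone, PartialEq)]
pub enum Encoding {
    Ascii,
    Utf8,
}

#[derive(Debug, Clone)]
pub struct ExtractionConfig {
    pub min_length: usize,
    pub max_length: usize,
    pub encodings: Vec<Encoding>,
    pub include_debug: bool,
}

impl Default for ExtractionConfig {
    fn default() -> Self {
        Self {
            min_length: 4,
            max_length: 4096,
            encodings: vec![Encoding::Ascii, Encoding::Utf8],
            include_debug: false,
        }
    }
}

impl ExtractionConfig {
    /// Hypothetical helper: reject configurations that could never
    /// match anything.
    pub fn validate(&self) -> Result<(), String> {
        if self.min_length == 0 {
            return Err("min_length must be at least 1".to_string());
        }
        if self.max_length < self.min_length {
            return Err("max_length must be >= min_length".to_string());
        }
        if self.encodings.is_empty() {
            return Err("at least one encoding is required".to_string());
        }
        Ok(())
    }
}

/// Override two fields and take every other value from `Default`.
pub fn debug_friendly_config() -> ExtractionConfig {
    ExtractionConfig {
        min_length: 6,
        include_debug: true,
        ..Default::default()
    }
}
```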

3. Basic Implementation

Create a BasicExtractor that implements StringExtractor with straightforward sequential scanning for ASCII and UTF-8 strings.
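A minimal sketch of what "sequential scanning" could look like for the ASCII case is shown below. The function name `scan_ascii`, the `(offset, text)` return shape, the printable range `0x20..=0x7e`, and the choice to truncate runs longer than `max_length` are all assumptions; the proposal does not specify them. The returned slices borrow from the input, so no string data is copied.

```rust
/// Returns (offset, text) pairs for every run of printable ASCII bytes
/// that is at least `min_length` long. Runs longer than `max_length`
/// are truncated, which is an assumption made for this sketch.
pub fn scan_ascii(data: &[u8], min_length: usize, max_length: usize) -> Vec<(usize, &str)> {
    let mut found = Vec::new();
    let mut start: Option<usize> = None;

    // Iterate one index past the end so that a run touching the end of
    // the buffer is flushed by the same code path as any other run.
    for i in 0..=data.len() {
        let printable = i < data.len() && (0x20..=0x7e).contains(&data[i]);
        match (printable, start) {
            // A printable byte outside a run opens a new run.
            (true, None) => start = Some(i),
            // A non-printable byte (or end of input) closes the run.
            (false, Some(s)) => {
                let len = i - s;
                if len >= min_length {
                    let end = s + len.min(max_length);
                    // The bytes are all ASCII, so this cannot fail;
                    // `if let` keeps the path panic-free regardless.
                    if let Ok(text) = std::str::from_utf8(&data[s..end]) {
                        found.push((s, text));
                    }
                }
                start = None;
            }
            _ => {}
        }
    }
    found
}
```

A real `BasicExtractor` would wrap each hit in a `FoundString` and fill in the encoding and section metadata.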

Architecture Integration

The extraction framework fits into the pipeline as follows:

Binary Data → Container Parser → ContainerInfo
                                        ↓
                                  StringExtractor + ExtractionConfig
                                        ↓
                                  Vec<FoundString>
                                        ↓
                                  Classification (future)
                                        ↓
                                  Output Formatting

Implementation Details

File Structure

  • Main trait and config: src/extraction/mod.rs
  • Basic extractor: src/extraction/basic.rs (or inline in mod.rs initially)
  • Tests: Unit tests in each module, integration tests in tests/

Key Considerations

  • Use existing FoundString type (already defined in src/types.rs)
  • Leverage ContainerInfo to make format-aware decisions
  • Ensure zero-copy where possible (use slices and indices)
  • Handle encoding detection robustly (invalid sequences should not panic)
  • Set appropriate StringSource values based on where strings are found
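For the "invalid sequences should not panic" point, one option is to lean on the standard library's `std::str::from_utf8` and the `valid_up_to` / `error_len` methods of `Utf8Error`. The sketch below splits a buffer into its maximal valid UTF-8 runs and skips invalid bytes. The function name `valid_utf8_runs` and the return shape are hypothetical, and the sketch does not filter control characters or apply length limits, which a real extractor would still need to do.

```rust
/// Splits `data` into maximal runs of valid UTF-8, skipping invalid
/// bytes. Returned slices borrow from `data`; nothing is copied.
pub fn valid_utf8_runs(data: &[u8]) -> Vec<(usize, &str)> {
    let mut runs = Vec::new();
    let mut offset = 0;

    while offset < data.len() {
        match std::str::from_utf8(&data[offset..]) {
            // The rest of the buffer is valid: record it and stop.
            Ok(text) => {
                runs.push((offset, text));
                break;
            }
            Err(err) => {
                // Keep the valid prefix that precedes the error, if any.
                let valid = err.valid_up_to();
                if valid > 0 {
                    if let Ok(text) = std::str::from_utf8(&data[offset..offset + valid]) {
                        runs.push((offset, text));
                    }
                }
                // `error_len()` is None when the input ends in the middle
                // of a multi-byte sequence; in that case skip the rest.
                let remaining = data.len() - offset - valid;
                let skip = err.error_len().unwrap_or(remaining);
                offset += valid + skip;
            }
        }
    }
    runs
}
```

Each iteration advances past at least one byte, so the loop terminates on arbitrary input.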

Acceptance Criteria

  • StringExtractor trait defined with documented methods
  • ExtractionConfig struct with sensible defaults
  • At least one working implementation (BasicExtractor)
  • Unit tests covering:
    • ASCII string extraction
    • UTF-8 string extraction
    • Configuration validation
    • Edge cases (empty data, invalid encodings, boundary conditions)
  • Documentation with usage examples
  • Integration with existing ContainerInfo architecture

Related

Notes

  • The issue originally mentioned a RawString struct, but FoundString already exists in src/types.rs and should be used instead
  • Future extractors can implement format-specific logic (e.g., PE resource strings, Mach-O LC_ID_DYLIB)
  • This is the foundation for the entire extraction pipeline, so it needs to be solid and extensible
