Core String Extraction Framework: StringExtractor Trait and Configuration

## Background

StringyMcStringFace aims to be a smarter alternative to the `strings` command by leveraging format-specific knowledge to extract and prioritize meaningful strings from binaries. The container parsing layer (ELF, PE, Mach-O) is already implemented and provides structured metadata via `ContainerInfo`. Now we need the core extraction framework that will scan binary data and produce `FoundString` instances.

## Current State

- ✅ Container parsers implemented (`src/container/`)
- ✅ Type definitions complete (`FoundString`, `Encoding`, `SectionType`, etc.)
- ❌ String extraction logic missing (`src/extraction/mod.rs` is empty)
- ❌ No extraction configuration framework
- ❌ No trait abstraction for different extraction strategies

## Objectives

Create a flexible, trait-based extraction framework in `src/extraction/mod.rs` that:

1. **Defines the `StringExtractor` trait** - An interface for different extraction strategies (basic ASCII, Unicode, context-aware, etc.)
2. **Implements `ExtractionConfig`** - Configurable parameters for extraction behavior
3. **Provides a default extractor** - A basic implementation to get started

## Proposed Solution

### 1. StringExtractor Trait

```rust
/// Trait for implementing string extraction strategies
pub trait StringExtractor {
    /// Extract strings from binary data with metadata context
    fn extract(
        &self,
        data: &[u8],
        container_info: &ContainerInfo,
        config: &ExtractionConfig,
    ) -> Result<Vec<FoundString>>;
    
    /// Extract strings from a specific section
    fn extract_from_section(
        &self,
        data: &[u8],
        section: &SectionInfo,
        config: &ExtractionConfig,
    ) -> Result<Vec<FoundString>>;
}
```
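One detail `extract_from_section` has to get right is turning a section's offset and size into a byte slice without panicking on truncated or malformed files. The sketch below shows one way to do this with checked arithmetic. The `SectionInfo` struct here is a hypothetical stand-in with assumed field names (`name`, `offset`, `size`), and `section_bytes` is an illustrative helper, not part of the proposal; the project's real `SectionInfo` may look different.

```rust
/// Hypothetical stand-in for the project's `SectionInfo`.
/// The real type's field names and integer widths may differ.
pub struct SectionInfo {
    pub name: String,
    pub offset: usize,
    pub size: usize,
}

/// Return the bytes that `section` covers inside `data`.
///
/// Yields `None` instead of panicking when the section range overflows
/// or extends past the end of the file (e.g. a truncated binary).
pub fn section_bytes<'a>(data: &'a [u8], section: &SectionInfo) -> Option<&'a [u8]> {
    // checked_add guards against offset + size wrapping around.
    let end = section.offset.checked_add(section.size)?;
    // slice::get returns None for out-of-bounds ranges.
    data.get(section.offset..end)
}
```

An implementation of `extract_from_section` could map the `None` case to an error or to an empty result; which of the two is appropriate is left open here.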

### 2. ExtractionConfig Structure

```rust
/// Configuration parameters for string extraction
#[derive(Debug, Clone)]
pub struct ExtractionConfig {
    /// Minimum string length to consider
    pub min_length: usize,
    
    /// Maximum string length to extract
    pub max_length: usize,
    
    /// Encodings to search for
    pub encodings: Vec<Encoding>,
    
    /// Whether to extract from executable sections
    pub scan_code_sections: bool,
    
    /// Whether to include strings from debug sections
    pub include_debug: bool,
    
    /// Section types to prioritize
    pub section_priority: Vec<SectionType>,
    
    /// Whether to include import/export names
    pub include_symbols: bool,
}

impl Default for ExtractionConfig {
    fn default() -> Self {
        Self {
            min_length: 4,
            max_length: 4096,
            encodings: vec![Encoding::Ascii, Encoding::Utf8],
            scan_code_sections: true,
            include_debug: false,
            section_priority: vec![
                SectionType::StringData,
                SectionType::ReadOnlyData,
                SectionType::Resources,
            ],
            include_symbols: true,
        }
    }
}
```
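Callers will usually want to change only one or two fields, and the acceptance criteria below mention configuration validation. The following sketch shows struct update syntax on top of `Default`, plus a possible validation check. It uses a trimmed stand-in for `ExtractionConfig` with only three of the fields so that it compiles on its own; `validate` and `debug_friendly_config` are hypothetical names that do not appear in the proposal.

```rust
/// Trimmed stand-in for the `ExtractionConfig` above (three fields only),
/// so that this sketch compiles without the rest of the crate.
#[derive(Debug, Clone, PartialEq)]
pub struct ExtractionConfig {
    pub min_length: usize,
    pub max_length: usize,
    pub include_debug: bool,
}

impl Default for ExtractionConfig {
    fn default() -> Self {
        Self { min_length: 4, max_length: 4096, include_debug: false }
    }
}

impl ExtractionConfig {
    /// Hypothetical validation helper; not part of the proposal.
    /// Rejects length bounds that could never match any string.
    pub fn validate(&self) -> Result<(), String> {
        if self.min_length == 0 {
            return Err("min_length must be at least 1".to_string());
        }
        if self.max_length < self.min_length {
            return Err(format!(
                "max_length ({}) is smaller than min_length ({})",
                self.max_length, self.min_length
            ));
        }
        Ok(())
    }
}

/// Override two fields and keep the defaults for everything else.
pub fn debug_friendly_config() -> ExtractionConfig {
    ExtractionConfig {
        min_length: 6,
        include_debug: true,
        ..ExtractionConfig::default()
    }
}
```

Returning `Result<(), String>` keeps the sketch short; the real crate would more likely use its own error type.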

### 3. Basic Implementation

Create a `BasicExtractor` that implements `StringExtractor` with straightforward sequential scanning for ASCII and UTF-8 strings.
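As a rough illustration of what sequential scanning could mean for the ASCII case, the sketch below walks the buffer once and collects runs of printable bytes. `scan_ascii` and `is_printable` are hypothetical helpers, and it returns `(offset, &str)` pairs rather than `FoundString` so that it stays self-contained. Two choices here are assumptions rather than decisions from this issue: space and tab count as printable, and runs longer than `max_length` are skipped rather than truncated.

```rust
/// Scan `data` for runs of printable ASCII whose length lies within
/// `min_length..=max_length`. Returns `(offset, text)` pairs that borrow
/// from `data`, so no string bytes are copied.
pub fn scan_ascii(data: &[u8], min_length: usize, max_length: usize) -> Vec<(usize, &str)> {
    let mut found = Vec::new();
    let mut start: Option<usize> = None;

    // Iterate one position past the end so a run touching EOF is flushed.
    for i in 0..=data.len() {
        let printable = i < data.len() && is_printable(data[i]);
        match (printable, start) {
            // A new run begins here.
            (true, None) => start = Some(i),
            // The current run ended at the previous byte.
            (false, Some(s)) => {
                let run = &data[s..i];
                if run.len() >= min_length && run.len() <= max_length {
                    // The run is pure ASCII, so this conversion cannot fail;
                    // `if let` simply avoids an unwrap.
                    if let Ok(text) = std::str::from_utf8(run) {
                        found.push((s, text));
                    }
                }
                start = None;
            }
            _ => {}
        }
    }
    found
}

/// Assumed definition of "printable": visible ASCII plus space and tab.
fn is_printable(byte: u8) -> bool {
    byte == b' ' || byte == b'\t' || byte.is_ascii_graphic()
}
```

For the input `b"\x00\x01hello\x00ab\x00world!!"` with a minimum length of 4, this yields `"hello"` at offset 2 and `"world!!"` at offset 11; `"ab"` is dropped as too short.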

## Architecture Integration

The extraction framework fits into the pipeline as follows:

```
Binary Data → Container Parser → ContainerInfo
                                        ↓
                                  StringExtractor + ExtractionConfig
                                        ↓
                                  Vec<FoundString>
                                        ↓
                                  Classification (future)
                                        ↓
                                  Output Formatting
```

## Implementation Details

### File Structure
- Main trait and config: `src/extraction/mod.rs`
- Basic extractor: `src/extraction/basic.rs` (or inline in `mod.rs` initially)
- Tests: Unit tests in each module, integration tests in `tests/`

### Key Considerations
- Use existing `FoundString` type (already defined in `src/types.rs`)
- Leverage `ContainerInfo` to make format-aware decisions
- Avoid copying where possible (borrow slices of the input and record offsets)
- Handle encoding detection robustly (invalid byte sequences must be skipped or reported, never cause a panic)
- Set appropriate `StringSource` values based on where strings are found
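The points about borrowing slices and tolerating invalid sequences can be combined in one sketch. It relies on `std::str::from_utf8` together with `Utf8Error::valid_up_to` and `Utf8Error::error_len` to split a buffer into valid UTF-8 chunks while stepping over bad bytes. `valid_utf8_chunks` is a hypothetical helper, not part of the proposal, and a real extractor would still apply printable-character and length filtering to each chunk.

```rust
/// Split `data` into maximal valid UTF-8 chunks, skipping invalid bytes.
/// Returns `(offset, text)` pairs that borrow from `data`; never panics.
pub fn valid_utf8_chunks(data: &[u8]) -> Vec<(usize, &str)> {
    let mut chunks = Vec::new();
    let mut pos = 0;

    while pos < data.len() {
        match std::str::from_utf8(&data[pos..]) {
            // Everything from `pos` to the end is valid.
            Ok(text) => {
                chunks.push((pos, text));
                break;
            }
            Err(err) => {
                // Bytes pos..pos + valid are known to be valid UTF-8.
                let valid = err.valid_up_to();
                if valid > 0 {
                    if let Ok(text) = std::str::from_utf8(&data[pos..pos + valid]) {
                        chunks.push((pos, text));
                    }
                }
                // `error_len()` is `None` when the input ends in the middle
                // of a multi-byte sequence; skip the remainder in that case.
                let skip = err.error_len().unwrap_or(data.len() - pos - valid);
                // `max(1)` guarantees forward progress.
                pos += valid + skip.max(1);
            }
        }
    }
    chunks
}
```

Each error triggers a re-validation of the remaining buffer, so heavily corrupted input makes this quadratic in the worst case; a production extractor would probably scan incrementally instead.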

## Acceptance Criteria

- [ ] `StringExtractor` trait defined with documented methods
- [ ] `ExtractionConfig` struct with sensible defaults
- [ ] At least one working implementation (`BasicExtractor`)
- [ ] Unit tests covering:
  - ASCII string extraction
  - UTF-8 string extraction
  - Configuration validation
  - Edge cases (empty data, invalid encodings, boundary conditions)
- [ ] Documentation with usage examples
- [ ] Integration with existing `ContainerInfo` architecture

## Related

- **Requirement**: 2.1 (String Extraction Framework)
- **Task ID**: stringy-analyzer/string-extraction-framework
- **Blocks**: Classification and scoring functionality (future issues)
- **Repository**: https://github.com/EvilBit-Labs/StringyMcStringFace

## Notes

- The issue originally mentioned a `RawString` struct, but `FoundString` already exists in `src/types.rs` and should be used instead
- Future extractors can implement format-specific logic (e.g., PE resource strings, Mach-O `LC_ID_DYLIB`)
- This is the foundation for the entire extraction pipeline, so it needs to be solid and extensible