Summary
Create the main extraction orchestrator that serves as the primary API for analyzing binaries and extracting meaningful strings. This orchestrator will wire together all components (format detection, parsing, extraction, classification, and ranking) into a cohesive pipeline.
Background
StringyMcStringFace aims to be a smarter alternative to the standard strings command by being data-structure aware, section-aware, and semantically intelligent. The foundation is solid with working binary format parsers (PE, ELF, Mach-O) via goblin, but the core extraction pipeline that coordinates all components needs to be implemented.
Current Status:
- ✅ Format detection (container::detect_format)
- ✅ Container parsers (ContainerParser trait with PE/ELF/Mach-O implementations)
- ✅ Type definitions (FoundString, ContainerInfo, etc.)
- ❌ String extraction engine (empty extraction/mod.rs)
- ❌ Classification/tagging system (empty classification/mod.rs)
- ❌ Ranking/scoring algorithm
- ❌ Main orchestrator API
Proposed Solution
Architecture
Create a StringAnalyzer orchestrator in src/lib.rs that provides a clean public API:
pub struct StringAnalyzer {
    min_length: usize,
    encodings: Vec<Encoding>,
    // ... configuration
}

impl StringAnalyzer {
    pub fn new() -> Self { /* ... */ }

    pub fn analyze(&self, data: &[u8]) -> Result<Vec<FoundString>> {
        // 1. Detect format
        // 2. Parse container metadata
        // 3. Extract strings from prioritized sections
        // 4. Classify/tag strings
        // 5. Score/rank strings
        // 6. Return sorted results
    }
}
Implementation Plan
Phase 1: Core Extraction (extraction/mod.rs)
- Implement StringExtractor trait with methods:
  - extract_ascii_utf8(data: &[u8], offset: usize) -> Vec<FoundString>
  - extract_utf16le(data: &[u8], offset: usize) -> Vec<FoundString>
  - extract_utf16be(data: &[u8], offset: usize) -> Vec<FoundString>
- Section-aware extraction that respects SectionInfo boundaries
- Configurable minimum length (default 4)
- Track source (section name, offset, RVA)
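As a rough illustration of the Phase 1 scanning logic, the sketch below finds printable ASCII runs and aligned UTF-16LE runs of at least a minimum length. It makes several assumptions: the FoundString here is a simplified stand-in for the project's real type, min_length is passed as an extra parameter rather than read from configuration, the UTF-8 half of extract_ascii_utf8 is left out, and the UTF-16LE scan only checks even offsets relative to the start of the buffer.

```rust
/// Hypothetical, simplified stand-in for the project's `FoundString` type.
#[derive(Debug, Clone, PartialEq)]
pub struct FoundString {
    pub text: String,
    /// Base offset plus the position of the first byte inside `data`.
    pub offset: usize,
}

/// Bytes treated as printable: ASCII space through tilde, plus tab.
fn is_printable(b: u8) -> bool {
    b == b'\t' || (0x20..=0x7e).contains(&b)
}

/// Scan `data` for runs of printable ASCII at least `min_length` bytes long.
pub fn extract_ascii(data: &[u8], offset: usize, min_length: usize) -> Vec<FoundString> {
    let mut found = Vec::new();
    let mut start: Option<usize> = None;
    // Iterate one past the end so a run touching the end of the buffer is flushed.
    for i in 0..=data.len() {
        let printable = i < data.len() && is_printable(data[i]);
        match (printable, start) {
            (true, None) => start = Some(i),
            (false, Some(s)) => {
                if i - s >= min_length {
                    // Printable ASCII is valid UTF-8, so the lossy conversion is exact.
                    let text = String::from_utf8_lossy(&data[s..i]).into_owned();
                    found.push(FoundString { text, offset: offset + s });
                }
                start = None;
            }
            _ => {}
        }
    }
    found
}

/// Scan `data` for UTF-16LE runs whose code units are printable ASCII
/// (low byte printable, high byte zero). `min_length` counts characters.
pub fn extract_utf16le(data: &[u8], offset: usize, min_length: usize) -> Vec<FoundString> {
    let mut found = Vec::new();
    let mut start: Option<usize> = None;
    let units = data.len() / 2;
    for u in 0..=units {
        let printable = u < units && data[2 * u + 1] == 0 && is_printable(data[2 * u]);
        match (printable, start) {
            (true, None) => start = Some(u),
            (false, Some(s)) => {
                if u - s >= min_length {
                    let text: String = data[2 * s..2 * u]
                        .chunks_exact(2)
                        .map(|unit| unit[0] as char)
                        .collect();
                    found.push(FoundString { text, offset: offset + 2 * s });
                }
                start = None;
            }
            _ => {}
        }
    }
    found
}
```

Passing the section's file offset as the base offset keeps reported positions absolute, which is one way to satisfy the "track source" requirement without the extractor knowing about sections.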
Phase 2: Classification System (classification/mod.rs)
- Implement StringClassifier with pattern matching for:
  - URLs (http://, https://, ftp://)
  - Domains (DNS patterns)
  - IP addresses (IPv4/IPv6)
  - File paths (Unix: /, Windows: C:\\, /)
  - Registry keys (HKEY_*)
  - GUIDs ({xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx})
  - User agents
  - Format strings (%s, %d, {})
  - Base64 patterns
  - Crypto constants (known algorithm identifiers)
- Return Vec<Tag> for each string
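A few of the Phase 2 patterns can be sketched with the standard library alone, before reaching for regex. The Tag enum and the classify function below are hypothetical names used for illustration; only a subset of the listed categories is covered, and each check matches the whole (trimmed) string rather than searching inside it.

```rust
use std::net::{Ipv4Addr, Ipv6Addr};

/// Hypothetical tag set covering a subset of the categories listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tag {
    Url,
    Ipv4,
    Ipv6,
    RegistryKey,
    Guid,
    FormatString,
}

/// True for `{xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}` with hexadecimal digits.
fn is_guid(s: &str) -> bool {
    let b = s.as_bytes();
    if b.len() != 38 || b[0] != b'{' || b[37] != b'}' {
        return false;
    }
    // Inside the braces, hyphens sit at positions 8, 13, 18 and 23.
    b[1..37].iter().enumerate().all(|(i, &c)| match i {
        8 | 13 | 18 | 23 => c == b'-',
        _ => c.is_ascii_hexdigit(),
    })
}

/// Return every tag whose pattern matches the string.
pub fn classify(s: &str) -> Vec<Tag> {
    let t = s.trim();
    let mut tags = Vec::new();
    if ["http://", "https://", "ftp://"].iter().any(|p| t.starts_with(*p)) {
        tags.push(Tag::Url);
    }
    // The standard library parsers accept only complete, well-formed addresses.
    if t.parse::<Ipv4Addr>().is_ok() {
        tags.push(Tag::Ipv4);
    }
    if t.parse::<Ipv6Addr>().is_ok() {
        tags.push(Tag::Ipv6);
    }
    if t.starts_with("HKEY_") {
        tags.push(Tag::RegistryKey);
    }
    if is_guid(t) {
        tags.push(Tag::Guid);
    }
    if t.contains("%s") || t.contains("%d") || t.contains("{}") {
        tags.push(Tag::FormatString);
    }
    tags
}
```

Returning a Vec lets one string carry several tags, for example a URL that also contains a format placeholder.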
Phase 3: Ranking Algorithm
- Implement scoring system based on:
  - Section type priority (.rodata/.rdata > .data > .text)
  - Tag relevance (URL/domain/GUID = high, format string = medium)
  - String length (longer = more informative, up to a point)
  - Encoding confidence
  - Import/export name bonus
- Score range: 0-100
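One possible shape for the Phase 3 scoring function is an additive model clamped to 0-100. Every weight below is an assumption chosen so the maximum sums to exactly 100; SectionKind and Tag are hypothetical stand-ins for the project's types, and encoding confidence is omitted because the issue does not say how it would be measured.

```rust
/// Hypothetical section categories, ordered as in the priority list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SectionKind {
    ReadOnlyData, // .rodata / .rdata
    Data,         // .data
    Code,         // .text
    Other,
}

/// Hypothetical subset of classification tags relevant to scoring.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tag {
    Url,
    Domain,
    Guid,
    FormatString,
}

/// Additive score in the range 0-100. All weights are illustrative assumptions.
pub fn score(section: SectionKind, tags: &[Tag], len: usize, is_symbol: bool) -> u8 {
    let section_pts: u32 = match section {
        SectionKind::ReadOnlyData => 30,
        SectionKind::Data => 20,
        SectionKind::Other => 10,
        SectionKind::Code => 5,
    };
    // Only the most relevant tag counts, so piling up tags cannot inflate the score.
    let tag_pts: u32 = tags
        .iter()
        .map(|t| match t {
            Tag::Url | Tag::Domain | Tag::Guid => 30u32,
            Tag::FormatString => 15u32,
        })
        .max()
        .unwrap_or(0);
    // Length helps "up to a point": capped at 40 characters, worth at most 20 points.
    let length_pts = (len.min(40) / 2) as u32;
    // Bonus for strings that are also import/export names.
    let symbol_pts: u32 = if is_symbol { 20 } else { 0 };
    (section_pts + tag_pts + length_pts + symbol_pts).min(100) as u8
}
```

Under these weights, a 40-character URL in .rodata that is also a symbol name scores 100, while a 4-character untagged string in .text scores 7.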
Phase 4: Main Orchestrator
- Wire components together in StringAnalyzer
- Proper error handling with context propagation
- Configuration options (min length, encoding filters, tag filters)
- Memory-efficient processing for large binaries
- Integration with main.rs CLI
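The configuration options could be exposed through a small builder-style type. The sketch below is one way to do it, not the project's API: AnalyzerConfig, its methods, and the Encoding variants are all hypothetical names, and tag filters are left out.

```rust
/// Hypothetical encoding selector; the real `Encoding` type may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Ascii,
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Hypothetical configuration holder for the orchestrator.
#[derive(Debug, Clone, PartialEq)]
pub struct AnalyzerConfig {
    pub min_length: usize,
    pub encodings: Vec<Encoding>,
}

impl Default for AnalyzerConfig {
    /// Defaults follow the issue text: minimum length 4, all encodings enabled.
    fn default() -> Self {
        AnalyzerConfig {
            min_length: 4,
            encodings: vec![
                Encoding::Ascii,
                Encoding::Utf8,
                Encoding::Utf16Le,
                Encoding::Utf16Be,
            ],
        }
    }
}

impl AnalyzerConfig {
    /// Set the minimum string length; values below 1 are raised to 1.
    pub fn min_length(mut self, n: usize) -> Self {
        self.min_length = n.max(1);
        self
    }

    /// Restrict extraction to the given encodings.
    pub fn encodings(mut self, encodings: &[Encoding]) -> Self {
        self.encodings = encodings.to_vec();
        self
    }

    /// Whether strings in encoding `e` should be extracted.
    pub fn accepts(&self, e: Encoding) -> bool {
        self.encodings.contains(&e)
    }
}
```

A CLI front end could then map flags onto calls such as AnalyzerConfig::default().min_length(6).encodings(&[Encoding::Ascii]).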
Pipeline Flow
Binary Input
↓
Format Detection (container::detect_format)
↓
Parser Creation (container::create_parser)
↓
Container Parsing (parser.parse())
↓
Section Prioritization (by SectionType)
↓
String Extraction (extraction module)
↓
Classification/Tagging (classification module)
↓
Scoring/Ranking
↓
Sorted Results (Vec<FoundString>)
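The flow above can be sketched end to end in a few dozen lines. This is a toy version under stated assumptions: detect_format here is a magic-byte stand-in for container::detect_format (fat Mach-O binaries are not handled), container parsing is skipped so the caller supplies sections directly, and the Section and Ranked types, the priority values, and the URL bonus are all hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Elf,
    Pe,
    MachO,
}

/// Stand-in for format detection based on well-known magic bytes.
pub fn detect_format(data: &[u8]) -> Option<Format> {
    match data {
        [0x7f, b'E', b'L', b'F', ..] => Some(Format::Elf),
        [b'M', b'Z', ..] => Some(Format::Pe),
        [0xfe, 0xed, 0xfa, 0xce | 0xcf, ..] | [0xce | 0xcf, 0xfa, 0xed, 0xfe, ..] => {
            Some(Format::MachO)
        }
        _ => None,
    }
}

/// Hypothetical parsed section: a name, a priority, and its raw bytes.
pub struct Section<'a> {
    pub name: &'a str,
    pub priority: u32,
    pub data: &'a [u8],
}

/// Hypothetical pipeline output record.
#[derive(Debug, PartialEq)]
pub struct Ranked {
    pub text: String,
    pub section: String,
    pub score: u32,
}

pub fn analyze(
    image: &[u8],
    sections: &[Section<'_>],
    min_length: usize,
) -> Result<Vec<Ranked>, String> {
    // Format detection: bail out early on unknown containers.
    let _format =
        detect_format(image).ok_or_else(|| "unrecognized container format".to_string())?;

    // Section prioritization: highest priority first.
    let mut ordered: Vec<&Section<'_>> = sections.iter().collect();
    ordered.sort_by(|a, b| b.priority.cmp(&a.priority));

    let mut results = Vec::new();
    for section in ordered {
        // Extraction: split on non-printable bytes, keep runs that are long enough.
        for run in section.data.split(|b| !(0x20..=0x7e).contains(b)) {
            if run.len() < min_length.max(1) {
                continue;
            }
            let text = String::from_utf8_lossy(run).into_owned();
            // Classification (placeholder): a single URL check.
            let is_url = text.starts_with("http://") || text.starts_with("https://");
            // Scoring (placeholder): section priority plus a URL bonus.
            let score = section.priority + if is_url { 50 } else { 0 };
            results.push(Ranked { text, section: section.name.to_string(), score });
        }
    }

    // Sorted results: descending score; the stable sort keeps section order on ties.
    results.sort_by(|a, b| b.score.cmp(&a.score));
    Ok(results)
}
```

Keeping each stage as a separate step inside one function makes it easy to swap the placeholders for the real extraction, classification, and ranking modules as they land.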
Implementation Requirements
- Error Handling: Use StringyError throughout with proper context
- Testing: Unit tests for each component, integration test for full pipeline
- Performance: Process large binaries (100MB+) efficiently
- Documentation: Doc comments for public API with examples
- Integration: Wire into main.rs CLI to replace TODO
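For the error-handling requirement, one common pattern is an extension trait that wraps any lower-level error with a context message while preserving the original as the source. The variants of StringyError shown here are assumptions made for illustration (the real enum is defined elsewhere in the crate and may look different), and the Context trait is a hypothetical helper.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical shape for `StringyError`; the real definition may differ.
#[derive(Debug)]
pub enum StringyError {
    UnsupportedFormat,
    WithContext {
        context: String,
        source: Box<dyn Error + Send + Sync>,
    },
}

impl fmt::Display for StringyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StringyError::UnsupportedFormat => write!(f, "unsupported container format"),
            StringyError::WithContext { context, source } => {
                write!(f, "{}: {}", context, source)
            }
        }
    }
}

impl Error for StringyError {
    /// Expose the wrapped error so callers can walk the cause chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            StringyError::UnsupportedFormat => None,
            StringyError::WithContext { source, .. } => Some(&**source),
        }
    }
}

/// Hypothetical extension trait that attaches a context message to any error.
pub trait Context<T> {
    fn context(self, msg: &str) -> Result<T, StringyError>;
}

impl<T, E> Context<T> for Result<T, E>
where
    E: Error + Send + Sync + 'static,
{
    fn context(self, msg: &str) -> Result<T, StringyError> {
        self.map_err(|e| StringyError::WithContext {
            context: msg.to_string(),
            source: Box::new(e),
        })
    }
}
```

With this in place, a pipeline stage could write something like parse_step().context("parsing section table")? and the final error message would name the failing stage.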
Dependencies
- Existing: goblin, bstr, regex, serde
- May need: Pattern matching crates for classification
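Acceptance Criteria
- StringAnalyzer public API implemented in src/lib.rs
- main.rs calls orchestrator and displays results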
Related Issues
This is the main integration point that blocks most other features. Once complete, we can add:
- Output formatters (JSON, YARA, human-readable)
- Advanced filtering and search
- Performance optimizations
- Additional classification patterns
Task-ID: stringy-analyzer/main-extraction-pipeline