Summary
Extract strings from PE (Portable Executable) resource sections, including VERSIONINFO, STRINGTABLE, and manifest resources. This should improve string extraction quality for Windows executables by using the metadata and localized string resources that PE files contain.
Background
PE resources (stored in the .rsrc section) are a hierarchical data structure containing various resource types that are rich sources of meaningful strings:
- VERSIONINFO (RT_VERSION): Product names, file descriptions, company names, copyright information, version strings
- STRINGTABLE (RT_STRING): Localized UI strings, error messages, labels organized by language/locale
- Manifest (RT_MANIFEST): Application manifests containing assembly information, compatibility settings, DPI awareness, and UAC requirements
- Dialog Resources (RT_DIALOG): Future enhancement - control captions and labels
PE Resource Structure
PE resources are organized as a three-level tree:
- Type: Resource type (RT_VERSION, RT_STRING, RT_MANIFEST, etc.)
- Name/ID: Resource identifier (numeric ID or string name)
- Language: Language identifier (a 16-bit LANGID combining primary language and sublanguage)
Each level uses a resource directory structure with entries pointing to either subdirectories or data descriptors containing the actual resource data.
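A minimal sketch of how one directory entry could be decoded, assuming the standard 8-byte entry layout from the PE format: a name/ID field followed by an offset field, each using its high bit as a flag. The names `EntryId`, `EntryTarget`, and `decode_entry` are hypothetical and not part of the existing codebase.

```rust
/// How a directory entry identifies itself (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum EntryId {
    /// Numeric ID, e.g. 16 for RT_VERSION at the type level.
    Id(u32),
    /// Offset, relative to the start of the resource section, of a UTF-16 name.
    NameOffset(u32),
}

/// What a directory entry points to (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum EntryTarget {
    /// Offset of a child resource directory.
    Subdirectory(u32),
    /// Offset of a data descriptor (the leaf of the tree).
    Data(u32),
}

const HIGH_BIT: u32 = 0x8000_0000;

/// Read a little-endian u32 at `off`, or None if out of bounds.
fn read_u32(b: &[u8], off: usize) -> Option<u32> {
    let s = b.get(off..off.checked_add(4)?)?;
    Some(u32::from_le_bytes([s[0], s[1], s[2], s[3]]))
}

/// Decode one 8-byte directory entry; returns None if fewer than 8 bytes remain.
fn decode_entry(bytes: &[u8]) -> Option<(EntryId, EntryTarget)> {
    let name = read_u32(bytes, 0)?;
    let offset = read_u32(bytes, 4)?;
    let id = if name & HIGH_BIT != 0 {
        EntryId::NameOffset(name & !HIGH_BIT)
    } else {
        EntryId::Id(name)
    };
    let target = if offset & HIGH_BIT != 0 {
        EntryTarget::Subdirectory(offset & !HIGH_BIT)
    } else {
        EntryTarget::Data(offset)
    };
    Some((id, target))
}
```

Both offsets are relative to the start of the resource section, not to the file, so a caller would add them to the section's base before reading further.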
Why This Matters
- Windows APIs heavily favor UTF-16 encoding; resources are a primary source of UTF-16 strings
- Version information provides high-confidence metadata (company names, product names)
- String tables contain localized content, useful for malware analysis and application fingerprinting
- Manifests reveal application capabilities, dependencies, and security settings
- These strings are typically high-value and should receive strong ranking scores
Current Implementation Status
✅ Completed:
- Basic PE parsing via the goblin crate
- Section classification that identifies .rsrc sections
- SectionType::Resources enumeration
- StringSource::ResourceString type defined
❌ Missing:
- Resource directory tree parsing
- VERSIONINFO structure parsing and string extraction
- STRINGTABLE entry enumeration and text extraction
- Manifest XML parsing
- UTF-16LE decoding for resource strings
Proposed Solution
Implementation Approach
Phase 1: Resource Directory Parsing
- Locate the .rsrc section in the PE file
- Parse resource directory header and entry tables
- Traverse three-level tree (Type → Name → Language)
- Collect data descriptors for target resource types
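The traversal in Phase 1 might look roughly like the sketch below. It assumes the standard layout (a 16-byte directory header with the two entry counts at offsets 12 and 14, followed by 8-byte entries, and a 16-byte data descriptor starting with an RVA and a size). `ResourceLeaf`, `walk`, and `collect_leaves` are hypothetical names, and this simplified version skips entries identified by name rather than by numeric ID.

```rust
/// One data descriptor reached through a Type -> Name -> Language path
/// (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
struct ResourceLeaf {
    type_id: u32,
    name_id: u32,
    lang_id: u32,
    data_rva: u32,
    size: u32,
}

const HIGH_BIT: u32 = 0x8000_0000;

fn read_u16(b: &[u8], off: usize) -> Option<u16> {
    let s = b.get(off..off.checked_add(2)?)?;
    Some(u16::from_le_bytes([s[0], s[1]]))
}

fn read_u32(b: &[u8], off: usize) -> Option<u32> {
    let s = b.get(off..off.checked_add(4)?)?;
    Some(u32::from_le_bytes([s[0], s[1], s[2], s[3]]))
}

/// Walk the directory at `dir_off` inside the resource section bytes `rsrc`.
/// `path` holds the IDs seen so far; recursion stops at three levels, which
/// also guards against directories that point back at themselves.
fn walk(rsrc: &[u8], dir_off: usize, path: &mut Vec<u32>, out: &mut Vec<ResourceLeaf>) -> Option<()> {
    let named = read_u16(rsrc, dir_off.checked_add(12)?)? as usize;
    let ids = read_u16(rsrc, dir_off.checked_add(14)?)? as usize;
    for i in 0..named + ids {
        let entry_off = dir_off.checked_add(16 + i * 8)?;
        let name = read_u32(rsrc, entry_off)?;
        let target = read_u32(rsrc, entry_off + 4)?;
        if name & HIGH_BIT != 0 {
            continue; // named entry: skipped in this sketch
        }
        path.push(name);
        if target & HIGH_BIT != 0 {
            if path.len() < 3 {
                let _ = walk(rsrc, (target & !HIGH_BIT) as usize, path, out);
            }
        } else if path.len() == 3 {
            let d = target as usize;
            if let (Some(rva), Some(size)) = (read_u32(rsrc, d), read_u32(rsrc, d + 4)) {
                out.push(ResourceLeaf {
                    type_id: path[0],
                    name_id: path[1],
                    lang_id: path[2],
                    data_rva: rva,
                    size,
                });
            }
        }
        path.pop();
    }
    Some(())
}

/// Collect every leaf reachable from the root directory at offset 0.
fn collect_leaves(rsrc: &[u8]) -> Vec<ResourceLeaf> {
    let mut out = Vec::new();
    let _ = walk(rsrc, 0, &mut Vec::new(), &mut out);
    out
}
```

A caller would then filter the leaves by `type_id` (16, 6, or 24 for the resource types targeted here) and translate each `data_rva` to a file offset before reading the payload.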
Phase 2: VERSIONINFO Extraction
- Identify RT_VERSION (type 16) resources
- Parse VS_VERSIONINFO structure
- Extract StringFileInfo blocks:
  - FileDescription
  - ProductName
  - CompanyName
  - LegalCopyright
  - FileVersion
  - ProductVersion
  - InternalName
  - OriginalFilename
- Handle multiple language blocks if present
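As a rough sketch of Phase 2, the version resource can be treated as a tree of blocks that share one header shape: a 16-bit total length, a 16-bit value length, a 16-bit type, and a NUL-terminated UTF-16 key, with the value and the children each padded to a 32-bit boundary. The function names below (`walk_block`, `version_strings`) are hypothetical. As a simplification, this version reads each string value up to its NUL terminator instead of trusting the value-length field.

```rust
fn read_u16(b: &[u8], off: usize) -> Option<u16> {
    let s = b.get(off..off.checked_add(2)?)?;
    Some(u16::from_le_bytes([s[0], s[1]]))
}

/// Round an offset up to the next 32-bit boundary.
fn align4(off: usize) -> usize {
    (off + 3) & !3
}

/// Read a NUL-terminated UTF-16LE string within data[start..end].
/// Returns the text and the offset just past the terminator.
/// The caller must ensure end <= data.len().
fn read_wstr(data: &[u8], start: usize, end: usize) -> (String, usize) {
    let mut units = Vec::new();
    let mut pos = start;
    while pos + 2 <= end {
        let u = u16::from_le_bytes([data[pos], data[pos + 1]]);
        pos += 2;
        if u == 0 {
            break;
        }
        units.push(u);
    }
    (String::from_utf16_lossy(&units), pos)
}

/// Walk one block. Depth 0 is VS_VERSION_INFO, 1 is StringFileInfo or
/// VarFileInfo, 2 is a per-language string table, 3 is a key/value string.
fn walk_block(
    data: &[u8],
    start: usize,
    depth: u8,
    in_strings: bool,
    out: &mut Vec<(String, String)>,
) -> Option<usize> {
    let len = read_u16(data, start)? as usize;
    let value_len = read_u16(data, start + 2)? as usize;
    let wtype = read_u16(data, start + 4)?;
    let end = start.checked_add(len)?;
    if len < 6 || end > data.len() {
        return None; // malformed or truncated block
    }
    let (key, after_key) = read_wstr(data, start + 6, end);
    let value_pos = align4(after_key);
    if in_strings && depth == 3 {
        let (value, _) = read_wstr(data, value_pos, end);
        out.push((key, value));
        return Some(end);
    }
    // Text values count UTF-16 units; binary values count bytes.
    let value_bytes = if wtype == 1 { value_len * 2 } else { value_len };
    let children_in_strings = in_strings || (depth == 1 && key == "StringFileInfo");
    let mut child = align4(value_pos + value_bytes);
    while depth < 3 && child + 6 <= end {
        let child_end = walk_block(data, child, depth + 1, children_in_strings, out)?;
        child = align4(child_end);
    }
    Some(end)
}

/// Return every (key, value) pair found under StringFileInfo.
fn version_strings(data: &[u8]) -> Vec<(String, String)> {
    let mut out = Vec::new();
    let _ = walk_block(data, 0, 0, false, &mut out);
    out
}
```

Because every string table under StringFileInfo is walked, resources with multiple language blocks yield one set of pairs per block. Pairs collected before a malformed block are still returned.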
Phase 3: STRINGTABLE Extraction
- Identify RT_STRING (type 6) resources
- Enumerate string table blocks (each block ID covers 16 consecutive string IDs)
- Extract individual strings (each block holds 16 length-prefixed UTF-16 entries; unused slots have length 0)
- Associate strings with language IDs for locale awareness
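A minimal sketch of decoding one string table block, assuming the usual layout: 16 entries, each a 16-bit length in UTF-16 code units followed by that many units with no terminator. The string ID is derived from the block ID as `(block_id - 1) * 16 + index`. The function name `parse_string_block` is hypothetical.

```rust
/// Decode one RT_STRING block into (string ID, text) pairs.
/// Empty slots (length 0) are skipped; decoding stops at truncated data.
fn parse_string_block(block_id: u32, data: &[u8]) -> Vec<(u32, String)> {
    let mut out = Vec::new();
    let mut pos = 0usize;
    for index in 0..16u32 {
        // Length prefix, counted in UTF-16 code units.
        let len = match data.get(pos..pos + 2) {
            Some(b) => u16::from_le_bytes([b[0], b[1]]) as usize,
            None => break,
        };
        pos += 2;
        let bytes = match data.get(pos..pos + len * 2) {
            Some(b) => b,
            None => break,
        };
        pos += len * 2;
        if len == 0 {
            continue;
        }
        let units: Vec<u16> = bytes
            .chunks_exact(2)
            .map(|c| u16::from_le_bytes([c[0], c[1]]))
            .collect();
        // Block IDs start at 1, so block 1 covers string IDs 0..=15.
        let id = block_id
            .saturating_sub(1)
            .saturating_mul(16)
            .saturating_add(index);
        out.push((id, String::from_utf16_lossy(&units)));
    }
    out
}
```

The language ID comes from the third level of the resource tree, not from the block itself, so a caller would attach it to each returned pair.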
Phase 4: Manifest Parsing
- Identify RT_MANIFEST (type 24) resources
- Extract XML content (typically UTF-8 or UTF-16)
- Parse key XML elements:
  - Assembly identity (name, version, architecture)
  - Dependency information
  - Compatibility sections
  - Security settings (requestedExecutionLevel)
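The sketch below illustrates the decoding step of Phase 4 using only the standard library: it picks UTF-8 or UTF-16LE from a byte-order mark and falls back to UTF-8 when none is present. The attribute lookup is a deliberately naive substring search used as a stand-in for a real XML parser such as quick-xml. It does not handle namespaces, single-quoted attributes, or comments. Both function names are hypothetical.

```rust
/// Decode manifest bytes to text. A UTF-8 or UTF-16LE byte-order mark decides
/// the encoding; without one, the data is assumed to be UTF-8.
fn decode_manifest(data: &[u8]) -> String {
    if data.starts_with(&[0xEF, 0xBB, 0xBF]) {
        String::from_utf8_lossy(&data[3..]).into_owned()
    } else if data.starts_with(&[0xFF, 0xFE]) {
        let units: Vec<u16> = data[2..]
            .chunks_exact(2)
            .map(|c| u16::from_le_bytes([c[0], c[1]]))
            .collect();
        String::from_utf16_lossy(&units)
    } else {
        String::from_utf8_lossy(data).into_owned()
    }
}

/// Naive lookup of a double-quoted attribute value on the first element whose
/// opening tag starts with `tag`. Illustration only, not a real XML parser.
fn attribute_of(xml: &str, tag: &str, attr: &str) -> Option<String> {
    let open = format!("<{}", tag);
    let start = xml.find(&open)?;
    let rest = &xml[start..];
    let elem = &rest[..rest.find('>')?];
    let needle = format!("{}=\"", attr);
    let vstart = elem.find(&needle)? + needle.len();
    let vlen = elem[vstart..].find('"')?;
    Some(elem[vstart..vstart + vlen].to_string())
}
```

Even if structured parsing fails on a malformed manifest, the decoded text can still be emitted as a resource string, so the decoding step is useful on its own.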
Phase 5: Integration
- Add extracted strings to results with StringSource::ResourceString
- Tag appropriately (version_info, string_table, manifest)
- Apply high ranking scores (resources are high-confidence data)
- Preserve resource type, ID, and language metadata
Key Technical Considerations
- Encoding: Resources use UTF-16LE primarily; manifest may be UTF-8
- RVA Translation: Resource data descriptors use RVAs that must be translated to file offsets
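- Alignment: Resource structures are padded (for example, VERSIONINFO members are aligned to 32-bit boundaries); skip padding rather than treating it as data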
- Malformed Data: Handle corrupted or malicious resource structures gracefully
- goblin Limitations: May need to manually parse resource structures if goblin's PE resource support is limited
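For the RVA translation point above, a minimal sketch under stated assumptions: the `Section` struct is a hypothetical stand-in whose fields mirror the PE section header, and `rva_to_offset` is a hypothetical helper. It maps an RVA to a file offset only when the address falls inside bytes that are actually backed by file data.

```rust
/// Minimal stand-in for a PE section header (hypothetical type; field names
/// follow the PE section header fields).
#[derive(Debug, Clone, Copy)]
struct Section {
    virtual_address: u32,
    virtual_size: u32,
    pointer_to_raw_data: u32,
    size_of_raw_data: u32,
}

/// Translate an RVA to a file offset, or None if no section backs it.
fn rva_to_offset(sections: &[Section], rva: u32) -> Option<u32> {
    sections.iter().find_map(|s| {
        // Distance into the section; None if the RVA lies before it.
        let delta = rva.checked_sub(s.virtual_address)?;
        // Only bytes present in the file can be read. A zero virtual size
        // is treated as "use the raw size".
        let limit = if s.virtual_size == 0 {
            s.size_of_raw_data
        } else {
            s.virtual_size.min(s.size_of_raw_data)
        };
        if delta < limit {
            s.pointer_to_raw_data.checked_add(delta)
        } else {
            None
        }
    })
}
```

Returning `None` rather than panicking fits the malformed-data point as well: a descriptor whose RVA lands outside every section is dropped instead of causing an out-of-bounds read.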
Dependencies
- Blocked by: PE Resource Extraction Foundation (assumed to provide resource directory traversal utilities)
- goblin crate: Already integrated for PE parsing
- Potential additions:
  - quick-xml or similar for manifest parsing
  - Custom resource structure parsers if goblin is insufficient
Acceptance Criteria
Testing Strategy
- Unit Tests:
  - Resource directory parsing with crafted binary blobs
  - VERSIONINFO structure parsing
  - STRINGTABLE entry extraction
  - Manifest XML parsing
  - UTF-16LE decoding edge cases
- Integration Tests:
  - Real PE binaries (Windows system utilities)
  - Known malware samples (if available in test corpus)
  - PE files with multiple languages
  - PE files with missing or corrupted resources
- Regression Tests:
  - Ensure non-resource string extraction still works
  - Verify performance doesn't degrade significantly
Implementation Plan
- Research goblin's PE resource support capabilities
- Implement resource directory traversal utility
- Implement VERSIONINFO parser
- Implement STRINGTABLE parser
- Implement manifest parser
- Integrate with main extraction pipeline
- Add comprehensive tests
- Update documentation
Related Issues
- Requirement version: 1.2
- Task-ID: stringy-analyzer/pe-resource-string-extraction
References
@traycerai branch:3-implement-pe-section-classification-and-importexport-table-parsing