Implement PE Resource String Extraction (VERSIONINFO, STRINGTABLE, Manifests)

## Summary

Extract strings from PE (Portable Executable) resource sections including VERSIONINFO, STRINGTABLE, and manifest resources. This feature will significantly improve string extraction quality from Windows executables by tapping into the rich metadata and localized string resources that PE files contain.

## Background

PE resources (stored in the `.rsrc` section) are a hierarchical data structure containing various resource types that are rich sources of meaningful strings:

- **VERSIONINFO (RT_VERSION)**: Product names, file descriptions, company names, copyright information, version strings
- **STRINGTABLE (RT_STRING)**: Localized UI strings, error messages, labels organized by language/locale
- **RT_MANIFEST**: Application manifests containing assembly information, compatibility settings, DPI awareness, and UAC requirements
- **Dialog Resources (RT_DIALOG)**: Future enhancement - control captions and labels

### PE Resource Structure

PE resources are organized as a three-level tree:
1. **Type**: Resource type (RT_VERSION, RT_STRING, RT_MANIFEST, etc.)
2. **Name/ID**: Resource identifier (numeric ID or string name)
3. **Language**: 16-bit language identifier (LANGID: primary language plus sublanguage)

Each level uses a resource directory structure with entries pointing to either subdirectories or data descriptors containing the actual resource data.
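As a rough illustration of the directory layout described above, the following sketch decodes one directory table from a raw `.rsrc` byte slice. It assumes the documented on-disk layout (a 16-byte header whose last two `u16` fields are the named-entry and ID-entry counts, followed by 8-byte entries). The names `DirEntry` and `parse_directory` are hypothetical and not part of goblin or this project.

```rust
/// One entry of a resource directory table (8 bytes on disk).
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
pub struct DirEntry {
    /// Numeric ID, or offset of a name string when `is_named` is true.
    pub id_or_name_offset: u32,
    pub is_named: bool,
    /// Offset of the target, relative to the start of the resource section.
    pub target_offset: u32,
    /// True when the target is another directory, false for a data entry.
    pub is_subdirectory: bool,
}

fn read_u16(buf: &[u8], off: usize) -> Option<u16> {
    let b = buf.get(off..off.checked_add(2)?)?;
    Some(u16::from_le_bytes([b[0], b[1]]))
}

fn read_u32(buf: &[u8], off: usize) -> Option<u32> {
    let b = buf.get(off..off.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Parse the directory table at `dir_offset` inside the resource section.
/// Returns `None` instead of panicking when the data is truncated.
pub fn parse_directory(rsrc: &[u8], dir_offset: usize) -> Option<Vec<DirEntry>> {
    // Header: Characteristics(4) TimeDateStamp(4) Major(2) Minor(2)
    //         NumberOfNamedEntries(2) NumberOfIdEntries(2)
    let named = read_u16(rsrc, dir_offset.checked_add(12)?)? as usize;
    let ids = read_u16(rsrc, dir_offset.checked_add(14)?)? as usize;
    let count = named + ids;

    let mut entries = Vec::with_capacity(count);
    for i in 0..count {
        let entry_off = dir_offset.checked_add(16)?.checked_add(i.checked_mul(8)?)?;
        let name = read_u32(rsrc, entry_off)?;
        let data = read_u32(rsrc, entry_off.checked_add(4)?)?;
        entries.push(DirEntry {
            // The high bit of each field is a flag; the low 31 bits carry the value.
            id_or_name_offset: name & 0x7FFF_FFFF,
            is_named: (name & 0x8000_0000) != 0,
            target_offset: data & 0x7FFF_FFFF,
            is_subdirectory: (data & 0x8000_0000) != 0,
        });
    }
    Some(entries)
}
```

Walking the full tree would call `parse_directory` once per level, following `target_offset` while `is_subdirectory` is set. A depth limit is advisable, since a malformed file can contain offsets that loop back on themselves.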

### Why This Matters

- Windows APIs heavily favor UTF-16 encoding; resources are a primary source of UTF-16 strings
- Version information provides high-confidence metadata (company names, product names)
- String tables contain localized content, useful for malware analysis and application fingerprinting
- Manifests reveal application capabilities, dependencies, and security settings
- These strings are typically high-value and should receive strong ranking scores

## Current Implementation Status

✅ **Completed**:
- Basic PE parsing via `goblin` crate
- Section classification that identifies `.rsrc` sections
- `SectionType::Resources` enumeration
- `StringSource::ResourceString` type defined

❌ **Missing**:
- Resource directory tree parsing
- VERSIONINFO structure parsing and string extraction
- STRINGTABLE entry enumeration and text extraction
- Manifest XML parsing
- UTF-16LE decoding for resource strings

## Proposed Solution

### Implementation Approach

**Phase 1: Resource Directory Parsing**
1. Locate `.rsrc` section in PE file
2. Parse resource directory header and entry tables
3. Traverse three-level tree (Type → Name → Language)
4. Collect data descriptors for target resource types

**Phase 2: VERSIONINFO Extraction**
1. Identify RT_VERSION (type 16) resources
2. Parse VS_VERSIONINFO structure
3. Extract StringFileInfo blocks:
   - FileDescription
   - ProductName
   - CompanyName
   - LegalCopyright
   - FileVersion
   - ProductVersion
   - InternalName
   - OriginalFilename
4. Handle multiple language blocks if present
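The key/value pairs listed above live in `String` structures nested inside `StringFileInfo`. A minimal sketch of decoding one such structure is shown below, assuming the documented layout: `wLength`, `wValueLength`, `wType`, a NUL-terminated UTF-16LE key, padding to a 32-bit boundary, then the value. The function name `parse_version_string` is hypothetical, and the slice is assumed to start at the beginning of the structure.

```rust
fn read_u16(buf: &[u8], off: usize) -> Option<u16> {
    let b = buf.get(off..off.checked_add(2)?)?;
    Some(u16::from_le_bytes([b[0], b[1]]))
}

/// Read a NUL-terminated UTF-16LE string starting at `off`.
/// Returns the text and the offset just past the terminator.
fn read_utf16z(buf: &[u8], mut off: usize) -> Option<(String, usize)> {
    let mut units = Vec::new();
    loop {
        let u = read_u16(buf, off)?;
        off += 2;
        if u == 0 {
            break;
        }
        units.push(u);
    }
    Some((String::from_utf16_lossy(&units), off))
}

/// Round an offset up to the next multiple of four.
fn align4(off: usize) -> usize {
    (off + 3) & !3
}

/// Decode one VERSIONINFO `String` structure into a (key, value) pair.
/// Hypothetical helper; returns `None` on truncated or non-text entries.
pub fn parse_version_string(entry: &[u8]) -> Option<(String, String)> {
    let w_length = read_u16(entry, 0)? as usize;
    let w_value_length = read_u16(entry, 2)? as usize; // UTF-16 code units
    let w_type = read_u16(entry, 4)?;
    if w_type != 1 || w_length > entry.len() {
        return None;
    }
    let entry = &entry[..w_length];

    let (key, after_key) = read_utf16z(entry, 6)?;
    let value_off = align4(after_key);
    if w_value_length == 0 {
        return Some((key, String::new()));
    }

    // Clamp to the end of the entry: some resource compilers are known to
    // write a byte count here, so a strict bounds check would reject them.
    let value_end = value_off
        .checked_add(w_value_length.checked_mul(2)?)?
        .min(entry.len());
    let value_bytes = entry.get(value_off..value_end)?;
    let units: Vec<u16> = value_bytes
        .chunks_exact(2)
        .map(|c| u16::from_le_bytes([c[0], c[1]]))
        .take_while(|&u| u != 0)
        .collect();
    Some((key, String::from_utf16_lossy(&units)))
}
```

A full parser would apply the same header decoding recursively to `VS_VERSIONINFO`, `StringFileInfo`, and each `StringTable` child, advancing by the aligned `wLength` of each block.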

**Phase 3: STRINGTABLE Extraction**
1. Identify RT_STRING (type 6) resources
2. Enumerate string table blocks (each block's resource ID is `(string_id >> 4) + 1`)
3. Extract individual strings (each block holds 16 length-prefixed UTF-16LE entries; unused slots have length 0)
4. Associate strings with language IDs for locale awareness
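The block format can be sketched as follows. Each RT_STRING block is a sequence of 16 entries, each a `u16` length (in UTF-16 code units) followed by that many code units with no terminator. The string ID is recovered from the block's resource ID and the slot index. The function name `parse_string_block` is hypothetical.

```rust
/// Decode one RT_STRING block into (string_id, text) pairs.
/// `block_id` is the numeric resource ID of the block from the Name/ID level.
/// Parsing stops early, keeping what was decoded so far, if data is truncated.
pub fn parse_string_block(block_id: u16, data: &[u8]) -> Vec<(u32, String)> {
    let mut out = Vec::new();
    let mut off = 0usize;

    for index in 0..16u32 {
        // Length prefix, counted in UTF-16 code units.
        let len = match data.get(off..off + 2) {
            Some(b) => u16::from_le_bytes([b[0], b[1]]) as usize,
            None => break,
        };
        off += 2;

        let text_bytes = match data.get(off..off + len * 2) {
            Some(b) => b,
            None => break,
        };
        off += len * 2;

        // A zero length marks an unused slot in the block.
        if len == 0 {
            continue;
        }

        let units: Vec<u16> = text_bytes
            .chunks_exact(2)
            .map(|c| u16::from_le_bytes([c[0], c[1]]))
            .collect();

        // Block N covers string IDs (N - 1) * 16 through (N - 1) * 16 + 15.
        let string_id = u32::from(block_id).saturating_sub(1) * 16 + index;
        out.push((string_id, String::from_utf16_lossy(&units)));
    }
    out
}
```

For example, a block with resource ID 7 whose first and third slots are populated yields string IDs 96 and 98.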

**Phase 4: Manifest Parsing**
1. Identify RT_MANIFEST (type 24) resources
2. Extract XML content (typically UTF-8 or UTF-16)
3. Parse key XML elements:
   - Assembly identity (name, version, architecture)
   - Dependency information
   - Compatibility sections
   - Security settings (requestedExecutionLevel)
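Before any XML parsing can happen, the manifest bytes need to be decoded. The sketch below covers only that step, using a byte-order mark when present and otherwise a simple heuristic (a zero second byte suggests UTF-16LE). The heuristic and the name `decode_manifest` are assumptions for illustration. Element extraction is left to a real XML parser.

```rust
/// Decode raw RT_MANIFEST bytes into a `String`.
/// Hypothetical helper: checks for a BOM first, then guesses the encoding.
pub fn decode_manifest(data: &[u8]) -> String {
    // UTF-8 with a byte-order mark.
    if data.starts_with(&[0xEF, 0xBB, 0xBF]) {
        return String::from_utf8_lossy(&data[3..]).into_owned();
    }

    // UTF-16LE, either marked by a BOM or guessed from a zero high byte
    // in the first code unit (as in "<\0" for ASCII-range XML).
    let (is_utf16, body) = if data.starts_with(&[0xFF, 0xFE]) {
        (true, &data[2..])
    } else {
        (data.len() >= 2 && data[1] == 0, data)
    };

    if is_utf16 {
        let units: Vec<u16> = body
            .chunks_exact(2)
            .map(|c| u16::from_le_bytes([c[0], c[1]]))
            .collect();
        String::from_utf16_lossy(&units)
    } else {
        // Default: treat as UTF-8, replacing invalid sequences.
        String::from_utf8_lossy(body).into_owned()
    }
}
```

Lossy decoding is used here so that a damaged manifest still yields searchable text rather than an error; a stricter variant could report the invalid sequences instead.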

**Phase 5: Integration**
1. Add extracted strings to results with `StringSource::ResourceString`
2. Tag appropriately (version_info, string_table, manifest)
3. Apply high ranking scores (resources are high-confidence data)
4. Preserve resource type, ID, and language metadata

### Key Technical Considerations

- **Encoding**: Resources use UTF-16LE primarily; manifest may be UTF-8
- **RVA Translation**: Resource data descriptors use RVAs that must be translated to file offsets
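- **Alignment**: Resource structures are padded (for example, VERSIONINFO keys, values, and child blocks are aligned to 32-bit boundaries); skip padding when computing offsets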
- **Malformed Data**: Handle corrupted or malicious resource structures gracefully
- **goblin Limitations**: May need to manually parse resource structures if goblin's PE resource support is limited
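The RVA translation and malformed-data points above can be illustrated together. The sketch below maps an RVA to a file offset using section header fields and returns `None` for anything that does not land inside raw section data. The `Section` struct is a hypothetical stand-in for whatever section table type the PE parser exposes; only the three fields used here are assumed.

```rust
/// Minimal stand-in for a PE section header (hypothetical type).
pub struct Section {
    pub virtual_address: u32,
    pub size_of_raw_data: u32,
    pub pointer_to_raw_data: u32,
}

/// Translate an RVA into a file offset.
/// Returns `None` if no section's raw data covers the RVA, or on overflow.
pub fn rva_to_offset(sections: &[Section], rva: u32) -> Option<usize> {
    sections.iter().find_map(|s| {
        // Distance of the RVA from the start of this section, if any.
        let delta = rva.checked_sub(s.virtual_address)?;
        // Only bytes backed by raw data on disk can be read from the file.
        if delta >= s.size_of_raw_data {
            return None;
        }
        let offset = s.pointer_to_raw_data.checked_add(delta)?;
        Some(offset as usize)
    })
}
```

The returned offset still has to be checked against the file length, together with the data entry's size, before slicing, since a crafted header can point past the end of the file.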

### Dependencies

- **Blocked by**: PE Resource Extraction Foundation (assumed to provide resource directory traversal utilities)
- **goblin crate**: Already integrated for PE parsing
- **Potential additions**: 
  - `quick-xml` or similar for manifest parsing
  - Custom resource structure parsers if goblin insufficient

## Acceptance Criteria

- [ ] Extract all VERSIONINFO string fields from PE files
- [ ] Enumerate and extract strings from STRINGTABLE resources
- [ ] Parse and extract meaningful strings from RT_MANIFEST resources
- [ ] Handle multiple language/locale variants
- [ ] Correctly decode UTF-16LE resource strings
- [ ] Tag extracted strings with resource type and metadata
- [ ] Apply appropriate ranking scores (resources should score high)
- [ ] Handle malformed/corrupted resource structures without crashing
- [ ] Add unit tests with sample PE resource data
- [ ] Add integration tests with real PE binaries (e.g., `notepad.exe`, sample malware)

## Testing Strategy

1. **Unit Tests**:
   - Resource directory parsing with crafted binary blobs
   - VERSIONINFO structure parsing
   - STRINGTABLE entry extraction
   - Manifest XML parsing
   - UTF-16LE decoding edge cases

2. **Integration Tests**:
   - Real PE binaries (Windows system utilities)
   - Known malware samples (if available in test corpus)
   - PE files with multiple languages
   - PE files with missing or corrupted resources

3. **Regression Tests**:
   - Ensure non-resource string extraction still works
   - Verify performance doesn't degrade significantly

## Implementation Plan

1. Research goblin's PE resource support capabilities
2. Implement resource directory traversal utility
3. Implement VERSIONINFO parser
4. Implement STRINGTABLE parser
5. Implement manifest parser
6. Integrate with main extraction pipeline
7. Add comprehensive tests
8. Update documentation

## Related Issues

- Requirement version: 1.2
- Task-ID: stringy-analyzer/pe-resource-string-extraction

## References

- [PE Format Specification - Microsoft Learn](https://learn.microsoft.com/en-us/windows/win32/debug/pe-format)
- [Resource File Formats](https://learn.microsoft.com/en-us/windows/win32/menurc/resource-file-formats)
- [VERSIONINFO Structure](https://learn.microsoft.com/en-us/windows/win32/menurc/vs-versioninfo)
- [Application Manifests](https://learn.microsoft.com/en-us/windows/win32/sbscs/application-manifests)

@traycerai branch:3-implement-pe-section-classification-and-importexport-table-parsing
