Extract Strings from PE Resource Directory (Version Info, Manifests, String Tables, Dialogs, Menus)

## Context

Windows PE (Portable Executable) binaries contain a `.rsrc` section with a hierarchical resource directory that holds rich metadata and embedded strings. Stringy does not currently extract these resources, so it misses:

- **Version Information**: Product names, company names, file descriptions, copyright strings
- **Manifest Files**: Embedded XML manifests with assembly information and dependencies
- **String Tables**: Localized strings stored in `STRINGTABLE` resources
- **Dialog Boxes**: Control text, window titles, and UI strings
- **Menu Definitions**: Menu item text

The PE resource directory follows a three-level tree structure:
1. **Type**: Resource type (RT_VERSION, RT_MANIFEST, RT_STRING, RT_DIALOG, etc.)
2. **Name**: Resource identifier (numeric ID or string name)
3. **Language**: Language identifier (16-bit LANGID)

Resources commonly use UTF-16LE encoding, requiring proper string decoding beyond ASCII/UTF-8 extraction.

## Proposed Solution

### Implementation Approach

Extend the PE container parser (`src/container/pe.rs`) to parse the resource directory structure and extract strings from high-value resource types:

#### 1. Resource Directory Parsing

- Locate the `.rsrc` section using existing section enumeration
- Parse the three-level resource directory tree (Type → Name → Language)
- Navigate resource data entries to locate actual resource data
- Handle both numeric and string resource identifiers
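As a rough illustration of the directory walk, the sketch below reads the entries of one directory table from the raw `.rsrc` bytes. It assumes the layout described in the PE specification (a 16-byte table header followed by 8-byte entries whose high bits flag "name is a string" and "offset points at a subdirectory"). `DirEntry` and `read_dir_entries` are hypothetical names, not existing Stringy or `goblin` APIs.

```rust
/// One entry of a resource directory table (8 bytes on disk).
#[derive(Debug, PartialEq)]
pub struct DirEntry {
    /// Numeric ID, or `None` when the entry is identified by a string name.
    pub id: Option<u32>,
    /// Offset relative to the start of the resource section.
    pub offset: u32,
    /// True when `offset` points at another directory table, not a data entry.
    pub is_subdir: bool,
}

fn read_u16(buf: &[u8], off: usize) -> Option<u16> {
    let b = buf.get(off..off.checked_add(2)?)?;
    Some(u16::from_le_bytes([b[0], b[1]]))
}

fn read_u32(buf: &[u8], off: usize) -> Option<u32> {
    let b = buf.get(off..off.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Reads all entries of the directory table at `table_off` inside `rsrc`.
/// Returns `None` instead of panicking when the table is truncated.
pub fn read_dir_entries(rsrc: &[u8], table_off: usize) -> Option<Vec<DirEntry>> {
    // The entry counts live at offsets 12 and 14 of the 16-byte table header.
    let named = read_u16(rsrc, table_off.checked_add(12)?)? as usize;
    let ids = read_u16(rsrc, table_off.checked_add(14)?)? as usize;
    let first_entry = table_off.checked_add(16)?;
    let mut entries = Vec::with_capacity(named + ids);
    for i in 0..named + ids {
        let entry_off = first_entry.checked_add(i.checked_mul(8)?)?;
        let name = read_u32(rsrc, entry_off)?;
        let data = read_u32(rsrc, entry_off.checked_add(4)?)?;
        entries.push(DirEntry {
            id: if name & 0x8000_0000 == 0 { Some(name) } else { None },
            offset: data & 0x7FFF_FFFF,
            is_subdir: data & 0x8000_0000 != 0,
        });
    }
    Some(entries)
}
```

Calling this once per level (Type, then Name, then Language) and following `offset` each time would give a walk with a fixed depth of three, which avoids unbounded recursion on malformed input.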

#### 2. Resource Type Handlers

Implement extraction for the following Win32 resource types:

**RT_VERSION (Type 16)** - Version Information
- Parse `VS_VERSIONINFO` structure
- Extract `StringFileInfo` fields:
  - `FileDescription`
  - `ProductName`
  - `CompanyName`
  - `LegalCopyright`
  - `FileVersion`
  - `ProductVersion`
  - `InternalName`
  - `OriginalFilename`
- Tag with `Version` and `Resource` tags
- Assign high relevance scores (8-10)
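A version resource is a tree of nodes that share one header shape: `wLength`, `wValueLength`, `wType`, a null-terminated UTF-16 key, then a 32-bit-aligned value and children. The sketch below is one possible way to collect the key/value pairs of all text nodes without recursion; `version_strings` and its helpers are hypothetical names, and the code assumes the resource data starts on a 4-byte boundary. Lengths are clamped to the enclosing node so an oversized `wValueLength` cannot read past it.

```rust
/// Reads a little-endian u16; callers guarantee `off + 2 <= d.len()`.
fn u16_at(d: &[u8], off: usize) -> u16 {
    u16::from_le_bytes([d[off], d[off + 1]])
}

/// Rounds `n` up to the next multiple of four.
fn align4(n: usize) -> usize {
    (n + 3) & !3
}

/// Decodes UTF-16LE up to the first NUL; also returns the number of code units read.
fn utf16_until_nul(d: &[u8]) -> (String, usize) {
    let units: Vec<u16> = d
        .chunks_exact(2)
        .map(|c| u16::from_le_bytes([c[0], c[1]]))
        .take_while(|&u| u != 0)
        .collect();
    (String::from_utf16_lossy(&units), units.len())
}

/// Collects `(key, value)` pairs from every text node that carries a value,
/// e.g. `("CompanyName", "Acme")`. Malformed nodes end the current level.
pub fn version_strings(data: &[u8]) -> Vec<(String, String)> {
    let mut out = Vec::new();
    // Each pending item is a byte range that holds a run of sibling nodes.
    let mut pending = vec![(0usize, data.len())];
    while let Some((mut off, end)) = pending.pop() {
        while off + 6 <= end {
            let len = u16_at(data, off) as usize;
            let val_len = u16_at(data, off + 2) as usize;
            let is_text = u16_at(data, off + 4) == 1;
            let node_end = off + len;
            if len < 6 || node_end > end {
                break;
            }
            let (key, key_units) = utf16_until_nul(&data[off + 6..node_end]);
            let val_start = align4(off + 6 + (key_units + 1) * 2).min(node_end);
            // Text values are counted in UTF-16 code units, binary values in bytes.
            let val_bytes = if is_text { val_len * 2 } else { val_len };
            let val_end = (val_start + val_bytes).min(node_end);
            if is_text && val_end > val_start {
                out.push((key, utf16_until_nul(&data[val_start..val_end]).0));
            } else {
                let child_start = align4(val_end).min(node_end);
                if child_start < node_end {
                    pending.push((child_start, node_end));
                }
            }
            off = align4(node_end);
        }
    }
    out
}
```

Filtering the returned keys against the field list above (`FileDescription`, `ProductName`, and so on) would be a separate step; the sketch deliberately returns every text value it finds.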

**RT_MANIFEST (Type 24)** - Application Manifest
- Extract embedded XML manifest
- Parse manifest for assembly names, dependencies, and configuration strings
- Tag with `Manifest` and `Resource` tags
- Assign high relevance scores (7-9)

**RT_STRING (Type 6)** - String Tables
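- Parse `STRINGTABLE` blocks (16 strings per block; string ID = (block ID - 1) * 16 + index within the block)
- Extract length-prefixed UTF-16LE strings (each entry is a 16-bit count of UTF-16 code units followed by the text; entries are not null-terminated, and a zero count marks an unused slot)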
- Tag with `Resource` tag
- Assign medium-high relevance scores (6-8)
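Each `RT_STRING` block stores 16 slots, and each slot is a 16-bit length (in UTF-16 code units) followed by that many code units. A minimal sketch of decoding one block, assuming the block ID comes from the Name level of the resource tree; `parse_string_block` is a hypothetical helper name:

```rust
/// Decodes one RT_STRING block into `(string_id, text)` pairs.
/// Empty slots are skipped; a truncated block yields what was read so far.
pub fn parse_string_block(block_id: u16, data: &[u8]) -> Vec<(u32, String)> {
    let mut out = Vec::new();
    let mut pos = 0usize;
    for index in 0..16u32 {
        // Slot header: number of UTF-16 code units that follow.
        let len = match data.get(pos..pos + 2) {
            Some(b) => u16::from_le_bytes([b[0], b[1]]) as usize,
            None => break,
        };
        pos += 2;
        let raw = match data.get(pos..pos + len * 2) {
            Some(raw) => raw,
            None => break,
        };
        pos += len * 2;
        if len == 0 {
            continue; // unused slot
        }
        let units: Vec<u16> = raw
            .chunks_exact(2)
            .map(|c| u16::from_le_bytes([c[0], c[1]]))
            .collect();
        // Block IDs are 1-based, so block 1 holds string IDs 0..=15.
        let id = u32::from(block_id).saturating_sub(1) * 16 + index;
        out.push((id, String::from_utf16_lossy(&units)));
    }
    out
}
```

Keeping the reconstructed string ID alongside the text may be useful later, since code that loads the string typically refers to it by that ID.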

**RT_DIALOG (Type 5)** - Dialog Box Templates
- Parse the `DLGTEMPLATE` header and the `DLGITEMTEMPLATE` entries that follow it (`DLGTEMPLATEEX` and `DLGITEMTEMPLATEEX` for extended dialogs)
- Extract window titles and control text
- Handle both `DIALOG` and `DIALOGEX` formats
- Tag with `Resource` tag
- Assign medium relevance scores (5-7)

**RT_MENU (Type 4)** - Menu Resources
- Parse the `MENUITEMTEMPLATEHEADER` and the `MENUITEMTEMPLATE` structures that follow it
- Extract menu item text strings
- Tag with `Resource` tag
- Assign medium relevance scores (5-7)
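For the classic (non-`MENUEX`) format, the item list can be walked linearly: each item is an option word, an ID word that is omitted for popup items, and a null-terminated UTF-16 string, with `MF_END` marking the last item of a level. The sketch below shows one way to do that while tracking nesting with an explicit stack; `menu_strings` is a hypothetical name, and extended menu templates are simply rejected here.

```rust
const MF_POPUP: u16 = 0x0010;
const MF_END: u16 = 0x0080;

/// Extracts item text from a classic menu template. Separators (empty text)
/// are skipped, and a truncated template yields the items read so far.
pub fn menu_strings(data: &[u8]) -> Vec<String> {
    let mut out = Vec::new();
    let word = |pos: usize| -> Option<u16> {
        let b = data.get(pos..pos.checked_add(2)?)?;
        Some(u16::from_le_bytes([b[0], b[1]]))
    };
    // Header: version (0 for the classic format) and offset to the first item.
    let mut pos = match (word(0), word(2)) {
        (Some(0), Some(skip)) => 4 + skip as usize,
        _ => return out,
    };
    // One flag per open level: was the popup that opened it the last item of its parent?
    let mut open_levels = vec![false];
    while !open_levels.is_empty() {
        let option = match word(pos) {
            Some(o) => o,
            None => break,
        };
        pos += 2;
        if option & MF_POPUP == 0 {
            pos += 2; // skip the item ID, which popup items do not have
        }
        let mut units = Vec::new();
        loop {
            match word(pos) {
                Some(0) => {
                    pos += 2;
                    break;
                }
                Some(u) => {
                    units.push(u);
                    pos += 2;
                }
                None => return out, // string runs past the end of the data
            }
        }
        if !units.is_empty() {
            out.push(String::from_utf16_lossy(&units));
        }
        if option & MF_POPUP != 0 {
            open_levels.push(option & MF_END != 0);
        } else if option & MF_END != 0 {
            // Close this level, plus any parents whose opening popup was also last.
            while let Some(parent_was_last) = open_levels.pop() {
                if !parent_was_last {
                    break;
                }
            }
        }
    }
    out
}
```

Tracking the open levels lets the walk stop at the real end of the menu, so trailing padding or unrelated bytes after the template are not misread as items.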

#### 3. String Encoding

- Implement UTF-16LE decoding (resources typically use UTF-16)
- Handle both null-terminated and length-prefixed wide strings (`STRINGTABLE` entries are length-prefixed)
- Convert to UTF-8 for storage in `ExtractedString`
- Integrate with existing encoding detection where encodings are mixed (for example, manifests are usually UTF-8 XML rather than UTF-16)
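A minimal sketch of the decoding step for null-terminated wide strings, using only the standard library. `decode_utf16le_z` is a hypothetical helper name; it stops at the first NUL code unit, ignores a trailing odd byte, and replaces unpaired surrogates with U+FFFD instead of failing.

```rust
/// Decodes a UTF-16LE byte slice into a UTF-8 `String`, stopping at the first NUL.
pub fn decode_utf16le_z(bytes: &[u8]) -> String {
    let units: Vec<u16> = bytes
        .chunks_exact(2) // a dangling final byte is dropped
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .take_while(|&unit| unit != 0)
        .collect();
    // Lossy conversion keeps extraction going on malformed input.
    String::from_utf16_lossy(&units)
}
```

Lossy decoding fits the goal of treating resource extraction as non-blocking: a damaged string still produces output that can be scored or discarded later.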

#### 4. Integration Points

- Add resource extraction method to `PeParser` implementation
- Call during container parsing in `parse()` method
- Store extracted strings with `StringSource::ResourceString`
- Apply semantic tags: `Resource`, `Version`, `Manifest`
- Integrate into existing section weighting/scoring system
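One way to keep the tags and score ranges from the handler list in a single place is a small lookup keyed by resource type ID. The sketch below is illustrative only: `Tag`, `ResourcePolicy`, and `policy_for` are hypothetical names and are not Stringy's actual tag or scoring types.

```rust
/// Resource type IDs handled by this proposal.
pub const RT_MENU: u32 = 4;
pub const RT_DIALOG: u32 = 5;
pub const RT_STRING: u32 = 6;
pub const RT_VERSION: u32 = 16;
pub const RT_MANIFEST: u32 = 24;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tag {
    Resource,
    Version,
    Manifest,
}

/// Tags and relevance score range applied to strings from one resource type.
#[derive(Debug, PartialEq)]
pub struct ResourcePolicy {
    pub tags: &'static [Tag],
    pub min_score: u8,
    pub max_score: u8,
}

/// Returns the policy for a resource type, or `None` for types that are not extracted.
pub fn policy_for(type_id: u32) -> Option<ResourcePolicy> {
    match type_id {
        RT_VERSION => Some(ResourcePolicy {
            tags: &[Tag::Version, Tag::Resource],
            min_score: 8,
            max_score: 10,
        }),
        RT_MANIFEST => Some(ResourcePolicy {
            tags: &[Tag::Manifest, Tag::Resource],
            min_score: 7,
            max_score: 9,
        }),
        RT_STRING => Some(ResourcePolicy {
            tags: &[Tag::Resource],
            min_score: 6,
            max_score: 8,
        }),
        RT_DIALOG | RT_MENU => Some(ResourcePolicy {
            tags: &[Tag::Resource],
            min_score: 5,
            max_score: 7,
        }),
        _ => None,
    }
}
```

Returning `None` for unknown types gives the directory walk a cheap way to skip resources (icons, cursors, and so on) that carry no strings of interest.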

#### 5. Library Consideration

**Option A**: Use `pelite` crate (specialized PE parser with resource support)
- Pros: Handles resource parsing complexity, well-tested
- Cons: Adds dependency, may need integration work

**Option B**: Extend current `goblin`-based parser manually
- Pros: No new dependencies, full control
- Cons: More complex implementation, need to handle edge cases

**Recommendation**: Start with Option B (manual parsing) leveraging existing `goblin::pe::PE` structures, add `pelite` later if complexity warrants.

## Acceptance Criteria

- [ ] Parse PE resource directory three-level tree structure (Type → Name → Language)
- [ ] Extract version info fields (`FileDescription`, `ProductName`, `CompanyName`, etc.)
- [ ] Extract manifest XML strings from embedded manifests
- [ ] Handle `STRINGTABLE` resources with UTF-16LE decoding
- [ ] Extract dialog box template strings (window titles, control text)
- [ ] Extract menu definition strings
- [ ] Apply appropriate semantic tags (`Resource`, `Version`, `Manifest`)
- [ ] Include resource strings in scoring/ranking system with appropriate weights
- [ ] Add unit tests for resource directory parsing
- [ ] Add integration tests using real PE binaries with resources
- [ ] Handle edge cases: missing resources, malformed structures, mixed encodings
- [ ] Document resource extraction in `docs/src/binary-formats.md`

## Implementation Notes

- Resource extraction should be optional/non-blocking (gracefully handle parse failures)
- Maintain performance: avoid deep recursion, limit resource tree depth
- Consider adding CLI flag `--resources` to enable/disable resource extraction
- Test with real-world Windows binaries (e.g., `notepad.exe`, `calc.exe`)

## Related Issues

- #4 - Add pelite dependency for PE resource extraction
- #5 - Implement PE Resource String Extraction (VERSIONINFO, STRINGTABLE, Manifests)
- #56 - Extract Strings from PE Resource Directory

**Note**: This issue consolidates and supersedes #5 and #56 with a more comprehensive approach.

## References

- Parent Epic: [Epic: v0.2 - PE Resources, Symbol Demangling & Import/Export Enhancement](https://github.com/EvilBit-Labs/StringyMcStringFace/issues/40)
- [Microsoft PE Format Specification](https://learn.microsoft.com/en-us/windows/win32/debug/pe-format#the-rsrc-section)
- [Resource Types Documentation](https://learn.microsoft.com/en-us/windows/win32/menurc/resource-types)

