Extract Strings from PE Resource Directory (Version Info, Manifests, String Tables, Dialogs, Menus) #57

@unclesp1d3r

Context

Windows PE (Portable Executable) binaries contain a .rsrc section with a hierarchical resource directory structure that holds rich metadata and embedded strings. Stringy does not currently extract these resources, so it misses:

  • Version Information: Product names, company names, file descriptions, copyright strings
  • Manifest Files: Embedded XML manifests with assembly information and dependencies
  • String Tables: Localized strings stored in STRINGTABLE resources
  • Dialog Boxes: Control text, window titles, and UI strings
  • Menu Definitions: Menu item text and accelerator hints

The PE resource directory follows a three-level tree structure:

  1. Type: Resource type (RT_VERSION, RT_MANIFEST, RT_STRING, RT_DIALOG, etc.)
  2. Name: Resource identifier (numeric ID or string name)
  3. Language: Language identifier (a 16-bit LANGID, such as 0x0409 for en-US)

Most resource strings are stored as UTF-16LE (manifests are usually UTF-8 XML), so extraction needs UTF-16 decoding in addition to the existing ASCII/UTF-8 scanning.

Proposed Solution

Implementation Approach

Extend the PE container parser (src/container/pe.rs) to parse the resource directory structure and extract strings from high-value resource types:

1. Resource Directory Parsing

  • Locate the .rsrc section using existing section enumeration
  • Parse the three-level resource directory tree (Type → Name → Language)
  • Navigate resource data entries to locate actual resource data
  • Handle both numeric and string resource identifiers
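
The traversal above can be sketched as follows. This is a minimal sketch, not Stringy's implementation: it assumes the caller has already sliced out the raw bytes of the resource directory (`rsrc`), and the names `ResId`, `ResourceLeaf`, and `parse_resource_tree` are hypothetical. Offsets inside the directory are relative to the start of the resource section, and a set high bit marks a name string or a subdirectory.

```rust
/// Identifier at one level of the tree: a numeric ID or a UTF-16 name.
#[derive(Debug, Clone, PartialEq)]
pub enum ResId {
    Id(u32),
    Name(String),
}

/// One leaf of the Type -> Name -> Language tree (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct ResourceLeaf {
    pub path: Vec<ResId>, // [type, name, language]
    pub data_rva: u32,    // RVA of the resource bytes, not a file offset
    pub size: u32,
}

const HIGH_BIT: u32 = 0x8000_0000;
const MAX_DEPTH: usize = 3;

fn u16_at(b: &[u8], off: usize) -> Option<u16> {
    let s = b.get(off..off.checked_add(2)?)?;
    Some(u16::from_le_bytes([s[0], s[1]]))
}

fn u32_at(b: &[u8], off: usize) -> Option<u32> {
    let s = b.get(off..off.checked_add(4)?)?;
    Some(u32::from_le_bytes([s[0], s[1], s[2], s[3]]))
}

/// Directory names are length-prefixed UTF-16LE (u16 count of code units).
fn read_name(rsrc: &[u8], off: usize) -> Option<String> {
    let len = u16_at(rsrc, off)? as usize;
    let bytes = rsrc.get(off + 2..off + 2 + len * 2)?;
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|c| u16::from_le_bytes([c[0], c[1]]))
        .collect();
    Some(String::from_utf16_lossy(&units))
}

fn walk(rsrc: &[u8], dir: usize, path: &mut Vec<ResId>, out: &mut Vec<ResourceLeaf>) -> Option<()> {
    // IMAGE_RESOURCE_DIRECTORY is 16 bytes; the entry counts sit at +12 and +14.
    let count = u16_at(rsrc, dir + 12)? as usize + u16_at(rsrc, dir + 14)? as usize;
    for i in 0..count {
        let entry = dir + 16 + i * 8;
        let name = u32_at(rsrc, entry)?;
        let target = u32_at(rsrc, entry + 4)?;
        path.push(if name & HIGH_BIT != 0 {
            ResId::Name(read_name(rsrc, (name & !HIGH_BIT) as usize)?)
        } else {
            ResId::Id(name)
        });
        if target & HIGH_BIT != 0 {
            // Subdirectory: only descend while within the three defined levels.
            if path.len() < MAX_DEPTH {
                walk(rsrc, (target & !HIGH_BIT) as usize, path, out)?;
            }
        } else {
            // IMAGE_RESOURCE_DATA_ENTRY: OffsetToData (an RVA), then Size.
            let de = target as usize;
            out.push(ResourceLeaf {
                path: path.clone(),
                data_rva: u32_at(rsrc, de)?,
                size: u32_at(rsrc, de + 4)?,
            });
        }
        path.pop();
    }
    Some(())
}

/// Returns whatever leaves were found before the first malformed structure.
pub fn parse_resource_tree(rsrc: &[u8]) -> Vec<ResourceLeaf> {
    let mut out = Vec::new();
    let _ = walk(rsrc, 0, &mut Vec::new(), &mut out);
    out
}
```

The depth cap keeps a directory that points back at itself from recursing without bound; a production version would probably also cap the total number of entries visited. `data_rva` still has to be translated to a file offset through the section table before the bytes can be read.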

2. Resource Type Handlers

Implement extraction for the following Win32 resource types:

RT_VERSION (Type 16) - Version Information

  • Parse VS_VERSIONINFO structure
  • Extract StringFileInfo fields:
    • FileDescription
    • ProductName
    • CompanyName
    • LegalCopyright
    • FileVersion
    • ProductVersion
    • InternalName
    • OriginalFilename
  • Tag with Version and Resource tags
  • Assign high relevance scores (8-10)
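
As a rough illustration of the StringFileInfo walk, the sketch below reads the nested VS_VERSIONINFO nodes, each of which starts with wLength, wValueLength, and wType followed by a null-terminated UTF-16 key padded to a 32-bit boundary. The function name `version_strings` and the `Node` type are hypothetical, and the sketch assumes `d` is the complete RT_VERSION resource data. It reads each value up to its terminator rather than trusting wValueLength, since some toolchains are known to write that field inconsistently.

```rust
struct Node {
    key: String,
    value_len: usize, // raw wValueLength
    value_off: usize, // first byte after the key and its padding
    end: usize,       // offset just past this node (start + wLength)
}

fn align4(n: usize) -> usize {
    (n + 3) & !3
}

fn u16_at(d: &[u8], off: usize) -> Option<u16> {
    let s = d.get(off..off.checked_add(2)?)?;
    Some(u16::from_le_bytes([s[0], s[1]]))
}

/// Reads UTF-16LE until a NUL or `end`; `end` must not exceed `d.len()`.
fn read_wstr(d: &[u8], mut off: usize, end: usize) -> (String, usize) {
    let mut units = Vec::new();
    while off + 2 <= end {
        let u = u16::from_le_bytes([d[off], d[off + 1]]);
        off += 2;
        if u == 0 {
            break;
        }
        units.push(u);
    }
    (String::from_utf16_lossy(&units), off)
}

fn node_at(d: &[u8], off: usize, limit: usize) -> Option<Node> {
    let len = u16_at(d, off)? as usize;
    let value_len = u16_at(d, off + 2)? as usize;
    let end = off + len;
    if len < 6 || end > limit {
        return None; // too short for a header, or overruns the parent
    }
    let (key, after_key) = read_wstr(d, off + 6, end);
    Some(Node { key, value_len, value_off: align4(after_key), end })
}

/// Collects consecutive sibling nodes starting at `off`, each 4-byte aligned.
fn children(d: &[u8], mut off: usize, end: usize) -> Vec<Node> {
    let mut out = Vec::new();
    while let Some(n) = node_at(d, align4(off), end) {
        off = n.end;
        out.push(n);
    }
    out
}

/// Returns (key, value) pairs such as ("CompanyName", "Acme").
pub fn version_strings(d: &[u8]) -> Vec<(String, String)> {
    let mut out = Vec::new();
    let root = match node_at(d, 0, d.len()) {
        Some(r) if r.key == "VS_VERSION_INFO" => r,
        _ => return out,
    };
    // The root value is VS_FIXEDFILEINFO (wValueLength bytes); skip it.
    for block in children(d, root.value_off + root.value_len, root.end) {
        if block.key != "StringFileInfo" {
            continue; // e.g. VarFileInfo
        }
        for table in children(d, block.value_off, block.end) {
            for s in children(d, table.value_off, table.end) {
                let (value, _) = read_wstr(d, s.value_off, s.end);
                out.push((s.key, value));
            }
        }
    }
    out
}
```

Returning every key found, rather than only the eight listed above, keeps vendor-specific fields available for tagging.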

RT_MANIFEST (Type 24) - Application Manifest

  • Extract embedded XML manifest
  • Parse manifest for assembly names, dependencies, and configuration strings
  • Tag with Manifest and Resource tags
  • Assign high relevance scores (7-9)
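
A minimal sketch of the manifest step, assuming the RT_MANIFEST bytes are XML that is either UTF-8 (with or without a BOM) or BOM-prefixed UTF-16LE. `decode_manifest` and `assembly_names` are hypothetical names, and the attribute scan is deliberately naive string matching (double-quoted attributes preceded by a space), not a real XML parser; a real implementation would likely want a proper XML crate.

```rust
/// Decodes manifest bytes to a String, honouring a UTF-8 or UTF-16LE BOM.
pub fn decode_manifest(data: &[u8]) -> String {
    if data.starts_with(&[0xEF, 0xBB, 0xBF]) {
        String::from_utf8_lossy(&data[3..]).into_owned()
    } else if data.starts_with(&[0xFF, 0xFE]) {
        let units: Vec<u16> = data[2..]
            .chunks_exact(2)
            .map(|c| u16::from_le_bytes([c[0], c[1]]))
            .collect();
        String::from_utf16_lossy(&units)
    } else {
        String::from_utf8_lossy(data).into_owned()
    }
}

/// Naive lookup of `key="value"` inside one tag's text.
fn attr<'a>(tag: &'a str, key: &str) -> Option<&'a str> {
    let needle = format!(" {}=\"", key);
    let start = tag.find(&needle)? + needle.len();
    let len = tag[start..].find('"')?;
    Some(&tag[start..start + len])
}

/// Collects the `name` attribute of every <assemblyIdentity> element,
/// covering both the application itself and its dependencies.
pub fn assembly_names(xml: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut rest = xml;
    while let Some(start) = rest.find("<assemblyIdentity") {
        rest = &rest[start..];
        let end = rest.find('>').unwrap_or(rest.len());
        if let Some(v) = attr(&rest[..end], "name") {
            out.push(v.to_string());
        }
        rest = &rest[end..];
    }
    out
}
```

Keeping the decode step separate from the attribute scan means the whole manifest can still be emitted as one string when the scan finds nothing.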

RT_STRING (Type 6) - String Tables

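  • Parse STRINGTABLE blocks (16 string slots per block; string ID = (block ID - 1) × 16 + slot index)
  • Extract length-prefixed UTF-16LE strings (each slot is a 16-bit character count followed by that many code units, with no null terminator; a count of 0 marks an empty slot)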
  • Tag with Resource tag
  • Assign medium-high relevance scores (6-8)
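
A sketch of decoding one RT_STRING block, assuming `data` is the raw resource data for that block and `block_id` is the numeric name from the second directory level. Each of the 16 slots is a 16-bit count of UTF-16 code units followed by the text, and the string ID is derived from the block ID. The function name is hypothetical.

```rust
/// Decodes one string-table block into (string ID, text) pairs.
/// Empty slots (length 0) are skipped; truncated data ends the scan early.
pub fn parse_string_block(block_id: u16, data: &[u8]) -> Vec<(u32, String)> {
    let mut out = Vec::new();
    let mut off = 0usize;
    for index in 0..16u32 {
        // Each slot starts with a u16 count of UTF-16 code units.
        let len = match data.get(off..off + 2) {
            Some(b) => u16::from_le_bytes([b[0], b[1]]) as usize,
            None => break,
        };
        off += 2;
        let bytes = match data.get(off..off + len * 2) {
            Some(b) => b,
            None => break,
        };
        off += len * 2;
        if len == 0 {
            continue;
        }
        let units: Vec<u16> = bytes
            .chunks_exact(2)
            .map(|c| u16::from_le_bytes([c[0], c[1]]))
            .collect();
        // Block N holds string IDs (N - 1) * 16 ..= (N - 1) * 16 + 15.
        let id = u32::from(block_id).saturating_sub(1) * 16 + index;
        out.push((id, String::from_utf16_lossy(&units)));
    }
    out
}
```

Keeping the string ID alongside the text lets later stages correlate a string with the numeric ID that code passes to LoadString.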

RT_DIALOG (Type 5) - Dialog Box Templates

  • Parse the DLGTEMPLATE header and the DLGITEMTEMPLATE entries that follow it
  • Extract window titles, control text, and class or font names
  • Handle both DIALOG and DIALOGEX formats (the latter uses DLGTEMPLATEEX and DLGITEMTEMPLATEEX)
  • Tag with Resource tag
  • Assign medium relevance scores (5-7)
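
To illustrate the DIALOG versus DIALOGEX distinction, the sketch below extracts only the dialog title. It assumes the fixed header is 18 bytes for DLGTEMPLATE and 26 bytes for DLGTEMPLATEEX (detected by dlgVer = 1 and signature = 0xFFFF), and that the menu and class fields that follow are each either 0x0000, 0xFFFF plus an ordinal, or a null-terminated UTF-16 string. `SzOrOrd` and `dialog_title` are hypothetical names, and control items are not parsed here.

```rust
/// A variable-length template field: absent, an ordinal, or text.
#[derive(Debug, PartialEq)]
pub enum SzOrOrd {
    None,
    Ordinal(u16),
    Text(String),
}

fn u16_at(d: &[u8], off: usize) -> Option<u16> {
    let s = d.get(off..off.checked_add(2)?)?;
    Some(u16::from_le_bytes([s[0], s[1]]))
}

/// Reads one field and returns it with the offset of the next field.
fn read_sz_or_ord(d: &[u8], off: usize) -> Option<(SzOrOrd, usize)> {
    match u16_at(d, off)? {
        0x0000 => Some((SzOrOrd::None, off + 2)),
        0xFFFF => Some((SzOrOrd::Ordinal(u16_at(d, off + 2)?), off + 4)),
        _ => {
            let mut units = Vec::new();
            let mut p = off;
            loop {
                let u = u16_at(d, p)?;
                p += 2;
                if u == 0 {
                    break;
                }
                units.push(u);
            }
            Some((SzOrOrd::Text(String::from_utf16_lossy(&units)), p))
        }
    }
}

/// Returns the dialog caption, if the template has a non-empty one.
pub fn dialog_title(d: &[u8]) -> Option<String> {
    // DLGTEMPLATEEX starts with dlgVer = 1 and signature = 0xFFFF.
    let is_ex = u16_at(d, 0)? == 1 && u16_at(d, 2)? == 0xFFFF;
    let header_len = if is_ex { 26 } else { 18 };
    let (_menu, off) = read_sz_or_ord(d, header_len)?;
    let (_class, off) = read_sz_or_ord(d, off)?;
    // The title is a plain NUL-terminated string; the same reader is reused
    // here, so an empty title comes back as SzOrOrd::None.
    match read_sz_or_ord(d, off)? {
        (SzOrOrd::Text(t), _) => Some(t),
        _ => None,
    }
}
```

Control text would follow the same pattern: each item's class and title fields use the same three-way encoding, with items aligned on 32-bit boundaries.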

RT_MENU (Type 4) - Menu Resources

  • Parse the MENUITEMTEMPLATEHEADER and the MENUITEMTEMPLATE entries that follow it (MENUEX resources use a different layout)
  • Extract menu item text strings
  • Tag with Resource tag
  • Assign medium relevance scores (5-7)
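
A sketch of walking a standard (non-MENUEX) menu template, assuming `d` is the raw RT_MENU data. It relies on the documented layout: a 4-byte header (version 0, then an offset to the first item), followed by items made of an option word, an ID word that is absent for popup items, and a null-terminated UTF-16 string. MF_POPUP (0x0010) opens a submenu and MF_END (0x0080) closes the current level. `menu_strings` is a hypothetical name.

```rust
const MF_POPUP: u16 = 0x0010;
const MF_END: u16 = 0x0080;

fn u16_at(d: &[u8], off: usize) -> Option<u16> {
    let s = d.get(off..off.checked_add(2)?)?;
    Some(u16::from_le_bytes([s[0], s[1]]))
}

/// Reads a NUL-terminated UTF-16LE string; returns it and the next offset.
fn read_wstr(d: &[u8], mut off: usize) -> Option<(String, usize)> {
    let mut units = Vec::new();
    loop {
        let u = u16_at(d, off)?;
        off += 2;
        if u == 0 {
            return Some((String::from_utf16_lossy(&units), off));
        }
        units.push(u);
    }
}

/// Returns non-empty menu item strings in template order.
pub fn menu_strings(d: &[u8]) -> Vec<String> {
    let mut out = Vec::new();
    let (version, skip) = match (u16_at(d, 0), u16_at(d, 2)) {
        (Some(v), Some(s)) => (v, s as usize),
        _ => return out,
    };
    if version != 0 {
        return out; // MENUEX templates are not handled in this sketch
    }
    let mut off = 4 + skip;
    // One flag per open menu level: does closing it also close its parent?
    let mut levels = vec![false];
    while !levels.is_empty() {
        let option = match u16_at(d, off) {
            Some(o) => o,
            None => break,
        };
        off += 2;
        let is_popup = option & MF_POPUP != 0;
        if !is_popup {
            off += 2; // skip mtID, which popup items do not carry
        }
        let text = match read_wstr(d, off) {
            Some((t, next)) => {
                off = next;
                t
            }
            None => break,
        };
        if !text.is_empty() {
            out.push(text); // separators have empty text
        }
        if is_popup {
            levels.push(option & MF_END != 0);
        } else if option & MF_END != 0 {
            while let Some(parent_ends) = levels.pop() {
                if !parent_ends {
                    break;
                }
            }
        }
    }
    out
}
```

Tracking open levels means trailing padding after the last item is never interpreted as further menu entries.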

3. String Encoding

  • Implement UTF-16LE decoding (most resource strings are UTF-16LE)
  • Handle both null-terminated and length-prefixed wide strings
  • Convert to UTF-8 for storage in ExtractedString
  • Integrate with existing encoding detection if mixed encodings exist
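
The decoding step might look like the sketch below, which uses only the standard library. `decode_utf16le` is a hypothetical helper: it stops at the first null code unit, ignores a trailing odd byte, and replaces unpaired surrogates with U+FFFD so the result is always valid UTF-8.

```rust
/// Decodes UTF-16LE bytes up to the first NUL into a UTF-8 `String`.
pub fn decode_utf16le(bytes: &[u8]) -> String {
    let units = bytes
        .chunks_exact(2) // a trailing odd byte is dropped
        .map(|c| u16::from_le_bytes([c[0], c[1]]))
        .take_while(|&u| u != 0);
    std::char::decode_utf16(units)
        .map(|r| r.unwrap_or(std::char::REPLACEMENT_CHARACTER))
        .collect()
}
```

For length-prefixed strings, slicing the input to the declared length before calling the helper gives the same lossy behaviour.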

4. Integration Points

  • Add resource extraction method to PeParser implementation
  • Call during container parsing in parse() method
  • Store extracted strings with StringSource::ResourceString
  • Apply semantic tags: Resource, Version, Manifest
  • Integrate into existing section weighting/scoring system
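
One way to keep the tagging and scoring rules from section 2 in a single place is a lookup keyed by the numeric resource type. The sketch below is illustrative only: `Tag`, `ResourceProfile`, and `profile_for` are hypothetical stand-ins rather than Stringy's actual types, and the score ranges are copied from the proposal above.

```rust
/// Hypothetical stand-in for the semantic tags named in this proposal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tag {
    Resource,
    Version,
    Manifest,
}

/// Tags and relevance score range applied to strings of one resource type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ResourceProfile {
    pub tags: &'static [Tag],
    pub min_score: u8,
    pub max_score: u8,
}

const VERSION_TAGS: &[Tag] = &[Tag::Version, Tag::Resource];
const MANIFEST_TAGS: &[Tag] = &[Tag::Manifest, Tag::Resource];
const PLAIN_TAGS: &[Tag] = &[Tag::Resource];

/// Maps a Win32 resource type ID to its profile; unknown types yield `None`.
pub fn profile_for(type_id: u32) -> Option<ResourceProfile> {
    let (tags, min_score, max_score) = match type_id {
        16 => (VERSION_TAGS, 8, 10), // RT_VERSION
        24 => (MANIFEST_TAGS, 7, 9), // RT_MANIFEST
        6 => (PLAIN_TAGS, 6, 8),     // RT_STRING
        5 => (PLAIN_TAGS, 5, 7),     // RT_DIALOG
        4 => (PLAIN_TAGS, 5, 7),     // RT_MENU
        _ => return None,
    };
    Some(ResourceProfile { tags, min_score, max_score })
}
```

Returning `None` for unhandled types gives the caller a natural place to skip resources such as icons and bitmaps.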

5. Library Consideration

Option A: Use pelite crate (specialized PE parser with resource support)

  • Pros: Handles resource parsing complexity, well-tested
  • Cons: Adds dependency, may need integration work

Option B: Extend current goblin-based parser manually

  • Pros: No new dependencies, full control
  • Cons: More complex implementation, need to handle edge cases

Recommendation: Start with Option B (manual parsing), leveraging the existing goblin::pe::PE structures, and add pelite later if the complexity warrants it.

Acceptance Criteria

  • Parse PE resource directory three-level tree structure (Type → Name → Language)
  • Extract version info fields (FileDescription, ProductName, CompanyName, etc.)
  • Extract manifest XML strings from embedded manifests
  • Handle STRINGTABLE resources with UTF-16LE decoding
  • Extract dialog box template strings (window titles, control text)
  • Extract menu definition strings
  • Apply appropriate semantic tags (Resource, Version, Manifest)
  • Include resource strings in scoring/ranking system with appropriate weights
  • Add unit tests for resource directory parsing
  • Add integration tests using real PE binaries with resources
  • Handle edge cases: missing resources, malformed structures, mixed encodings
  • Document resource extraction in docs/src/binary-formats.md

Implementation Notes

  • Resource extraction should be optional/non-blocking (gracefully handle parse failures)
  • Maintain performance and robustness: cap traversal at the three defined levels so that malformed or self-referencing directory offsets cannot cause deep recursion
  • Consider adding CLI flag --resources to enable/disable resource extraction
  • Test with real-world Windows binaries (e.g., notepad.exe, calc.exe)

Related Issues

Note: This issue consolidates and supersedes #5 and #56 with a more comprehensive approach.
