Implement PE Resource String Extraction (VERSIONINFO, STRINGTABLE, Manifests) #5

@unclesp1d3r

Summary

Extract strings from PE (Portable Executable) resource sections including VERSIONINFO, STRINGTABLE, and manifest resources. This feature will significantly improve string extraction quality from Windows executables by tapping into the rich metadata and localized string resources that PE files contain.

Background

PE resources (conventionally stored in the .rsrc section and located through the resource entry of the PE data directory) form a hierarchical data structure containing various resource types that are rich sources of meaningful strings:

  • VERSIONINFO (RT_VERSION): Product names, file descriptions, company names, copyright information, version strings
  • STRINGTABLE (RT_STRING): Localized UI strings, error messages, labels organized by language/locale
  • RT_MANIFEST: Application manifests containing assembly information, compatibility settings, DPI awareness, and UAC requirements
  • Dialog Resources (RT_DIALOG): Future enhancement - control captions and labels

PE Resource Structure

PE resources are organized as a three-level tree:

  1. Type: Resource type (RT_VERSION, RT_STRING, RT_MANIFEST, etc.)
  2. Name/ID: Resource identifier (numeric ID or string name)
  3. Language: Language identifier (a 16-bit LANGID, e.g. 0x0409 for en-US)

Each level uses a resource directory structure whose entries point either to subdirectories or to data descriptors that give the RVA and size of the actual resource data.
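To make the entry layout concrete, here is a minimal sketch of decoding one 8-byte resource directory entry. The type and function names (`EntryId`, `EntryTarget`, `decode_entry`) are illustrative assumptions, not existing project or goblin APIs. The bit layout follows the PE format: the high bit of the first field marks a string name, and the high bit of the second marks a subdirectory.

```rust
/// How an entry is identified (illustrative type, not a project API).
#[derive(Debug, PartialEq)]
pub enum EntryId {
    /// Numeric ID, e.g. 16 for RT_VERSION at the type level.
    Id(u16),
    /// Offset (from the start of the resource section) of a
    /// length-prefixed UTF-16LE name string.
    NameOffset(u32),
}

/// What an entry points at (illustrative type, not a project API).
#[derive(Debug, PartialEq)]
pub enum EntryTarget {
    /// Offset of another directory, relative to the resource section start.
    Subdirectory(u32),
    /// Offset of a data descriptor, relative to the resource section start.
    Data(u32),
}

fn u32_at(buf: &[u8], off: usize) -> Option<u32> {
    let s = buf.get(off..off.checked_add(4)?)?;
    Some(u32::from_le_bytes([s[0], s[1], s[2], s[3]]))
}

/// Decode one 8-byte directory entry; returns None if `raw` is too short.
pub fn decode_entry(raw: &[u8]) -> Option<(EntryId, EntryTarget)> {
    const HIGH_BIT: u32 = 0x8000_0000;
    let name = u32_at(raw, 0)?;
    let offset = u32_at(raw, 4)?;
    let id = if name & HIGH_BIT != 0 {
        EntryId::NameOffset(name & !HIGH_BIT)
    } else {
        EntryId::Id(name as u16)
    };
    let target = if offset & HIGH_BIT != 0 {
        EntryTarget::Subdirectory(offset & !HIGH_BIT)
    } else {
        EntryTarget::Data(offset)
    };
    Some((id, target))
}
```

Returning `Option` rather than panicking keeps truncated or hostile input from crashing the extractor.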

Why This Matters

  • Windows APIs heavily favor UTF-16 encoding; resources are a primary source of UTF-16 strings
  • Version information provides high-confidence metadata (company names, product names)
  • String tables contain localized content, useful for malware analysis and application fingerprinting
  • Manifests reveal application capabilities, dependencies, and security settings
  • These strings are typically high-value and should receive strong ranking scores

Current Implementation Status

Completed:

  • Basic PE parsing via goblin crate
  • Section classification that identifies .rsrc sections
  • SectionType::Resources enumeration
  • StringSource::ResourceString type defined

Missing:

  • Resource directory tree parsing
  • VERSIONINFO structure parsing and string extraction
  • STRINGTABLE entry enumeration and text extraction
  • Manifest XML parsing
  • UTF-16LE decoding for resource strings

Proposed Solution

Implementation Approach

Phase 1: Resource Directory Parsing

  1. Locate .rsrc section in PE file
  2. Parse resource directory header and entry tables
  3. Traverse three-level tree (Type → Name → Language)
  4. Collect data descriptors for target resource types
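The traversal in Phase 1 could be sketched as follows, assuming the caller has already sliced out the raw resource section bytes. `ResourceLeaf` and `collect_leaves` are hypothetical names, and the code uses only the standard library rather than any goblin resource API. Recursion is capped at three levels so a directory that points back at itself cannot recurse forever.

```rust
/// One data descriptor found in the tree (illustrative type).
#[derive(Debug, PartialEq)]
pub struct ResourceLeaf {
    pub type_id: u32, // raw Name field at the type level
    pub name_id: u32, // raw Name field at the name level
    pub lang_id: u32, // raw Name field at the language level
    pub data_rva: u32,
    pub size: u32,
}

fn u16_at(b: &[u8], off: usize) -> Option<u16> {
    let s = b.get(off..off.checked_add(2)?)?;
    Some(u16::from_le_bytes([s[0], s[1]]))
}

fn u32_at(b: &[u8], off: usize) -> Option<u32> {
    let s = b.get(off..off.checked_add(4)?)?;
    Some(u32::from_le_bytes([s[0], s[1], s[2], s[3]]))
}

/// Walk the Type -> Name -> Language tree in `rsrc` and return every leaf.
pub fn collect_leaves(rsrc: &[u8]) -> Vec<ResourceLeaf> {
    let mut out = Vec::new();
    let _ = walk(rsrc, 0, 0, [0; 3], &mut out);
    out
}

fn walk(
    rsrc: &[u8],
    dir_off: usize,
    depth: usize,
    mut path: [u32; 3],
    out: &mut Vec<ResourceLeaf>,
) -> Option<()> {
    if depth >= 3 {
        return None; // deeper than the documented tree: treat as malformed
    }
    // Directory header is 16 bytes; the two entry counts sit at +12 and +14.
    let named = u16_at(rsrc, dir_off.checked_add(12)?)? as usize;
    let ids = u16_at(rsrc, dir_off.checked_add(14)?)? as usize;
    for i in 0..named + ids {
        let entry_off = dir_off.checked_add(16)?.checked_add(i * 8)?;
        let name = u32_at(rsrc, entry_off)?;
        let target = u32_at(rsrc, entry_off.checked_add(4)?)?;
        path[depth] = name;
        let off = (target & 0x7FFF_FFFF) as usize;
        if target & 0x8000_0000 != 0 {
            // A bad subtree should not discard its siblings.
            let _ = walk(rsrc, off, depth + 1, path, out);
        } else if let (Some(rva), Some(size)) = (u32_at(rsrc, off), u32_at(rsrc, off + 4)) {
            out.push(ResourceLeaf {
                type_id: path[0],
                name_id: path[1],
                lang_id: path[2],
                data_rva: rva,
                size,
            });
        }
    }
    Some(())
}
```

Filtering the result on `type_id` 16, 6, or 24 yields the descriptors for the later phases. A production version would probably also cap the total number of visited entries, since the depth limit alone does not bound the work on a hostile file.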

Phase 2: VERSIONINFO Extraction

  1. Identify RT_VERSION (type 16) resources
  2. Parse VS_VERSIONINFO structure
  3. Extract StringFileInfo blocks:
    • FileDescription
    • ProductName
    • CompanyName
    • LegalCopyright
    • FileVersion
    • ProductVersion
    • InternalName
    • OriginalFilename
  4. Handle multiple language blocks if present
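As a rough illustration of Phase 2, the sketch below walks the nested VS_VERSIONINFO layout, where each node has `wLength`, `wValueLength`, and `wType` header fields, a NUL-terminated UTF-16LE key, and 32-bit alignment padding. It returns `(language block key, field name, value)` triples. The function names are assumptions made for this example. String values are read up to the first NUL inside the node rather than trusting `wValueLength`, a defensive choice on the assumption that the field may be unreliable in files from the wild.

```rust
fn u16_at(b: &[u8], off: usize) -> Option<u16> {
    let s = b.get(off..off.checked_add(2)?)?;
    Some(u16::from_le_bytes([s[0], s[1]]))
}

/// Round up to the next 32-bit boundary (relative to the resource start).
fn align4(off: usize) -> usize {
    (off + 3) & !3
}

/// Read UTF-16LE from `start` until a NUL or `end`; returns the text and
/// the offset just past the terminator.
fn read_utf16z(b: &[u8], start: usize, end: usize) -> (String, usize) {
    let mut units = Vec::new();
    let mut off = start;
    while off + 2 <= end {
        let u = match u16_at(b, off) {
            Some(u) => u,
            None => break,
        };
        off += 2;
        if u == 0 {
            break;
        }
        units.push(u);
    }
    (String::from_utf16_lossy(&units), off)
}

struct Node {
    key: String,
    value_off: usize, // aligned offset where the value (or children) begins
    end: usize,       // offset just past this node (wLength bytes from start)
}

fn parse_node(b: &[u8], off: usize, limit: usize) -> Option<Node> {
    let len = u16_at(b, off)? as usize; // wLength
    let end = off.checked_add(len)?;
    if len < 6 || end > limit || end > b.len() {
        return None; // header is 6 bytes; a node must fit inside its parent
    }
    let (key, after_key) = read_utf16z(b, off + 6, end);
    Some(Node { key, value_off: align4(after_key).min(end), end })
}

fn children(b: &[u8], start: usize, end: usize) -> Vec<Node> {
    let mut out = Vec::new();
    let mut off = align4(start);
    while off < end {
        match parse_node(b, off, end) {
            Some(n) => {
                off = align4(n.end);
                out.push(n);
            }
            None => break,
        }
    }
    out
}

/// Extract (language block key, field name, value) from an RT_VERSION resource.
pub fn version_strings(b: &[u8]) -> Vec<(String, String, String)> {
    let mut out = Vec::new();
    let root = match parse_node(b, 0, b.len()) {
        Some(n) if n.key == "VS_VERSION_INFO" => n,
        _ => return out,
    };
    // The root value is VS_FIXEDFILEINFO; wValueLength gives its size in bytes.
    let fixed_len = u16_at(b, 2).unwrap_or(0) as usize;
    for block in children(b, root.value_off + fixed_len, root.end) {
        if block.key != "StringFileInfo" {
            continue; // e.g. VarFileInfo carries no display strings
        }
        for table in children(b, block.value_off, block.end) {
            for s in children(b, table.value_off, table.end) {
                let (value, _) = read_utf16z(b, s.value_off, s.end);
                out.push((table.key.clone(), s.key, value));
            }
        }
    }
    out
}
```

The language block key (for example `040904B0`) encodes the language and code page as hex digits, so keeping it alongside each string covers the multiple-language case in step 4.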

Phase 3: STRINGTABLE Extraction

  1. Identify RT_STRING (type 6) resources
  2. Enumerate string table blocks (each block's resource ID is (string ID / 16) + 1)
  3. Extract individual strings (each block holds 16 length-prefixed UTF-16LE entries; unused slots have length zero)
  4. Associate strings with language IDs for locale awareness
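A minimal sketch of decoding one RT_STRING block, assuming `data` is the raw resource payload and `block_id` is the numeric ID from the name level of the resource tree. `parse_string_block` is a hypothetical helper name. Each of the 16 slots is a `u16` length in UTF-16 code units followed by that many units, with no NUL terminator.

```rust
/// Decode one STRINGTABLE block into (string ID, text) pairs.
/// Empty slots are skipped; parsing stops at the first truncated entry.
pub fn parse_string_block(block_id: u16, data: &[u8]) -> Vec<(u32, String)> {
    let mut out = Vec::new();
    let mut off = 0usize;
    for index in 0..16u32 {
        // Length prefix, counted in UTF-16 code units.
        let len = match data.get(off..off + 2) {
            Some(s) => u16::from_le_bytes([s[0], s[1]]) as usize,
            None => break,
        };
        off += 2;
        let bytes = match data.get(off..off + len * 2) {
            Some(s) => s,
            None => break, // declared length runs past the resource
        };
        off += len * 2;
        if len == 0 {
            continue;
        }
        let units: Vec<u16> = bytes
            .chunks_exact(2)
            .map(|c| u16::from_le_bytes([c[0], c[1]]))
            .collect();
        // Block N holds string IDs (N - 1) * 16 through (N - 1) * 16 + 15.
        let id = (block_id as u32).saturating_sub(1) * 16 + index;
        out.push((id, String::from_utf16_lossy(&units)));
    }
    out
}
```

Pairing each result with the language ID from the third tree level gives the locale association described in step 4.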

Phase 4: Manifest Parsing

  1. Identify RT_MANIFEST (type 24) resources
  2. Extract XML content (typically UTF-8 or UTF-16)
  3. Parse key XML elements:
    • Assembly identity (name, version, architecture)
    • Dependency information
    • Compatibility sections
    • Security settings (requestedExecutionLevel)
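The encoding step of Phase 4 might look like the sketch below. It sniffs a byte-order mark and falls back to lossy UTF-8 when none is present. `decode_manifest` and `attribute_value` are hypothetical helpers. The attribute lookup is a deliberately naive substring search with no namespace-prefix or entity handling, so a real XML parser such as quick-xml would likely be the more robust choice for the final implementation.

```rust
/// Decode manifest bytes to text using the BOM, defaulting to UTF-8.
pub fn decode_manifest(data: &[u8]) -> String {
    match data {
        [0xEF, 0xBB, 0xBF, rest @ ..] => String::from_utf8_lossy(rest).into_owned(),
        [0xFF, 0xFE, rest @ ..] => utf16(rest, u16::from_le_bytes),
        [0xFE, 0xFF, rest @ ..] => utf16(rest, u16::from_be_bytes),
        _ => String::from_utf8_lossy(data).into_owned(),
    }
}

fn utf16(bytes: &[u8], conv: fn([u8; 2]) -> u16) -> String {
    // A trailing odd byte is ignored by chunks_exact.
    let units: Vec<u16> = bytes.chunks_exact(2).map(|c| conv([c[0], c[1]])).collect();
    String::from_utf16_lossy(&units)
}

/// Naive lookup of `attr` on the first `<element ...>` tag.
/// Accepts single- or double-quoted values; not a general XML parser.
pub fn attribute_value<'a>(xml: &'a str, element: &str, attr: &str) -> Option<&'a str> {
    let start = xml.find(&format!("<{}", element))?;
    let tag_end = start + xml[start..].find('>')?;
    let tag = &xml[start..tag_end];
    let after = tag.find(&format!("{}=", attr))? + attr.len() + 1;
    let quote = tag[after..].chars().next()?;
    if quote != '"' && quote != '\'' {
        return None;
    }
    let body = &tag[after + 1..];
    let end = body.find(quote)?;
    Some(&body[..end])
}
```

For example, `attribute_value(&text, "requestedExecutionLevel", "level")` would return the UAC level string when the element is present without a namespace prefix.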

Phase 5: Integration

  1. Add extracted strings to results with StringSource::ResourceString
  2. Tag appropriately (version_info, string_table, manifest)
  3. Apply high ranking scores (resources are high-confidence data)
  4. Preserve resource type, ID, and language metadata

Key Technical Considerations

  • Encoding: Resources use UTF-16LE primarily; manifest may be UTF-8
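  • RVA Translation: Directory entry offsets are relative to the start of the resource section, but resource data descriptors use RVAs that must be translated to file offsets
  • Alignment: VERSIONINFO members are padded to 32-bit boundaries; skip the padding when walking nested structures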
  • Malformed Data: Handle corrupted or malicious resource structures gracefully
  • goblin Limitations: May need to manually parse resource structures if goblin's PE resource support is limited
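The RVA translation point can be illustrated with a small, std-only sketch. The `Section` struct here is a stand-in defined for the example; its field names mirror the standard PE section header fields. `rva_to_offset` and `resource_bytes` are hypothetical helper names.

```rust
/// Minimal stand-in for a PE section header (illustrative).
#[derive(Clone, Copy)]
pub struct Section {
    pub virtual_address: u32,
    pub virtual_size: u32,
    pub pointer_to_raw_data: u32,
    pub size_of_raw_data: u32,
}

/// Map an RVA to a file offset, or None if no section's raw data covers it.
pub fn rva_to_offset(sections: &[Section], rva: u32) -> Option<usize> {
    sections.iter().find_map(|s| {
        let delta = rva.checked_sub(s.virtual_address)?;
        // Only bytes backed by raw data actually exist in the file.
        if delta < s.size_of_raw_data {
            (s.pointer_to_raw_data as usize).checked_add(delta as usize)
        } else {
            None
        }
    })
}

/// Slice the bytes of one resource out of the file, with bounds checks.
pub fn resource_bytes<'a>(
    file: &'a [u8],
    sections: &[Section],
    rva: u32,
    size: u32,
) -> Option<&'a [u8]> {
    let start = rva_to_offset(sections, rva)?;
    file.get(start..start.checked_add(size as usize)?)
}
```

Because every step returns `Option`, a descriptor whose RVA or size points outside the file is skipped instead of causing a panic, which also addresses the malformed-data consideration.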

Dependencies

  • Blocked by: PE Resource Extraction Foundation (assumed to provide resource directory traversal utilities)
  • goblin crate: Already integrated for PE parsing
  • Potential additions:
    • quick-xml or similar for manifest parsing
    • Custom resource structure parsers if goblin insufficient

Acceptance Criteria

  • Extract all VERSIONINFO string fields from PE files
  • Enumerate and extract strings from STRINGTABLE resources
  • Parse and extract meaningful strings from RT_MANIFEST resources
  • Handle multiple language/locale variants
  • Correctly decode UTF-16LE resource strings
  • Tag extracted strings with resource type and metadata
  • Apply appropriate ranking scores (resources should score high)
  • Handle malformed/corrupted resource structures without crashing
  • Add unit tests with sample PE resource data
  • Add integration tests with real PE binaries (e.g., notepad.exe, sample malware)

Testing Strategy

  1. Unit Tests:

    • Resource directory parsing with crafted binary blobs
    • VERSIONINFO structure parsing
    • STRINGTABLE entry extraction
    • Manifest XML parsing
    • UTF-16LE decoding edge cases
  2. Integration Tests:

    • Real PE binaries (Windows system utilities)
    • Known malware samples (if available in test corpus)
    • PE files with multiple languages
    • PE files with missing or corrupted resources
  3. Regression Tests:

    • Ensure non-resource string extraction still works
    • Verify performance doesn't degrade significantly

Implementation Plan

  1. Research goblin's PE resource support capabilities
  2. Implement resource directory traversal utility
  3. Implement VERSIONINFO parser
  4. Implement STRINGTABLE parser
  5. Implement manifest parser
  6. Integrate with main extraction pipeline
  7. Add comprehensive tests
  8. Update documentation

Related Issues

  • Requirement version: 1.2
  • Task-ID: stringy-analyzer/pe-resource-string-extraction

References

@traycerai branch:3-implement-pe-section-classification-and-importexport-table-parsing
