
Epic: v0.4 - Advanced Binary Analysis Features (DWARF, Mach-O Load Commands, Go Build Info) #42

@unclesp1d3r

Overview

This epic tracks the development of three advanced binary analysis features for StringyMcStringFace v0.4, significantly expanding the tool's capabilities for extracting meaningful metadata and strings from different binary formats.

Motivation

While Stringy currently provides excellent string extraction from ELF, PE, and Mach-O binaries, there are rich sources of high-signal strings we're not yet tapping into:

  1. Debug symbols (DWARF) contain function names, variable names, type information, and file paths that are extremely valuable for analysis
  2. Mach-O load commands embed runtime paths, library dependencies, and framework references that standard section parsing misses
  3. Go binaries contain structured build metadata (version, module path, build settings) that can be reliably extracted

These features will make Stringy more comprehensive and competitive with specialized tools while maintaining its focus on high-signal, actionable output.


Feature 1: DWARF Debug Information Extraction

Description

Extract strings and metadata from DWARF debug sections (.debug_info, .debug_str, .debug_line) present in non-stripped binaries.

Value Proposition

  • Function and variable names: Surfaces developer-chosen identifiers even when symbols are otherwise unavailable
  • Source file paths: Reveals build environment and project structure
  • Type information: Exposes struct/class names and field names
  • High confidence: DWARF data is structured and reliable, not arbitrary byte runs

Proposed Approach

  1. Use the gimli crate for DWARF parsing (battle-tested, maintained by the gimli-rs project)
  2. Create a new extraction source: ExtractionSource::DwarfDebug
  3. Target sections: .debug_info, .debug_str, .debug_line, .debug_abbrev
  4. Extract:
    • DW_AT_name attributes (function/variable names)
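    • DW_AT_comp_dir and DW_AT_decl_file (source paths; DW_AT_decl_file is an index into the line-program file table in .debug_line, so it has to be resolved to a path)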
    • DW_TAG_structure_type and DW_TAG_class_type names
  5. Score highly (90+) since these are definitionally meaningful
  6. Tag with dwarf, symbol, filepath as appropriate
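As a rough illustration of steps 3 to 6, the sketch below handles only the simplest target, .debug_str, which is a table of NUL-terminated strings. It uses no gimli calls. Resolving which string is a DW_AT_name versus a DW_AT_comp_dir requires walking the DIEs in .debug_info, which is where gimli would come in. The `DwarfString` type, the fixed score of 90, and the "contains a slash" heuristic for choosing between the filepath and symbol tags are all assumptions made for this example, not existing Stringy code.

```rust
/// Hypothetical record type for this sketch; Stringy's real types may differ.
#[derive(Debug, PartialEq)]
pub struct DwarfString {
    /// Byte offset of the string inside the .debug_str section.
    pub offset: usize,
    pub text: String,
    pub score: u8,
    pub tags: Vec<&'static str>,
}

/// Split the raw bytes of a .debug_str section into its NUL-terminated
/// entries. Empty entries, entries that are not valid UTF-8, and a trailing
/// run without a terminating NUL are skipped.
pub fn split_debug_str(section: &[u8]) -> Vec<DwarfString> {
    let mut out = Vec::new();
    let mut start = 0;
    for (i, &byte) in section.iter().enumerate() {
        if byte != 0 {
            continue;
        }
        if i > start {
            if let Ok(text) = std::str::from_utf8(&section[start..i]) {
                // Assumed heuristic: anything containing '/' is treated as a path.
                let kind = if text.contains('/') { "filepath" } else { "symbol" };
                out.push(DwarfString {
                    offset: start,
                    text: text.to_string(),
                    score: 90, // assumed baseline, per "Score highly (90+)"
                    tags: vec!["dwarf", kind],
                });
            }
        }
        start = i + 1;
    }
    out
}
```

Keeping the section offset in each record would make it possible to match entries against DW_FORM_strp references later, once DIE traversal is in place.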

Implementation Considerations

  • DWARF parsing can be expensive; add a --skip-dwarf flag for performance-sensitive use cases
  • Some binaries have massive DWARF sections; consider size limits or sampling
  • Coordinate with deduplication logic to avoid duplicate strings from both DWARF and regular sections
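One possible shape for the deduplication point above: when the same text is found by both the regular section scan and the DWARF pass, keep a single entry and prefer the higher-scoring one. The `Extracted` struct and the two-variant `ExtractionSource` enum here are simplified stand-ins for illustration; the real pipeline types are not shown in this issue.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the real enum (assumption for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExtractionSource {
    Section,
    DwarfDebug,
}

/// Simplified stand-in for an extracted string (assumption for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct Extracted {
    pub text: String,
    pub source: ExtractionSource,
    pub score: u8,
}

/// Collapse entries with identical text, keeping the highest-scoring one.
/// Output order follows the first appearance of each distinct text.
pub fn dedup_prefer_highest(items: Vec<Extracted>) -> Vec<Extracted> {
    let mut index: HashMap<String, usize> = HashMap::new();
    let mut out: Vec<Extracted> = Vec::new();
    for item in items {
        let existing = index.get(&item.text).copied();
        match existing {
            Some(i) => {
                // Same text seen before: replace only if this one scores higher.
                if item.score > out[i].score {
                    out[i] = item;
                }
            }
            None => {
                index.insert(item.text.clone(), out.len());
                out.push(item);
            }
        }
    }
    out
}
```

With DWARF strings scored at 90 or above, this policy means the DWARF copy wins over a lower-scored byte-run match of the same text.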

Feature 2: Mach-O Load Command String Extraction

Description

Parse Mach-O load commands (LC_RPATH, LC_LOAD_DYLIB, LC_ID_DYLIB, etc.) to extract embedded library paths, framework references, and runtime search paths.

Value Proposition

  • Runtime dependencies: Surfaces dynamically linked libraries and frameworks
  • @rpath resolution: Exposes runtime search paths critical for understanding macOS/iOS binaries
  • Installation names: Reveals expected library locations
  • Mach-O specific: Standard string extraction misses these since they're in the header, not data sections

Proposed Approach

  1. Extend src/container/macho.rs to parse load commands
  2. Use goblin::mach::load_command APIs
  3. Target commands:
    • LC_LOAD_DYLIB, LC_LOAD_WEAK_DYLIB, LC_REEXPORT_DYLIB → dependency paths
    • LC_ID_DYLIB → library installation name
    • LC_RPATH → runtime search paths
    • LC_VERSION_MIN_* → platform versions (stored as packed integers, so they must be formatted into X.Y.Z strings before being emitted)
  4. Create ExtractionSource::MachOLoadCommand
  5. Tag with macho-lc, filepath, framework, rpath
  6. Score at 85-90 (highly relevant for Mach-O analysis)
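To make the on-disk layout behind step 3 concrete, here is a std-only sketch that walks a buffer of load commands and pulls out the embedded path for the dylib and rpath commands. In practice goblin's load command types would do this parsing; nothing below is goblin API. The sketch assumes a little-endian Mach-O and takes the bytes that immediately follow the mach header. The function name `load_command_paths` is made up for the example, while the command constants follow the values in Apple's loader.h.

```rust
const LC_REQ_DYLD: u32 = 0x8000_0000;
const LC_LOAD_DYLIB: u32 = 0xc;
const LC_ID_DYLIB: u32 = 0xd;
const LC_LOAD_WEAK_DYLIB: u32 = 0x18 | LC_REQ_DYLD;
const LC_RPATH: u32 = 0x1c | LC_REQ_DYLD;
const LC_REEXPORT_DYLIB: u32 = 0x1f | LC_REQ_DYLD;

/// Read a little-endian u32 at `off`, or None if out of bounds.
fn read_u32_le(buf: &[u8], off: usize) -> Option<u32> {
    let bytes = buf.get(off..off.checked_add(4)?)?;
    Some(u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}

/// Walk up to `ncmds` load commands and return (command id, path) pairs for
/// the commands that embed a path. Stops at the first malformed command.
pub fn load_command_paths(cmds: &[u8], ncmds: u32) -> Vec<(u32, String)> {
    let mut out = Vec::new();
    let mut off = 0usize;
    for _ in 0..ncmds {
        // Every load command starts with: u32 cmd, u32 cmdsize.
        let cmd = match read_u32_le(cmds, off) {
            Some(v) => v,
            None => break,
        };
        let size = match read_u32_le(cmds, off + 4) {
            Some(v) => v as usize,
            None => break,
        };
        if size < 8 || off + size > cmds.len() {
            break;
        }
        let body = &cmds[off..off + size];
        let has_path = matches!(
            cmd,
            LC_LOAD_DYLIB | LC_ID_DYLIB | LC_LOAD_WEAK_DYLIB | LC_RPATH | LC_REEXPORT_DYLIB
        );
        if has_path {
            // In both dylib_command and rpath_command, the field at byte 8 is
            // an lc_str offset measured from the start of the command.
            if let Some(str_off) = read_u32_le(body, 8) {
                if let Some(raw) = body.get(str_off as usize..) {
                    let end = raw.iter().position(|&b| b == 0).unwrap_or(raw.len());
                    if let Ok(path) = std::str::from_utf8(&raw[..end]) {
                        if !path.is_empty() {
                            out.push((cmd, path.to_string()));
                        }
                    }
                }
            }
        }
        off += size;
    }
    out
}
```

The path is copied out verbatim, so prefixes such as @rpath or @loader_path survive untouched.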

Implementation Considerations

  • Load commands are header-resident, so add a separate extraction pass before section scanning
  • Some paths may contain @executable_path, @loader_path, @rpath variables—preserve these
  • Consider a --macho-load-commands filter for targeted analysis
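A small sketch of how tagging could work while leaving @executable_path, @loader_path, and @rpath prefixes intact. The `TaggedPath` type and the matching rules (a ".framework/" substring marks a framework, an "@rpath/" prefix or an LC_RPATH origin marks an rpath) are assumptions chosen for illustration. The tag names come from the proposed approach.

```rust
/// Hypothetical output record for this sketch.
#[derive(Debug, PartialEq)]
pub struct TaggedPath {
    /// The path exactly as stored in the load command (no variable expansion).
    pub text: String,
    pub tags: Vec<&'static str>,
}

/// Attach tags to a path taken from a load command without rewriting it.
/// `from_rpath_cmd` is true when the path came from an LC_RPATH command.
pub fn tag_load_command_path(path: &str, from_rpath_cmd: bool) -> TaggedPath {
    let mut tags = vec!["macho-lc", "filepath"];
    if from_rpath_cmd || path.starts_with("@rpath/") {
        tags.push("rpath");
    }
    if path.contains(".framework/") {
        tags.push("framework");
    }
    TaggedPath {
        text: path.to_string(),
        tags,
    }
}
```

A filter such as the proposed --macho-load-commands option could then select on the macho-lc tag alone.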

Feature 3: Go Build Info Detection

Description

Extract Go build metadata (Go version, module path, build settings, VCS info) from the .go.buildinfo section or by scanning for the embedded \xff Go buildinf: magic marker.

Value Proposition

  • Module identification: Reveals the Go module path (e.g., github.com/user/project)
  • Version tracking: Extracts Go compiler version and module versions
  • Build flags: Surfaces build tags, CGO settings, and GOEXPERIMENT flags
  • VCS metadata: Captures git commit hash, modified status, and build timestamp
  • Structured data: Go embeds this in a parseable format, not arbitrary strings

Proposed Approach

  1. Use the gobuild crate or implement the buildinfo format parser
  2. Locate build info:
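    • In ELF: Look for the .go.buildinfo section. In Mach-O the section is named __go_buildinfo; PE has no dedicated section name, so scan the data section. The presence of .gopclntab can help confirm a Go binary
    • Fallback: Scan for the \xff Go buildinf: magic bytes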
  3. Parse the buildinfo structure:
    • Go version (GoVersion)
    • Module path (Path)
    • Main module version (Main.Version)
    • Build settings (Settings: -tags, -compiler, etc.)
    • VCS info (revision, time, modified)
  4. Create ExtractionSource::GoBuildInfo
  5. Emit each field as a separate ExtractedString with tags: go-buildinfo, version, module, vcs
  6. Score at 95+ (extremely high signal for Go binaries)
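For the custom-parser route, the sketch below shows one way steps 2 and 3 might start: find the magic, check the flags byte, and decode the two varint-length-prefixed strings that follow the 32-byte header. The layout follows what Go's debug/buildinfo package reads for Go 1.18+ binaries (flags bit 0x2 set means inline strings). The function names are invented for this example, and the older pointer-based layout is deliberately not handled. A real implementation would also restrict the search to the data section and honor the 16-byte alignment of the header.

```rust
const MAGIC: &[u8] = b"\xff Go buildinf:";
const HEADER_SIZE: usize = 32;
const FLAGS_OFFSET: usize = 15;
const FLAG_INLINE_STRINGS: u8 = 0x2;

/// Decode an unsigned LEB128-style varint, advancing `pos`.
fn read_uvarint(data: &[u8], pos: &mut usize) -> Option<u64> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        let byte = *data.get(*pos)?;
        *pos += 1;
        if shift >= 64 {
            return None; // too many continuation bytes
        }
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
    }
}

/// Read a varint length followed by that many bytes, advancing `pos`.
fn read_bytes<'a>(data: &'a [u8], pos: &mut usize) -> Option<&'a [u8]> {
    let len = read_uvarint(data, pos)? as usize;
    let end = (*pos).checked_add(len)?;
    let bytes = data.get(*pos..end)?;
    *pos = end;
    Some(bytes)
}

/// Return (go_version, module_info_text) from an inline-format buildinfo
/// blob, or None if the magic is missing or the layout is pointer-based.
pub fn parse_inline_buildinfo(data: &[u8]) -> Option<(String, String)> {
    let start = data.windows(MAGIC.len()).position(|w| w == MAGIC)?;
    let header = data.get(start..start + HEADER_SIZE)?;
    if header[FLAGS_OFFSET] & FLAG_INLINE_STRINGS == 0 {
        return None; // pre-1.18 pointer layout: not handled in this sketch
    }
    let mut pos = start + HEADER_SIZE;
    let version = String::from_utf8(read_bytes(data, &mut pos)?.to_vec()).ok()?;
    let raw_mod = read_bytes(data, &mut pos)?;
    // The module info is framed by 16-byte sentinels on both sides.
    let modinfo = if raw_mod.len() >= 33 && raw_mod[raw_mod.len() - 17] == b'\n' {
        String::from_utf8_lossy(&raw_mod[16..raw_mod.len() - 16]).into_owned()
    } else {
        String::new()
    };
    Some((version, modinfo))
}
```

The returned module-info text is a tab-separated, line-oriented format (path, mod, dep, and build lines) that still needs to be split into individual fields.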

Implementation Considerations

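  • The inline buildinfo format (length-prefixed strings after a 32-byte header) is stable across Go 1.18+
  • Older Go binaries use a pointer-based buildinfo layout, or lack buildinfo entirely; for those we may need to fall back to pclntab parsing (more complex)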
  • Add a --only go-buildinfo filter for quick Go binary identification
  • Consider a dedicated --go flag that enables all Go-specific features
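A filter like --only go-buildinfo presupposes that each build-info field is emitted as its own tagged entry. The following sketch splits the module-info text (tab-separated path, mod, and build lines, as produced by the Go toolchain) into such entries and filters them by tag. The `GoField` type, the "main.version" key, and `only_tag` are hypothetical names. Dependency (dep) lines and quoted setting values are ignored here to keep the example short.

```rust
/// Hypothetical tagged field for this sketch.
#[derive(Debug, PartialEq)]
pub struct GoField {
    pub key: String,
    pub value: String,
    pub tags: Vec<&'static str>,
}

/// Split module-info text into tagged fields. Unknown or malformed lines
/// are skipped.
pub fn modinfo_fields(modinfo: &str) -> Vec<GoField> {
    let mut out = Vec::new();
    for line in modinfo.lines() {
        let mut parts = line.split('\t');
        let kind = parts.next().unwrap_or("");
        let rest: Vec<&str> = parts.collect();
        // Each arm yields (key, value, optional extra tag).
        let field = match (kind, rest.as_slice()) {
            ("path", [path, ..]) => {
                Some(("path".to_string(), path.to_string(), Some("module")))
            }
            ("mod", [_, version, ..]) => {
                Some(("main.version".to_string(), version.to_string(), Some("version")))
            }
            ("build", [setting, ..]) => setting.split_once('=').map(|(k, v)| {
                let extra = if k.starts_with("vcs") { Some("vcs") } else { None };
                (k.to_string(), v.to_string(), extra)
            }),
            _ => None,
        };
        if let Some((key, value, extra)) = field {
            let mut tags = vec!["go-buildinfo"];
            tags.extend(extra);
            out.push(GoField { key, value, tags });
        }
    }
    out
}

/// Keep only the fields carrying `tag` (what an --only filter might do).
pub fn only_tag<'a>(fields: &'a [GoField], tag: &str) -> Vec<&'a GoField> {
    fields
        .iter()
        .filter(|f| f.tags.iter().any(|t| *t == tag))
        .collect()
}
```

Because every field carries the go-buildinfo tag, filtering on it returns the full set, while narrower tags such as vcs select a subset.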

Acceptance Criteria

  • DWARF: Extract function names, variable names, and source paths from DWARF sections using gimli
  • DWARF: Add --skip-dwarf flag for performance tuning
  • DWARF: Tag extracted strings appropriately (dwarf, symbol, filepath)
  • Mach-O: Parse LC_LOAD_DYLIB, LC_RPATH, and LC_ID_DYLIB load commands
  • Mach-O: Create ExtractionSource::MachOLoadCommand
  • Mach-O: Preserve @rpath, @executable_path, and @loader_path variables
  • Go: Detect and parse the .go.buildinfo section or the \xff Go buildinf: magic marker
  • Go: Extract module path, Go version, build settings, and VCS metadata
  • Go: Add dedicated filtering support (--only go-buildinfo)
  • Testing: Add test binaries for each format (stripped/unstripped ELF with DWARF, Mach-O with load commands, Go binaries)
  • Documentation: Update README with examples of new features
  • Performance: Ensure DWARF parsing doesn't significantly impact performance on large binaries

Dependencies

Crates to Add

  • gimli (~0.31) - DWARF parsing
  • gobuild (~0.1) or implement custom parser - Go buildinfo parsing

Existing Crates to Leverage

  • goblin - Already provides Mach-O load command APIs

Implementation Order

  1. Mach-O Load Commands (Lowest complexity, high value for macOS binaries)
  2. Go Build Info (Medium complexity, well-defined format)
  3. DWARF Extraction (Highest complexity, largest scope)

Related Issues

  • Depends on core extraction pipeline improvements
  • Coordinates with ranking system enhancements
  • May spawn sub-issues for each feature during implementation

Timeline

Target: Q1 2025 (aligns with v0.4 milestone)

Labels

  • area:analyzer
  • enhancement
  • epic
  • lang:rust
  • priority:high
  • story-points: 21
  • type:enhancement
