feat: Implement Mach-O load command extraction with section weight normalization and enhanced tagging#67
Conversation
…test - Updated the assertion for the resources field in the PE integration test to clarify that it may be None for minimal binaries, such as those compiled without resource files. - Removed redundant checks to streamline the test logic while maintaining clarity in the comments. This change enhances the readability of the test and ensures accurate expectations regarding resource availability in PE binaries. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Updated the section weight calculations in the Mach-O parser to use a normalized scale (0.0-1.0) for consistency with other formats. - Adjusted weights for various section types, including string data, read-only data, and code sections, to better reflect their likelihood of containing meaningful strings. - Enhanced section classification to include additional Objective-C related sections. - Added unit tests to validate the new weight calculations and classifications. This refactor improves the accuracy of string extraction from Mach-O binaries, aligning it with the established standards for ELF and PE formats. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Introduced a new module for extracting load command strings from Mach-O binaries, including library dependency paths and runtime search paths. - Enhanced the `Tag` enum with new variants for `DylibPath`, `Rpath`, `RpathVariable`, and `FrameworkPath` to support the new extraction functionality. - Updated the extraction module documentation with usage examples and detailed descriptions of the extraction process. - Added integration tests to validate the load command extraction functionality against a Mach-O fixture. This feature improves the ability to analyze Mach-O binaries by enabling the extraction of meaningful load command strings, which are crucial for understanding library dependencies and runtime behavior. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Added helper functions for extracting and sorting dylib paths, rpaths, and framework paths from load command strings. - Updated assertions in the Mach-O integration tests to verify the presence of exports and ensure correct tagging of load command strings. - Introduced snapshot tests for load command string extraction, providing a detailed breakdown of dylib paths, rpaths, and framework paths. - Enhanced documentation in the fixtures README to clarify the purpose and expected contents of the `test_binary_macho` fixture. These improvements strengthen the testing framework for Mach-O binaries, ensuring comprehensive validation of load command extraction and classification. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. Summary by CodeRabbit
WalkthroughNormalized Mach-O section weight scale to 0.0–1.0, expanded string-section classifications, added a new extractor module to pull dylib and rpath strings from Mach-O load commands, introduced four Tag variants, updated docs/fixtures, and added comprehensive Mach-O tests and minor PE test relaxation. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller as Caller
participant Extractor as extract_load_command_strings()
participant Goblin as goblin Parser
participant Classifier as classify_* helpers
participant Result as Vec<FoundString>
Caller->>Extractor: data: &[u8]
Extractor->>Goblin: parse (fat? -> extract arch -> parse single)
Goblin->>Goblin: iterate load commands
alt dylib load commands (LC_LOAD_DYLIB / _WEAK / _REEXPORT)
Goblin->>Classifier: classify_dylib_path(path)
Classifier-->>Goblin: tags [DylibPath, (FrameworkPath?)]
else rpath (LC_RPATH)
Goblin->>Classifier: classify_rpath(path)
Classifier-->>Goblin: tags [Rpath, (RpathVariable?), (FrameworkPath?)]
end
Goblin->>Result: build FoundString (text, tags, encoding, length, source)
Extractor-->>Caller: Vec<FoundString>
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25–30 minutes
Possibly related issues
Possibly related PRs
Suggested labels
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro Disabled knowledge base sources:
⛔ Files ignored due to path filters (1)
📒 Files selected for processing (3)
🧰 Additional context used🧬 Code graph analysis (2)tests/integration_macho.rs (1)
src/extraction/macho_load_commands.rs (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
🔇 Additional comments (20)
Comment |
|
Documentation Updates 1 document(s) were updated by changes in this PR: stringy |
|
@coderabbitai Fix the PR title. |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 500 🏁 Script executed: Length of output: 59 @unclesp1d3r: I've fixed the PR title. Old title: New title: The title now follows the conventional commit format (with |
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/types.rs (1)
13-47: Flag:#[non_exhaustive]here breaks existing consumers
Tagis part of the public API. Flipping it to#[non_exhaustive]forces every downstream exhaustivematchto add a wildcard arm, so current consumers will stop compiling unless we ship a breaking release. Either coordinate a major version bump (even in 0.x terms) or hold off on this annotation so we don’t strand users mid-release.
🧹 Nitpick comments (2)
src/extraction/macho_load_commands.rs (1)
145-171: Also flag@-based dylib paths asRpathVariableDylib load commands often carry
@rpath/@loader_path/@executable_pathprefixes (e.g. framework stubs), butclassify_dylib_pathcurrently ignores them so those strings miss the newRpathVariabletag. Mirroring the check you have inclassify_rpathkeeps tagging consistent and makes it easier to surface the placeholders regardless of which load command they appear in.fn classify_dylib_path(path: &str) -> Vec<Tag> { - let mut tags = vec![Tag::DylibPath, Tag::FilePath]; + let mut tags = vec![Tag::DylibPath, Tag::FilePath]; + + if contains_rpath_variable(path) { + tags.push(Tag::RpathVariable); + } if is_framework_path(path) { tags.push(Tag::FrameworkPath); }tests/integration_macho.rs (1)
578-581: Simplify redundant logical condition.Since
system_librariesis a filtered subset ofdylib_paths, the condition!system_libraries.is_empty() || !dylib_paths.is_empty()is logically equivalent to just!dylib_paths.is_empty().- assert!( - !system_libraries.is_empty() || !dylib_paths.is_empty(), - "Should find at least some system libraries or dylib dependencies" - ); + assert!( + !dylib_paths.is_empty(), + "Should find at least some dylib dependencies" + );
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (1)
tests/snapshots/integration_macho__macho_load_command_strings.snapis excluded by!**/*.snap
📒 Files selected for processing (7)
src/container/macho.rs(7 hunks)src/extraction/macho_load_commands.rs(1 hunks)src/extraction/mod.rs(2 hunks)src/types.rs(2 hunks)tests/fixtures/README.md(3 hunks)tests/integration_macho.rs(4 hunks)tests/integration_pe.rs(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
src/extraction/macho_load_commands.rs (1)
src/container/macho.rs (2)
parse(291-298)extract_architecture_data(335-350)
tests/integration_macho.rs (1)
src/extraction/macho_load_commands.rs (1)
extract_load_command_strings(49-70)
src/container/macho.rs (2)
src/container/elf.rs (2)
classify_section(54-81)calculate_section_weight(25-51)src/container/pe.rs (2)
classify_section(139-178)calculate_section_weight(113-136)
src/extraction/mod.rs (1)
src/extraction/macho_load_commands.rs (1)
extract_load_command_strings(49-70)
🪛 LanguageTool
tests/fixtures/README.md
[uncategorized] ~8-~8: Possible missing comma found.
Context: ...64 Mach-O binary (contains typical load commands including LC_LOAD_DYLIB for system libr...
(AI_HYDRA_LEO_MISSING_COMMA)
🪛 markdownlint-cli2 (0.18.1)
tests/fixtures/README.md
59-59: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔍 Remote MCP Linear
Based on the gathered context, I have sufficient information to provide a comprehensive review summary. Let me compile the findings:
PR Review Context Summary
Key Changes Overview
This PR implements Mach-O load command string extraction and normalizes section weight scoring. The changes span four main areas:
1. Mach-O Section Weight Normalization ``
- Weight scale normalized from 1-10 to 0.0-1.0 across all section types
- String-related sections (
__cstring,__const,__cfstring) reduced proportionally (e.g., 10.0→1.0, 9.0→0.7) - New classifications for Objective-C sections (
__objc_methname,__objc_classname,__ustring) - Non-string data weights reduced significantly (ReadOnlyData: 7.0→0.4, Code: 1.0→0.1)
2. New Load Command Extraction Module ``
The PR adds src/extraction/macho_load_commands.rs with:
- Extraction of dylib paths from
LC_LOAD_DYLIB,LC_LOAD_WEAK_DYLIB,LC_REEXPORT_DYLIB - Runtime path extraction from
LC_RPATHcommands - Tagging system with
DylibPath,FilePath,FrameworkPath,Rpath,RpathVariablevariants - Support for framework detection and @-variable identification
3. Tag Enum Extension ``
New Tag enum variants added:
DylibPath(dylib-path),Rpath(rpath),RpathVariable(rpath-var),FrameworkPath(framework-path)- Enum marked as
#[non_exhaustive]for future extensibility - Serde serialization mappings for JSON output
4. Comprehensive Test Coverage
Integration tests validate:
- Load command extraction with source/encoding verification
- Tag integrity across DylibPath, Rpath, FrameworkPath classifications
- Framework and @-variable detection accuracy
- Empty/invalid input handling
- Metadata correctness (section=None, UTF-8 encoding, LoadCommand source)
Strategic Context
This work advances PROD-275 (MVP Pipeline Epic) by completing Mach-O-specific extraction capabilities. It directly implements requirements from PROD-298 (Load Command Parsing) and PROD-294 (Import/Export Integration). The section weight normalization aligns with foundation work in PROD-237-239 (ELF/PE section classification standards).
Integration Points
The PR integrates with existing infrastructure:
- Uses
goblincrate for Mach-O binary parsing - Extends
FoundStringdata structure for metadata tracking - Plugs into modular extraction pipeline architecture
- Compatible with section-based string extraction workflow,
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: test-cross-platform (macos-latest, macOS)
- GitHub Check: test-cross-platform (windows-latest, Windows)
- GitHub Check: CodeQL Analyze
🔇 Additional comments (13)
src/container/macho.rs (1)
95-101: Comprehensive Objective-C coverage.Capturing
__objc_methname,__objc_classname, and__ustringasStringDatakeeps the classifier aligned with real-world string reservoirs—great attention to detail.tests/integration_pe.rs (1)
32-33: Good call-out on optional resourcesThe note makes the intent around
ContainerInfo::resourcesmuch clearer for the minimal PE fixture. Thanks for documenting that nuance.src/extraction/mod.rs (1)
14-38: Example update looks solidThe refreshed docs and example nicely surface the new Mach-O extractor alongside the existing PE helpers.
tests/integration_macho.rs (10)
12-44: Well-structured test helpers.The helper functions provide clean abstractions for filtering and sorting load command strings by tag type. The consistent pattern and deterministic sorting (by text) ensure reliable test assertions.
69-84: Appropriate relaxation of export assertions.The change from checking specific symbol names to verifying non-empty exports is sensible, as Mach-O executable symbol export behavior can vary. The inline comment clearly documents the rationale.
143-250: Comprehensive load command extraction test.This test thoroughly validates the extraction pipeline: source/encoding correctness, tag presence, framework path relationships, and rpath variable detection. The diagnostic output aids debugging and the assertion coverage is appropriate.
252-315: Effective snapshot test for regression protection.The snapshot test provides human-readable output with categorization (frameworks, @-variables) and count summaries. Limiting to 20 items per category with overflow indication balances detail with snapshot size.
317-383: Thorough tag relationship validation.This test validates critical invariants: DylibPath↔FilePath, RpathVariable↔Rpath, FrameworkPath↔(DylibPath|Rpath), and mutual exclusivity of DylibPath/Rpath. The comprehensive assertions with descriptive messages ensure tag integrity.
385-445: Bidirectional framework path validation.The test validates both positive and negative cases: strings with
.frameworkmust have FrameworkPath tags, and strings without must not. The categorization by dylib/rpath frameworks provides useful diagnostic context.
447-508: Comprehensive rpath variable detection test.The bidirectional validation ensures RpathVariable tags are correctly applied based on the presence of @-variables. The diagnostic output detailing which specific variables are found (@rpath, @executable_path, @loader_path) aids debugging.
510-526: Good defensive test for invalid input.This test ensures graceful error handling for both empty and invalid input, verifying the function returns an empty vector without panicking. Essential coverage for robustness.
528-582: Solid dylib path classification test.The test validates FilePath tag presence on all DylibPath strings and categorizes paths by type (system, framework, other). The categorization logic and diagnostic output are well-structured.
584-624: Thorough metadata validation test.This test validates critical metadata fields: section (None for load commands), length (matches text), source (LoadCommand), and encoding (UTF-8). The comment acknowledging unspecified offset/rva values allows for future enhancements without breaking existing tests.
There was a problem hiding this comment.
Pull Request Overview
This pull request implements comprehensive Mach-O load command string extraction with enhanced tagging and normalized section weight scoring. The changes add support for extracting library dependency paths and runtime search paths from Mach-O binaries, expand the tagging system with Mach-O-specific classifications, and normalize Mach-O section weights to a 0.0-1.0 scale (while ELF and PE still use 1-10).
Key Changes:
- Added
macho_load_commands.rsmodule withextract_load_command_strings()function to extract dylib paths (LC_LOAD_DYLIB, LC_LOAD_WEAK_DYLIB, LC_REEXPORT_DYLIB) and runtime search paths (LC_RPATH) from Mach-O load commands - Extended Tag enum with four new tags:
DylibPath,Rpath,RpathVariable, andFrameworkPath, marked enum as#[non_exhaustive]for future extensibility - Normalized Mach-O section weights to 0.0-1.0 scale and added classification for
__objc_methname,__objc_classname, and__ustringsections
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| src/extraction/macho_load_commands.rs | New module implementing load command string extraction with classification and tagging logic for dylib paths, rpaths, and framework paths |
| src/extraction/mod.rs | Updated module documentation and exports to expose extract_load_command_strings() function |
| src/types.rs | Added four new Tag variants for Mach-O load commands and marked Tag enum as non_exhaustive |
| src/container/macho.rs | Normalized section weights to 0.0-1.0 scale, added classification for additional Objective-C sections, updated tests |
| tests/integration_macho.rs | Added comprehensive test suite for load command extraction including snapshot tests, tag validation, and edge case handling |
| tests/integration_pe.rs | Removed tautological assertion checking resources field existence |
| tests/fixtures/README.md | Enhanced documentation describing Mach-O load command contents and build instructions |
| tests/snapshots/integration_macho__macho_load_command_strings.snap | Snapshot test output for Mach-O load command string extraction |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if offset + size <= data.len() { | ||
| Ok(&data[offset..offset + size]) |
There was a problem hiding this comment.
Potential integer overflow when adding offset + size. If both values are large, this could overflow before the comparison with data.len(). Use checked arithmetic:
let offset = arch.offset as usize;
let size = arch.size as usize;
if let Some(end) = offset.checked_add(size) {
if end <= data.len() {
return Ok(&data[offset..end]);
}
}
Err(())| if offset + size <= data.len() { | |
| Ok(&data[offset..offset + size]) | |
| if let Some(end) = offset.checked_add(size) { | |
| if end <= data.len() { | |
| Ok(&data[offset..end]) | |
| } else { | |
| Err(()) | |
| } |
|
@copilot open a new pull request to apply changes based on the comments in this thread |
|
@unclesp1d3r I've opened a new pull request, #103, to work on those changes. Once the pull request is ready, I'll request review from you. |
…103) * Initial plan * Apply code review fixes: error handling, integer overflow, formatting Co-authored-by: unclesp1d3r <251112+unclesp1d3r@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: unclesp1d3r <251112+unclesp1d3r@users.noreply.github.com>
…rmalization and enhanced tagging (#67) * refactor(tests): Simplify resource field assertion in PE integration test - Updated the assertion for the resources field in the PE integration test to clarify that it may be None for minimal binaries, such as those compiled without resource files. - Removed redundant checks to streamline the test logic while maintaining clarity in the comments. This change enhances the readability of the test and ensures accurate expectations regarding resource availability in PE binaries. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> * refactor(macho): Normalize section weights for Mach-O format - Updated the section weight calculations in the Mach-O parser to use a normalized scale (0.0-1.0) for consistency with other formats. - Adjusted weights for various section types, including string data, read-only data, and code sections, to better reflect their likelihood of containing meaningful strings. - Enhanced section classification to include additional Objective-C related sections. - Added unit tests to validate the new weight calculations and classifications. This refactor improves the accuracy of string extraction from Mach-O binaries, aligning it with the established standards for ELF and PE formats. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> * feat(macho): Add load command string extraction for Mach-O binaries - Introduced a new module for extracting load command strings from Mach-O binaries, including library dependency paths and runtime search paths. - Enhanced the `Tag` enum with new variants for `DylibPath`, `Rpath`, `RpathVariable`, and `FrameworkPath` to support the new extraction functionality. - Updated the extraction module documentation with usage examples and detailed descriptions of the extraction process. - Added integration tests to validate the load command extraction functionality against a Mach-O fixture. This feature improves the ability to analyze Mach-O binaries by enabling the extraction of meaningful load command strings, which are crucial for understanding library dependencies and runtime behavior. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> * feat(tests): Enhance Mach-O load command extraction tests - Added helper functions for extracting and sorting dylib paths, rpaths, and framework paths from load command strings. - Updated assertions in the Mach-O integration tests to verify the presence of exports and ensure correct tagging of load command strings. - Introduced snapshot tests for load command string extraction, providing a detailed breakdown of dylib paths, rpaths, and framework paths. - Enhanced documentation in the fixtures README to clarify the purpose and expected contents of the `test_binary_macho` fixture. These improvements strengthen the testing framework for Mach-O binaries, ensuring comprehensive validation of load command extraction and classification. Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> * fix: Address code review feedback for Mach-O load command extraction (#103) * Initial plan * Apply code review fixes: error handling, integer overflow, formatting Co-authored-by: unclesp1d3r <251112+unclesp1d3r@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: unclesp1d3r <251112+unclesp1d3r@users.noreply.github.com> --------- Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
This pull request adds comprehensive support for extracting and classifying load command strings from Mach-O binaries, including library dependency paths and runtime search paths. It introduces a new extraction module, updates the tagging system to better classify extracted strings, and normalizes section weight scoring for Mach-O to a 0.0–1.0 scale for consistency. Extensive tests and documentation have been added to ensure correctness and ease of use.
Addresses issues #6 and #7
Mach-O Load Command Extraction:
src/extraction/macho_load_commands.rsto extract library paths (LC_LOAD_DYLIB,LC_LOAD_WEAK_DYLIB,LC_REEXPORT_DYLIB) and runtime search paths (LC_RPATH) from Mach-O load commands, with robust tagging and test coverage.extract_load_command_strings()and provided usage examples for Mach-O binaries. [1] [2]Tagging and Classification Improvements:
Tagenum insrc/types.rswith new tags:DylibPath,Rpath,RpathVariable, andFrameworkPath, and marked it as#[non_exhaustive]for future extensibility. [1] [2]Section Weight Normalization and Classification:
__objc_methname,__objc_classname,__ustring) as high-priority for extraction. [1] [2] [3]Documentation and Fixture Updates:
tests/fixtures/README.mdto describe Mach-O binary contents and provide instructions for generating test binaries with rpaths and framework dependencies. [1] [2]