Implement comprehensive text-based magic file parser #11

@unclesp1d3r

Overview

Implement a complete text-based magic file parser that reads entire files and converts them into a hierarchical tree of MagicRule structures. This is a critical component for Phase 1 MVP completion, as it bridges the gap between existing parser components (offsets, types, operators, values) and the evaluator engine.

Background

The project has completed core parsing components in src/parser/grammar.rs:

  • parse_number - Parses decimal, hex, and octal numbers
  • parse_offset - Parses offset specifications (absolute, indirect, relative)
  • parse_operator - Parses comparison and bitwise-AND operators (=, !=, <, >, &)
  • parse_value - Parses values (strings, numbers, byte sequences)

The AST structures in src/parser/ast.rs are also complete with full serialization support.

What's Missing: A higher-level parser that orchestrates these components to parse complete magic files line-by-line, handling:

  • File-level structure and organization
  • Line continuation and comments
  • Hierarchical rule nesting based on indentation
  • Error reporting with line numbers
  • Special directives (!:mime, !:strength, etc.)

Magic File Format Reference

Magic files follow this structure:

# Comment lines start with #
offset  type  operator  value  message

# Example: ELF file detection
0       string    \x7fELF         ELF
>4      byte      1               32-bit
>4      byte      2               64-bit
>>16    leshort   >0              executable

# Continuation: a rule line ending with a backslash continues on the next line
0       string    PK\003\004     ZIP archive data, \
        at least v2.0 to extract

Key Features:

  • Level 0 rules: Start with offset (0, 16, 0x20)
  • Child rules: Prefixed with > characters (>, >>, >>>)
  • Comments: Lines starting with #
  • Empty lines: Should be ignored
  • Continuation: Lines ending with \ continue on next line
  • Special directives: !:mime, !:strength, !:ext

See docs/src/magic-format.md for complete format specification.
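
The features above can be sketched as a small line classifier. This is only an illustration, assuming continuation lines have already been joined; `LineKind` and `classify_line` are hypothetical names, not existing project APIs.

```rust
/// Hypothetical classification of one logical line of a magic file.
#[derive(Debug, PartialEq)]
enum LineKind<'a> {
    /// Empty or whitespace-only line.
    Blank,
    /// Line whose first non-blank character is `#`.
    Comment,
    /// A `!:` directive; the payload is the text after `!:`.
    Directive(&'a str),
    /// A rule line: number of leading `>` characters and the remainder.
    Rule { level: u32, rest: &'a str },
}

/// Classify a line using only the features listed in this issue.
fn classify_line(line: &str) -> LineKind<'_> {
    let trimmed = line.trim();
    if trimmed.is_empty() {
        return LineKind::Blank;
    }
    if trimmed.starts_with('#') {
        return LineKind::Comment;
    }
    if let Some(payload) = trimmed.strip_prefix("!:") {
        return LineKind::Directive(payload.trim());
    }
    // `>` is one byte, so the length difference is the number of prefixes.
    let rest = trimmed.trim_start_matches('>');
    let level = (trimmed.len() - rest.len()) as u32;
    LineKind::Rule { level, rest }
}
```

Borrowing from the input (`&'a str`) avoids allocating per line; a real implementation may prefer owned strings if lines are joined first.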

Technical Requirements

Core Function Signature

/// Parse a complete text-based magic file
///
/// # Arguments
/// * `input` - String content of the magic file
///
/// # Returns
/// * `Result<Vec<MagicRule>, ParseError>` - Top-level rules with nested children
///
/// # Errors
/// Returns ParseError with line number and description for:
/// - Invalid syntax
/// - Unrecognized types or operators
/// - Malformed offset specifications
/// - Orphaned child rules (> without parent)
pub fn parse_text_magic_file(input: &str) -> Result<Vec<MagicRule>, ParseError> {
    // Placeholder so the signature compiles until the parser is implemented
    todo!("implement parse_text_magic_file")
}
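
To make the error contract above concrete, here is a minimal sketch of an error value that carries a line number and a description. The real `ParseError` lives in src/error.rs and its shape is not shown in this issue, so `SketchParseError` and `ParseErrorKind` are hypothetical stand-ins.

```rust
use std::fmt;

/// Hypothetical error categories, mirroring the list in the doc comment above.
#[derive(Debug, PartialEq)]
enum ParseErrorKind {
    InvalidSyntax,
    UnrecognizedType,
    UnrecognizedOperator,
    MalformedOffset,
    OrphanedChild,
}

/// Hypothetical stand-in for the project's `ParseError`.
#[derive(Debug, PartialEq)]
struct SketchParseError {
    /// 1-based line number in the original magic file.
    line: usize,
    kind: ParseErrorKind,
    /// Free-form description of what went wrong.
    detail: String,
}

impl fmt::Display for SketchParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let what = match self.kind {
            ParseErrorKind::InvalidSyntax => "Invalid syntax",
            ParseErrorKind::UnrecognizedType => "Unrecognized type",
            ParseErrorKind::UnrecognizedOperator => "Unrecognized operator",
            ParseErrorKind::MalformedOffset => "Invalid offset specification",
            ParseErrorKind::OrphanedChild => "Child rule without a parent",
        };
        write!(f, "{what} at line {}: {}", self.line, self.detail)
    }
}

impl std::error::Error for SketchParseError {}
```

Keeping the line number as a field rather than baking it into the message lets callers sort or group errors by line if error collection is added later.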

Implementation Components

  1. Line Processing Pipeline

    • Skip comment lines (first non-blank character is #); a # later in a line is part of the value or message
    • Skip empty lines
    • Handle continuation lines (join lines ending with \)
    • Track original line numbers for error reporting
  2. Rule Level Detection

    • Count leading > characters to determine hierarchy level
    • Level 0: No > prefix
    • Level 1: > prefix
    • Level 2: >> prefix, etc.
  3. Rule Parsing

    • Extract offset, type, operator, value, and message from each line
    • Use existing parse_offset, parse_value, etc. from grammar.rs
    • Handle optional operator (default to Operator::Equal)
    • Parse message text (may contain escape sequences)
  4. Hierarchy Building

    • Maintain a stack of parent rules at each level
    • Attach child rules to the appropriate parent based on level
    • Validate that child rules have valid parents
    • Error if level increases by more than 1
  5. Special Directive Handling (optional for v1)

    • !:mime - MIME type metadata
    • !:strength - Match strength/priority
    • !:ext - File extension hints
    • Store as metadata on the last parsed rule
  6. Error Handling

    • Include line number in all error messages
    • Provide descriptive error messages (e.g., "Invalid offset specification at line 42")
    • Continue parsing after non-fatal errors (optional: collect all errors)
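
For component 5, one possible shape for directive handling is sketched below. It assumes directive arguments are kept as raw strings and attached to a separate metadata struct; `Directive`, `RuleMeta`, `parse_directive`, and `apply_directive` are hypothetical names, and the real MagicRule may store this differently.

```rust
/// Hypothetical representation of the three directives named in this issue.
/// Arguments are kept as raw text; interpreting them is left to the caller.
#[derive(Debug, PartialEq)]
enum Directive {
    Mime(String),
    Strength(String),
    Ext(String),
}

/// Hypothetical metadata holder attached to the last parsed rule.
#[derive(Debug, Default, PartialEq)]
struct RuleMeta {
    mime: Option<String>,
    strength: Option<String>,
    ext: Option<String>,
}

/// Parse a `!:name argument` line; returns None for anything else,
/// including directives this sketch does not know about.
fn parse_directive(line: &str) -> Option<Directive> {
    let body = line.trim().strip_prefix("!:")?;
    let (name, arg) = match body.split_once(char::is_whitespace) {
        Some((name, arg)) => (name, arg.trim()),
        None => (body, ""),
    };
    match name {
        "mime" => Some(Directive::Mime(arg.to_string())),
        "strength" => Some(Directive::Strength(arg.to_string())),
        "ext" => Some(Directive::Ext(arg.to_string())),
        _ => None,
    }
}

/// Store a directive on the metadata of the most recently parsed rule.
fn apply_directive(meta: &mut RuleMeta, directive: Directive) {
    match directive {
        Directive::Mime(value) => meta.mime = Some(value),
        Directive::Strength(value) => meta.strength = Some(value),
        Directive::Ext(value) => meta.ext = Some(value),
    }
}
```

Whether an unknown directive should be an error or silently skipped is a design decision this sketch leaves open by returning None.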

Proposed Solution

Phase 1: Basic Line Processing

// In src/parser/mod.rs

struct LineInfo {
    content: String,
    line_number: usize,
    level: u32,
}

fn preprocess_lines(input: &str) -> Result<Vec<LineInfo>, ParseError> {
    // 1. Handle continuation lines
    // 2. Strip comments
    // 3. Detect hierarchy level (count >)
    // 4. Track line numbers
    todo!("implement preprocess_lines")
}
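
A minimal sketch of this preprocessing step, using only the standard library. It makes several assumptions that are not settled in this issue: comment lines are skipped whole and never continue, the leading > characters are stripped from `content`, a joined line reports the number of its first physical line, and a trailing escaped backslash is not treated specially. The function is named `preprocess_lines_sketch` and returns a plain Vec because this simplified version has no failure cases.

```rust
#[derive(Debug, PartialEq)]
struct LineInfo {
    content: String,
    line_number: usize,
    level: u32,
}

/// Record one logical line, skipping blanks and comment lines.
fn push_logical_line(out: &mut Vec<LineInfo>, line_number: usize, text: &str) {
    let trimmed = text.trim();
    if trimmed.is_empty() || trimmed.starts_with('#') {
        return;
    }
    // `>` is one byte, so the length difference is the nesting level.
    let content = trimmed.trim_start_matches('>');
    let level = (trimmed.len() - content.len()) as u32;
    out.push(LineInfo { content: content.to_string(), line_number, level });
}

fn preprocess_lines_sketch(input: &str) -> Vec<LineInfo> {
    let mut out = Vec::new();
    // An open continuation: (first physical line number, text joined so far).
    let mut pending: Option<(usize, String)> = None;

    for (idx, raw) in input.lines().enumerate() {
        let line_number = idx + 1;
        let (start, text) = match pending.take() {
            // Append to the open logical line, dropping the indentation.
            Some((start, mut joined)) => {
                joined.push_str(raw.trim_start());
                (start, joined)
            }
            None => {
                // Assumption: comment lines never start a continuation.
                if raw.trim_start().starts_with('#') {
                    continue;
                }
                (line_number, raw.to_string())
            }
        };
        match text.strip_suffix('\\') {
            Some(head) => pending = Some((start, head.to_string())),
            None => push_logical_line(&mut out, start, &text),
        }
    }
    // A dangling backslash on the last line still yields a logical line.
    if let Some((start, text)) = pending {
        push_logical_line(&mut out, start, &text);
    }
    out
}
```

Joining continuations before anything else keeps the later phases line-oriented, at the cost of reporting errors against the first physical line only.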

Phase 2: Rule Parsing

fn parse_magic_rule_line(line: &LineInfo) -> Result<MagicRule, ParseError> {
    // Use nom combinators with existing grammar.rs functions
    // Pattern: offset  type  [operator]  value  message
    todo!("implement parse_magic_rule_line")
}
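
Before wiring in nom and the grammar.rs parsers, the field layout can be illustrated with a plain whitespace split. This is a simplified sketch, not the intended implementation: it assumes fields are separated by unescaped whitespace (so a value containing an escaped space would be split incorrectly) and that a missing operator means "=". `RawRule`, `next_field`, `split_operator`, and `split_rule_fields` are hypothetical names.

```rust
/// Hypothetical unparsed view of one rule line (level prefix already removed).
#[derive(Debug, PartialEq)]
struct RawRule<'a> {
    offset: &'a str,
    type_name: &'a str,
    operator: &'a str,
    value: &'a str,
    message: &'a str,
}

/// Take the next whitespace-delimited field and return it with the remainder.
fn next_field(s: &str) -> Option<(&str, &str)> {
    let s = s.trim_start();
    if s.is_empty() {
        return None;
    }
    match s.find(char::is_whitespace) {
        Some(end) => Some((&s[..end], &s[end..])),
        None => Some((s, "")),
    }
}

/// Split an optional leading operator off the test field.
/// Falls back to "=" when no operator from this issue's list is present.
fn split_operator(test: &str) -> (&str, &str) {
    for op in ["!=", "=", "<", ">", "&"] {
        if let Some(value) = test.strip_prefix(op) {
            if !value.is_empty() {
                return (op, value);
            }
        }
    }
    ("=", test)
}

/// Split `offset  type  [operator]value  message`; the message may be empty.
fn split_rule_fields(content: &str) -> Option<RawRule<'_>> {
    let (offset, rest) = next_field(content)?;
    let (type_name, rest) = next_field(rest)?;
    let (test, rest) = next_field(rest)?;
    let (operator, value) = split_operator(test);
    Some(RawRule { offset, type_name, operator, value, message: rest.trim() })
}
```

Each raw field would then be handed to the matching grammar.rs function (parse_offset, parse_operator, parse_value), which is where escape sequences and numeric formats are actually validated.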

Phase 3: Hierarchy Construction

fn build_rule_hierarchy(lines: Vec<LineInfo>) -> Result<Vec<MagicRule>, ParseError> {
    // Stack-based approach to build parent-child relationships
    // Validate level transitions
    todo!("implement build_rule_hierarchy")
}
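
The stack-based approach can be sketched as follows. To stay self-contained, the sketch works on (level, line number, rule) tuples and a simplified `SketchRule` with only a message and children; both are assumptions for illustration, and the real function would operate on LineInfo and MagicRule and return ParseError instead of a String.

```rust
/// Hypothetical simplified rule: just a message and nested children.
#[derive(Debug, PartialEq)]
struct SketchRule {
    message: String,
    children: Vec<SketchRule>,
}

fn rule(message: &str) -> SketchRule {
    SketchRule { message: message.to_string(), children: Vec::new() }
}

/// Pop finished rules until the stack holds `depth` open rules,
/// attaching each popped rule to its parent (or to the top-level list).
fn close_to(stack: &mut Vec<SketchRule>, roots: &mut Vec<SketchRule>, depth: usize) {
    while stack.len() > depth {
        let done = stack.pop().expect("loop condition guarantees a rule");
        match stack.last_mut() {
            Some(parent) => parent.children.push(done),
            None => roots.push(done),
        }
    }
}

/// Build the tree from rules in file order. `stack[i]` is the rule
/// currently open at level `i`.
fn build_hierarchy_sketch(
    items: Vec<(u32, usize, SketchRule)>,
) -> Result<Vec<SketchRule>, String> {
    let mut roots = Vec::new();
    let mut stack: Vec<SketchRule> = Vec::new();

    for (level, line_number, item) in items {
        let level = level as usize;
        // Covers both orphaned children and jumps of more than one level.
        if level > stack.len() {
            return Err(format!(
                "line {line_number}: level {level} rule has no parent at level {}",
                level - 1
            ));
        }
        close_to(&mut stack, &mut roots, level);
        stack.push(item);
    }
    close_to(&mut stack, &mut roots, 0);
    Ok(roots)
}
```

Because a rule is attached to its parent only when it is closed, children can be pushed onto it without holding mutable references into the tree, which keeps the borrow checker out of the way.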

Phase 4: Integration

pub fn parse_text_magic_file(input: &str) -> Result<Vec<MagicRule>, ParseError> {
    let lines = preprocess_lines(input)?;
    // build_rule_hierarchy takes Vec<LineInfo> (Phase 3), so pass the lines
    // through unchanged. It calls parse_magic_rule_line for each line, which
    // keeps every rule's level and line number available while nesting.
    build_rule_hierarchy(lines)
}

Testing Requirements

Unit Tests (Required)

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_parse_simple_rule() {
        let input = "0    string    PK\\x03\\x04    ZIP archive";
        let rules = parse_text_magic_file(input).unwrap();
        assert_eq!(rules.len(), 1);
        assert_eq!(rules[0].message, "ZIP archive");
    }

    #[test]
    fn test_parse_hierarchical_rules() {
        let input = r#"
0       string    \x7fELF         ELF
>4      byte      1               32-bit
>4      byte      2               64-bit
        "#;
        let rules = parse_text_magic_file(input).unwrap();
        assert_eq!(rules.len(), 1);
        assert_eq!(rules[0].children.len(), 2);
    }

    #[test]
    fn test_parse_comments_and_empty_lines() {
        let input = r#"
# This is a comment

0       string    test    Test file
        "#;
        let rules = parse_text_magic_file(input).unwrap();
        assert_eq!(rules.len(), 1);
    }

    #[test]
    fn test_parse_continuation_lines() {
        let input = "0    string    test    Long message \\\n        continued here";
        let rules = parse_text_magic_file(input).unwrap();
        assert!(rules[0].message.contains("continued"));
    }

    #[test]
    fn test_error_orphaned_child() {
        let input = ">4    byte    1    orphaned";
        assert!(parse_text_magic_file(input).is_err());
    }

    #[test]
    fn test_error_invalid_level_jump() {
        let input = r#"
0       string    test    Parent
>>>4    byte      1       Invalid jump
        "#;
        assert!(parse_text_magic_file(input).is_err());
    }
}

Integration Tests (Recommended)

  • Parse actual magic files from third_party/tests/*.magic
  • Validate against known-good outputs
  • Performance testing with large magic databases

Acceptance Criteria

  • parse_text_magic_file function implemented in src/parser/mod.rs
  • Line preprocessing handles comments, empty lines, continuation lines
  • Hierarchy detection based on > prefix works correctly
  • Rule parsing integrates existing grammar.rs functions
  • Parent-child relationships built correctly
  • Error messages include line numbers
  • At least 10 unit tests covering various scenarios
  • All existing tests continue to pass
  • Documentation updated with examples
  • Code passes cargo clippy -- -D warnings

Dependencies

  • Existing parser components in src/parser/grammar.rs
  • AST structures in src/parser/ast.rs
  • Error types in src/error.rs

Related Work

  • Phase 1 MVP completion depends on this parser
  • Unblocks evaluator implementation (next major milestone)
  • Enables integration testing with real magic files
