Overview
Implement a complete text-based magic file parser that reads entire files and converts them into a hierarchical tree of MagicRule structures. This is a critical component for Phase 1 MVP completion, as it bridges the gap between existing parser components (offsets, types, operators, values) and the evaluator engine.
Background
The project has completed core parsing components in src/parser/grammar.rs:
- ✅ parse_number - Parses decimal, hex, and octal numbers
- ✅ parse_offset - Parses offset specifications (absolute, indirect, relative)
- ✅ parse_operator - Parses comparison operators (=, !=, <, >, &)
- ✅ parse_value - Parses values (strings, numbers, byte sequences)
The AST structures in src/parser/ast.rs are also complete with full serialization support.
What's Missing: A higher-level parser that orchestrates these components to parse complete magic files line-by-line, handling:
- File-level structure and organization
- Line continuation and comments
- Hierarchical rule nesting based on indentation
- Error reporting with line numbers
- Special directives (!:mime, !:strength, etc.)
Magic File Format Reference
Magic files follow this structure:
# Comment lines start with #
offset type operator value message
# Example: ELF file detection
0 string \x7fELF ELF
>4 byte 1 32-bit
>4 byte 2 64-bit
>>16 leshort >0 executable
# Continuation lines end with a backslash (\)
0 string PK\003\004 ZIP archive data, \
at least v2.0 to extract
Key Features:
- Level 0 rules: Start with an offset (0, 16, 0x20)
- Child rules: Prefixed with > characters (>, >>, >>>)
- Comments: Lines starting with #
- Empty lines: Should be ignored
- Continuation: Lines ending with \ continue on the next line
- Special directives: !:mime, !:strength, !:ext
See docs/src/magic-format.md for complete format specification.
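The features listed above can be sketched as a small line classifier. This is only an illustration: `LineKind` and `classify_line` are hypothetical names that do not exist in the project, and the sketch treats `#` as a comment marker only at the start of a line.

```rust
/// Hypothetical classification of one magic-file line (not part of the existing AST).
#[derive(Debug, PartialEq)]
enum LineKind<'a> {
    Empty,
    Comment,
    /// Directive name and argument, e.g. ("mime", "application/zip").
    Directive(&'a str, &'a str),
    /// Hierarchy level (number of leading `>`) and the remaining rule text.
    Rule(u32, &'a str),
}

fn classify_line(line: &str) -> LineKind<'_> {
    let trimmed = line.trim();
    if trimmed.is_empty() {
        return LineKind::Empty;
    }
    if trimmed.starts_with('#') {
        return LineKind::Comment;
    }
    if let Some(rest) = trimmed.strip_prefix("!:") {
        // Split "mime application/zip" into the directive name and its argument.
        let (name, arg) = match rest.split_once(char::is_whitespace) {
            Some((name, arg)) => (name, arg.trim()),
            None => (rest, ""),
        };
        return LineKind::Directive(name, arg);
    }
    // Count leading '>' characters to find the nesting level.
    // '>' is a single byte, so the count is also a valid byte index.
    let level = trimmed.chars().take_while(|&c| c == '>').count();
    LineKind::Rule(level as u32, &trimmed[level..])
}
```

Under these assumptions, `classify_line(">>16 leshort >0 executable")` yields a level-2 rule whose text starts at the offset, so the `>` in `>0` is left untouched for the operator parser.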
Technical Requirements
Core Function Signature
/// Parse a complete text-based magic file
///
/// # Arguments
/// * `input` - String content of the magic file
///
/// # Returns
/// * `Result<Vec<MagicRule>, ParseError>` - Top-level rules with nested children
///
/// # Errors
/// Returns ParseError with line number and description for:
/// - Invalid syntax
/// - Unrecognized types or operators
/// - Malformed offset specifications
/// - Orphaned child rules (> without parent)
pub fn parse_text_magic_file(input: &str) -> Result<Vec<MagicRule>, ParseError> {
    todo!("Implementation needed")
}
Implementation Components
- Line Processing Pipeline
  - Strip comments (preserve content before #)
  - Skip empty lines
  - Handle continuation lines (join lines ending with \)
  - Track original line numbers for error reporting
- Rule Level Detection
  - Count leading > characters to determine hierarchy level
  - Level 0: No > prefix
  - Level 1: > prefix
  - Level 2: >> prefix, etc.
- Rule Parsing
  - Extract offset, type, operator, value, and message from each line
  - Use existing parse_offset, parse_value, etc. from grammar.rs
  - Handle optional operator (default to Operator::Equal)
  - Parse message text (may contain escape sequences)
- Hierarchy Building
  - Maintain a stack of parent rules at each level
  - Attach child rules to the appropriate parent based on level
  - Validate that child rules have valid parents
  - Error if level increases by more than 1
- Special Directive Handling (optional for v1)
  - !:mime - MIME type metadata
  - !:strength - Match strength/priority
  - !:ext - File extension hints
  - Store as metadata on the last parsed rule
- Error Handling
  - Include line number in all error messages
  - Provide descriptive error messages (e.g., "Invalid offset specification at line 42")
  - Continue parsing after non-fatal errors (optional: collect all errors)
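The error-handling item above, including the optional "collect all errors" mode, might look roughly like the sketch below. `LineError` and `parse_collecting` are hypothetical names used for illustration only; the project's real `ParseError` lives in src/error.rs and its shape is not assumed here.

```rust
use std::fmt;

/// Hypothetical error shape carrying a 1-based line number.
#[derive(Debug, PartialEq)]
struct LineError {
    line_number: usize,
    message: String,
}

impl fmt::Display for LineError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Renders as e.g. "Invalid offset specification at line 42".
        write!(f, "{} at line {}", self.message, self.line_number)
    }
}

/// Run `parse_line` over every line and keep going after failures,
/// returning all successful results together with all errors.
fn parse_collecting<T>(
    input: &str,
    parse_line: impl Fn(&str) -> Result<T, String>,
) -> (Vec<T>, Vec<LineError>) {
    let mut parsed = Vec::new();
    let mut errors = Vec::new();
    for (index, line) in input.lines().enumerate() {
        match parse_line(line) {
            Ok(value) => parsed.push(value),
            Err(message) => errors.push(LineError {
                line_number: index + 1,
                message,
            }),
        }
    }
    (parsed, errors)
}
```

Returning both vectors lets a caller decide whether any error is fatal, which keeps the fail-fast and collect-all behaviours behind the same per-line function.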
Proposed Solution
Phase 1: Basic Line Processing
// In src/parser/mod.rs
struct LineInfo {
    content: String,
    line_number: usize,
    level: u32,
}

fn preprocess_lines(input: &str) -> Result<Vec<LineInfo>, ParseError> {
    // 1. Handle continuation lines
    // 2. Strip comments
    // 3. Detect hierarchy level (count >)
    // 4. Track line numbers
    todo!()
}
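One possible shape for this phase is sketched below, under several simplifying assumptions: it returns a plain `Vec` instead of `Result<_, ParseError>`, it only recognises whole-line comments, it joins continuation lines before looking for comments (so a comment ending in a backslash would absorb the next line), and `push_line` is a hypothetical helper.

```rust
#[derive(Debug, PartialEq)]
struct LineInfo {
    content: String,
    line_number: usize,
    level: u32,
}

fn preprocess_lines(input: &str) -> Vec<LineInfo> {
    let mut out = Vec::new();
    // Text joined so far plus the line number where the logical line began.
    let mut pending: Option<(String, usize)> = None;

    for (index, raw) in input.lines().enumerate() {
        let line_number = index + 1;
        // Start a new logical line or extend one that is being continued.
        let (mut text, start) = match pending.take() {
            Some((mut joined, start)) => {
                joined.push_str(raw.trim_start());
                (joined, start)
            }
            None => (raw.to_string(), line_number),
        };
        // A trailing backslash means the logical line continues on the next one.
        if text.ends_with('\\') {
            text.pop();
            pending = Some((text, start));
            continue;
        }
        push_line(&mut out, &text, start);
    }
    // Input ended in the middle of a continuation: keep what was collected.
    if let Some((text, start)) = pending {
        push_line(&mut out, &text, start);
    }
    out
}

/// Hypothetical helper: skip blank and comment lines, strip the `>` prefix.
fn push_line(out: &mut Vec<LineInfo>, text: &str, line_number: usize) {
    let trimmed = text.trim();
    if trimmed.is_empty() || trimmed.starts_with('#') {
        return;
    }
    let level = trimmed.chars().take_while(|&c| c == '>').count();
    out.push(LineInfo {
        content: trimmed[level..].to_string(),
        line_number,
        level: level as u32,
    });
}
```

Each `LineInfo` records the number of the first physical line, so an error in a continued rule points at the line where the rule starts.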
Phase 2: Rule Parsing
fn parse_magic_rule_line(line: &LineInfo) -> Result<MagicRule, ParseError> {
    // Use nom combinators with existing grammar.rs functions
    // Pattern: offset type [operator] value message
    todo!()
}
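To illustrate the `offset type [operator] value message` pattern without depending on grammar.rs, the following sketch splits a rule line into raw string fields. It is not the nom-based implementation the issue calls for: `RawRule` and `split_rule_line` are hypothetical, fields are split on plain whitespace (so escaped spaces inside values are not handled), and a string value that itself begins with an operator character would be misread.

```rust
/// Hypothetical string-only view of one rule line; the real parser would
/// build a `MagicRule` through the grammar.rs functions instead.
#[derive(Debug, PartialEq)]
struct RawRule {
    offset: String,
    type_name: String,
    operator: String,
    value: String,
    message: String,
}

fn split_rule_line(content: &str) -> Result<RawRule, String> {
    // Take the next whitespace-delimited field and return it with the remainder.
    fn next_field(s: &str) -> Option<(&str, &str)> {
        let s = s.trim_start();
        if s.is_empty() {
            return None;
        }
        let end = s.find(char::is_whitespace).unwrap_or(s.len());
        Some((&s[..end], &s[end..]))
    }

    let (offset, rest) = next_field(content).ok_or("missing offset")?;
    let (type_name, rest) = next_field(rest).ok_or("missing type")?;
    let (test, rest) = next_field(rest).ok_or("missing value")?;

    // The operator is optional and attached to the value; default to "=".
    // "!=" is tried before "=" so the longer operator wins.
    let (operator, value) = ["!=", "=", "<", ">", "&"]
        .iter()
        .find_map(|op| {
            test.strip_prefix(*op)
                .filter(|v| !v.is_empty())
                .map(|v| (*op, v))
        })
        .unwrap_or(("=", test));

    Ok(RawRule {
        offset: offset.to_string(),
        type_name: type_name.to_string(),
        operator: operator.to_string(),
        value: value.to_string(),
        // Everything after the value is the message, spaces included.
        message: rest.trim().to_string(),
    })
}
```

In the real function each raw field would be handed to the matching grammar.rs parser, and a missing operator would map to `Operator::Equal`.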
Phase 3: Hierarchy Construction
fn build_rule_hierarchy(lines: Vec<LineInfo>) -> Result<Vec<MagicRule>, ParseError> {
    // Stack-based approach to build parent-child relationships
    // Validate level transitions
    todo!()
}
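The stack-based approach could be sketched as follows. To stay self-contained, the sketch uses a hypothetical `Node` type in place of `MagicRule` and takes `(level, line_number, message)` tuples in place of parsed rules; errors are plain strings rather than `ParseError`.

```rust
/// Hypothetical stand-in for `MagicRule`, reduced to what the hierarchy needs.
#[derive(Debug, PartialEq)]
struct Node {
    message: String,
    children: Vec<Node>,
}

fn build_hierarchy(flat: Vec<(u32, usize, String)>) -> Result<Vec<Node>, String> {
    let mut roots: Vec<Node> = Vec::new();
    // stack[i] holds the currently open rule at level i.
    let mut stack: Vec<Node> = Vec::new();

    // Pop the deepest open rule and attach it to its parent (or to the roots).
    fn close_top(stack: &mut Vec<Node>, roots: &mut Vec<Node>) {
        if let Some(done) = stack.pop() {
            match stack.last_mut() {
                Some(parent) => parent.children.push(done),
                None => roots.push(done),
            }
        }
    }

    for (level, line_number, message) in flat {
        let level = level as usize;
        // A rule at level N needs an open rule at level N - 1. This rejects
        // both orphaned children and jumps of more than one level.
        if level > stack.len() {
            return Err(format!(
                "Rule at line {} has level {} but no parent at level {}",
                line_number,
                level,
                level - 1
            ));
        }
        // Close every open rule at this level or deeper.
        while stack.len() > level {
            close_top(&mut stack, &mut roots);
        }
        stack.push(Node { message, children: Vec::new() });
    }
    // Flush whatever is still open at end of input.
    while !stack.is_empty() {
        close_top(&mut stack, &mut roots);
    }
    Ok(roots)
}
```

Rules are attached to their parent only when they are closed, which avoids holding mutable references into a partially built tree.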
Phase 4: Integration
pub fn parse_text_magic_file(input: &str) -> Result<Vec<MagicRule>, ParseError> {
    let lines = preprocess_lines(input)?;
    // Pass the preprocessed lines on directly: build_rule_hierarchy needs each
    // line's level and line number, so it calls parse_magic_rule_line itself
    // rather than receiving already-parsed rules that have lost that information.
    build_rule_hierarchy(lines)
}
Testing Requirements
Unit Tests (Required)
#[cfg(test)]
mod tests {
    // Bring parse_text_magic_file into scope from the parent module.
    use super::*;

    #[test]
    fn test_parse_simple_rule() {
        let input = "0 string PK\\x03\\x04 ZIP archive";
        let rules = parse_text_magic_file(input).unwrap();
        assert_eq!(rules.len(), 1);
        assert_eq!(rules[0].message, "ZIP archive");
    }

    #[test]
    fn test_parse_hierarchical_rules() {
        let input = r#"
0 string \x7fELF ELF
>4 byte 1 32-bit
>4 byte 2 64-bit
"#;
        let rules = parse_text_magic_file(input).unwrap();
        assert_eq!(rules.len(), 1);
        assert_eq!(rules[0].children.len(), 2);
    }

    #[test]
    fn test_parse_comments_and_empty_lines() {
        let input = r#"
# This is a comment
0 string test Test file
"#;
        let rules = parse_text_magic_file(input).unwrap();
        assert_eq!(rules.len(), 1);
    }

    #[test]
    fn test_parse_continuation_lines() {
        let input = "0 string test Long message \\\n continued here";
        let rules = parse_text_magic_file(input).unwrap();
        assert!(rules[0].message.contains("continued"));
    }

    #[test]
    fn test_error_orphaned_child() {
        let input = ">4 byte 1 orphaned";
        assert!(parse_text_magic_file(input).is_err());
    }

    #[test]
    fn test_error_invalid_level_jump() {
        let input = r#"
0 string test Parent
>>>4 byte 1 Invalid jump
"#;
        assert!(parse_text_magic_file(input).is_err());
    }
}
Integration Tests (Recommended)
- Parse actual magic files from third_party/tests/*.magic
- Validate against known-good outputs
- Performance testing with large magic databases
Acceptance Criteria
- parse_text_magic_file function implemented in src/parser/mod.rs
- > prefix works correctly
- cargo clippy -- -D warnings
Dependencies
- Existing parser components in src/parser/grammar.rs
- AST structures in src/parser/ast.rs
- Error types in src/error.rs
Related Work
- Phase 1 MVP completion depends on this parser
- Unblocks evaluator implementation (next major milestone)
- Enables integration testing with real magic files
References
- docs/src/magic-format.md
- third_party/tests/*.magic