Merged
18 commits
9d51e49
chore(deps): update mise.lock format
unclesp1d3r Mar 9, 2026
526aec5
docs: update project overview and development stage to v0.5.0
unclesp1d3r Mar 9, 2026
fcb47c3
feat(evaluator): implement PString type with evaluation and reading t…
unclesp1d3r Mar 9, 2026
e624328
feat(parser): implement PString type with parsing and serialization s…
unclesp1d3r Mar 9, 2026
35eee78
test(parser): add tests for serializing PString type
unclesp1d3r Mar 9, 2026
bf72f19
test(evaluator): add tests for PString rule matching and truncation
unclesp1d3r Mar 9, 2026
1ce2fde
feat(docs): add PString type support to AGENTS.md documentation
unclesp1d3r Mar 9, 2026
b1a22db
feat(docs): update documentation for PString type implementation
unclesp1d3r Mar 9, 2026
0e6dd51
fix: address PR review findings for pstring implementation
unclesp1d3r Mar 9, 2026
3c073c3
Merge branch 'main' into 43-parser-implement-pstring-pascal-string-type
unclesp1d3r Mar 9, 2026
5cc7d5d
fix: simplify parser comment for string type ordering
unclesp1d3r Mar 9, 2026
1bbba3a
docs: document pstring max_length bounds validation semantics
unclesp1d3r Mar 9, 2026
afce876
docs: Dosu updates for PR #170
dosubot[bot] Mar 9, 2026
f9ad879
docs: Dosu updates for PR #170
dosubot[bot] Mar 9, 2026
949b966
docs: Dosu updates for PR #170
dosubot[bot] Mar 9, 2026
5110fae
docs: Dosu updates for PR #170
dosubot[bot] Mar 9, 2026
d30fc62
docs: Dosu updates for PR #170
dosubot[bot] Mar 9, 2026
d021555
docs: Dosu updates for PR #170
dosubot[bot] Mar 9, 2026
92 changes: 74 additions & 18 deletions .github/copilot-instructions.md
@@ -4,11 +4,13 @@

libmagic-rs is a **pure-Rust implementation of libmagic** for file type identification. The project follows a **parser-evaluator architecture** with strict memory safety guarantees and zero unsafe code.

### Development Stage: MVP Phase (v0.1)
### Development Stage: v0.5.0 (Active Development)

- ✅ **Core AST and parser components** are complete with 98 unit tests
- 🔄 **Currently implementing**: Complete magic file rule parsing (`src/parser/mod.rs`)
- 📋 **Next**: Rule evaluation engine (`src/evaluator/`) and output formatters (`src/output/`)
- ✅ **Parser**: Complete with AST structures, grammar parsing, type handling, and hierarchy support
- ✅ **Evaluator**: Fully implemented with offset resolution, type interpretation, operator application, and strength calculation
- ✅ **Output**: Text and JSON formatters with comprehensive metadata
- ✅ **CLI**: Full-featured `rmagic` binary with multiple file support, stdin, built-in rules, and custom magic files
- 🔄 **Currently implementing**: Enhanced type support (regex, search patterns) and indirect/relative offset evaluation

## Architecture Patterns

@@ -22,10 +24,21 @@ Target File → Memory Mapper → File Buffer

### Module Structure (Follow This Pattern)

- **`src/parser/`**: `ast.rs` (complete), `grammar.rs` (nom parsers), `mod.rs` (rule integration)
- **`src/io/`**: Memory-mapped FileBuffer with comprehensive bounds checking (complete)
- **`src/evaluator/`**: Offset resolution, type interpretation, operators (planned)
- **`src/output/`**: Text and JSON formatters (planned)
- **`src/parser/`**: Complete parsing system
- `ast.rs`: AST node definitions (MagicRule, TypeKind, Operator, Value, OffsetSpec)
- `grammar/`: nom-based parser combinators for magic file syntax
- `types.rs`: Type keyword parsing and validation
- `hierarchy.rs`: Hierarchical rule structure handling
- `loader.rs`: Magic file loading and preprocessing
- `codegen.rs`: Serialization for build-time rule compilation
- **`src/evaluator/`**: Rule evaluation engine
- `engine/`: Core evaluation logic and rule matching
- `offset/`: Offset resolution (absolute, from-end; indirect/relative stubs)
- `operators/`: Operator application (equality, comparison, bitwise)
- `types/`: Type interpretation with endianness handling
- `strength.rs`: Confidence scoring and strength modifiers
- **`src/io/`**: Memory-mapped FileBuffer with SafeBufferAccess trait for bounds checking
- **`src/output/`**: Result formatting (text.rs, json.rs) with metadata support

## Critical Development Practices

@@ -77,7 +90,7 @@ pub struct MagicRule {

```bash
cargo check # Fast syntax/type checking (use frequently)
cargo test # Run 98 unit tests (currently all passing)
cargo test # Run 1,068+ unit tests (currently all passing)
cargo nextest run # Faster test execution (preferred)
cargo clippy -- -D warnings # Required - zero warnings policy
cargo fmt # Code formatting
@@ -89,6 +102,7 @@ cargo fmt # Code formatting
- **Property testing**: Use `proptest` for fuzzing-style tests
- **Error case testing**: Validate all `Result<T, E>` error paths
- **Serialization testing**: All AST types use serde, test round-trip
- **Table-driven tests**: Consolidate related test cases with descriptive failure messages
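
A table-driven test in the sense of the last bullet might look like the sketch below. `length_prefix` and `run_length_prefix_table` are hypothetical names invented for illustration, not functions from the crate. The point is the shape: one table of named cases, one loop, and a failure message that identifies the case that broke.

```rust
/// Hypothetical helper for illustration only: returns the first byte of
/// `buf` interpreted as a Pascal-string length, or `None` for an empty buffer.
fn length_prefix(buf: &[u8]) -> Option<usize> {
    buf.first().map(|&b| usize::from(b))
}

/// Runs every case in the table and returns the first descriptive failure.
fn run_length_prefix_table() -> Result<(), String> {
    // Each entry is (case name, input bytes, expected result).
    let cases: [(&str, &[u8], Option<usize>); 3] = [
        ("empty buffer", b"".as_slice(), None),
        ("zero length", b"\x00".as_slice(), Some(0)),
        ("four-byte payload", b"\x04JPEG".as_slice(), Some(4)),
    ];

    for (name, input, expected) in cases {
        let actual = length_prefix(input);
        if actual != expected {
            // The message names the case so a failure is easy to locate.
            return Err(format!(
                "case `{name}`: expected {expected:?}, got {actual:?} for input {input:?}"
            ));
        }
    }
    Ok(())
}
```

In a real test module the loop body would typically use `assert_eq!` with the case name in the message. Returning a `Result` here just keeps the sketch usable outside a `#[test]` function.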

### Performance Focus

@@ -140,15 +154,25 @@ pub enum LibmagicError {

## Magic File Compatibility

### Supported Syntax (Implement in parser/mod.rs)
### Supported Syntax (Currently Implemented in v0.5.0)

- **Offsets**: `0x10`, `(0x20.l+4)`, `&0x30`
- **Types**: `byte`, `short`, `long`, `string`, `regex` with endianness (`be`, `le`)
- **Operators**: `=`, `!=`, `>`, `<`, `&` (bitwise AND), `^` (XOR)
- **Offsets**: Absolute, from-end (indirect and relative are parsed but not yet evaluated)
- **Types**: `byte`, `short`, `long`, `quad`, `float`, `double`, `string`, `pstring` with endianness support; unsigned variants `ubyte`, `ushort`/`ubeshort`/`uleshort`, `ulong`/`ubelong`/`ulelong`, `uquad`/`ubequad`/`ulequad`; float/double endian variants `befloat`/`lefloat`, `bedouble`/`ledouble`; 32-bit date/timestamp types `date`/`ldate`/`bedate`/`beldate`/`ledate`/`leldate`; 64-bit date/timestamp types `qdate`/`qldate`/`beqdate`/`beqldate`/`leqdate`/`leqldate`
- **Operators**: `=` (equal), `!=` (not equal), `<` (less than), `>` (greater than), `<=` (less equal), `>=` (greater equal), `&` (bitwise AND with optional mask), `^` (bitwise XOR), `~` (bitwise NOT), `x` (any value)
- **Nesting**: Hierarchical rules with proper indentation handling
- **String Matching**: Exact string matching with null-termination

Copilot AI Mar 9, 2026

The “Supported Syntax” section lists pstring as a supported type, but the “String Matching” bullet still only mentions null-terminated strings. Updating that bullet to also mention Pascal (length-prefixed) strings would keep the documentation consistent with the implemented feature set.

Suggested change
- **String Matching**: Exact string matching with null-termination
- **String Matching**: Exact string matching for null-terminated (`string`) and Pascal/length-prefixed (`pstring`) strings

- **Directives**: `!:strength` modifier (parsed and applied)

### Planned Features (v1.0+)

- Regex type: Pattern matching with binary-safe regex support
- Search type: Multi-pattern string searching
- Additional directives: `!:mime`, `!:ext`, `!:apple`

### Binary-Safe Regex

> **Note:** The regex type is planned for future releases and is not yet implemented (#39).

```rust
// Use regex crate with bytes feature for binary-safe matching
use regex::bytes::Regex;
@@ -160,15 +184,20 @@ use regex::bytes::Regex;
### Completed (Don't Reimplement)

- ✅ **AST structures** (`src/parser/ast.rs`) - fully tested with serde
- ✅ **Parser components** (`src/parser/grammar.rs`) - numbers, offsets, operators, values
- ✅ **Parser components** (`src/parser/grammar/`) - complete magic file syntax parsing
- ✅ **Type system** (`src/parser/types.rs`) - byte, short, long, quad, float, double, string, pstring, date types
- ✅ **File I/O** (`src/io/mod.rs`) - memory-mapped FileBuffer with bounds checking
- ✅ **CLI framework** (`src/main.rs`) - clap-based argument parsing
- ✅ **CLI framework** (`src/main.rs`) - clap-based argument parsing with JSON output
- ✅ **Evaluator engine** (`src/evaluator/`) - complete rule evaluation with strength calculation
- ✅ **Output formatters** (`src/output/`) - text and JSON formatters with metadata

### Active Development (Contribute Here)

- 🔄 **Rule parsing** (`src/parser/mod.rs`) - integrate components into complete rules
- 📋 **Evaluator engine** (`src/evaluator/mod.rs`) - offset resolution, type interpretation
- 📋 **Output formatters** (`src/output/mod.rs`) - text and JSON result formatting
- 🔄 **Indirect offsets** (`src/evaluator/offset/indirect.rs`) - stub exists, needs implementation (#37)
- 🔄 **Relative offsets** (`src/evaluator/offset/relative.rs`) - stub exists, needs implementation (#38)
- 📋 **Regex type** - planned for future release (#39)
- 📋 **Search type** - planned for future release (#39)
- ✅ **Pascal strings** - implemented (#43)

## Code Quality Enforcement

@@ -206,4 +235,31 @@ pedantic = { level = "warn", priority = -1 }
- **FileBuffer → Evaluator**: Safe buffer access through trait methods
- **Results → Output**: Structured match results for formatters

## Common Tasks and Patterns

### Adding New Type Support

> **Note:** Currently implemented types are `Byte`, `Short`, `Long`, `Quad`, `Float`, `Double`, `String`, `PString`, and date/timestamp variants. Regex and other advanced types are planned for future releases.

1. Extend `TypeKind` enum in `src/parser/ast.rs`
2. Add keyword parsing in `src/parser/types.rs` (`parse_type_keyword` and `type_keyword_to_kind`)
3. Add value/operator parsing in `src/parser/grammar/mod.rs` if needed
4. Implement reading logic in `src/evaluator/types/` submodules
5. Update `serialize_type_kind()` in `src/parser/codegen.rs`
6. Add tests for the new type
7. Update documentation
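
As a rough illustration of steps 1, 2, and 5, the sketch below uses a cut-down `TypeKind` with only the two string variants. `keyword_to_kind` and `serialize_kind` are simplified stand-ins written for this example; the real `type_keyword_to_kind` and `serialize_type_kind()` may have different signatures and output.

```rust
/// Simplified stand-in for the crate's `TypeKind`; only two variants shown.
#[derive(Debug, Clone, PartialEq, Eq)]
enum TypeKind {
    String { max_length: Option<usize> },
    PString { max_length: Option<usize> },
}

/// Illustrative keyword lookup (assumed shape, not the crate's function).
/// A new type keyword gets one new match arm here.
fn keyword_to_kind(keyword: &str) -> Option<TypeKind> {
    match keyword {
        "string" => Some(TypeKind::String { max_length: None }),
        // Arm added when introducing the `pstring` keyword.
        "pstring" => Some(TypeKind::PString { max_length: None }),
        _ => None,
    }
}

/// Illustrative serializer that emits Rust source text for a variant.
/// The match is exhaustive, so adding a variant without updating this
/// function is a compile error rather than a silent omission.
fn serialize_kind(kind: &TypeKind) -> String {
    match kind {
        TypeKind::String { max_length } => {
            format!("TypeKind::String {{ max_length: {max_length:?} }}")
        }
        TypeKind::PString { max_length } => {
            format!("TypeKind::PString {{ max_length: {max_length:?} }}")
        }
    }
}
```

Keeping every `match` on `TypeKind` free of a wildcard arm is what lets the compiler point at each place that still needs updating after step 1.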

### Adding New Operators

> **Note:** Currently implemented operators are `Equal`, `NotEqual`, `LessThan`, `GreaterThan`, `LessEqual`, `GreaterEqual`, `BitwiseAnd` (with `BitwiseAndMask`), `BitwiseXor`, `BitwiseNot`, and `AnyValue`.

1. Extend `Operator` enum in `src/parser/ast.rs`
2. Add parsing logic in `src/parser/grammar/mod.rs`
3. Implement operator logic in `src/evaluator/operators/` submodule
4. Update `serialize_operator()` in `src/parser/codegen.rs`
5. Update strength calculation match in `src/evaluator/strength.rs`
6. Update `arb_operator()` in `tests/property_tests.rs`
7. Add tests for the new operator
8. Update documentation

This guide ensures AI agents understand the project's strict safety requirements, current development focus, and established patterns for immediate productivity.
10 changes: 5 additions & 5 deletions AGENTS.md
@@ -208,15 +208,14 @@ cargo test --doc # Test documentation examples
### Currently Implemented (v0.1.0)

- **Offsets**: Absolute and from-end specifications (indirect and relative are parsed but not yet evaluated)
- **Types**: `byte`, `short`, `long`, `quad`, `float`, `double`, `string` with endianness support; unsigned variants `ubyte`, `ushort`/`ubeshort`/`uleshort`, `ulong`/`ubelong`/`ulelong`, `uquad`/`ubequad`/`ulequad`; float/double endian variants `befloat`/`lefloat`, `bedouble`/`ledouble`; 32-bit date/timestamp types `date`/`ldate`/`bedate`/`beldate`/`ledate`/`leldate`; 64-bit date/timestamp types `qdate`/`qldate`/`beqdate`/`beqldate`/`leqdate`/`leqldate`; date values formatted as `"Www Mmm DD HH:MM:SS YYYY"` matching GNU `file` output; types are signed by default (libmagic-compatible)
- **Types**: `byte`, `short`, `long`, `quad`, `float`, `double`, `string`, `pstring` with endianness support; unsigned variants `ubyte`, `ushort`/`ubeshort`/`uleshort`, `ulong`/`ubelong`/`ulelong`, `uquad`/`ubequad`/`ulequad`; float/double endian variants `befloat`/`lefloat`, `bedouble`/`ledouble`; 32-bit date/timestamp types `date`/`ldate`/`bedate`/`beldate`/`ledate`/`leldate`; 64-bit date/timestamp types `qdate`/`qldate`/`beqdate`/`beqldate`/`leqdate`/`leqldate`; `pstring` is a Pascal string (a one-byte length prefix followed by string data); date values formatted as `"Www Mmm DD HH:MM:SS YYYY"` matching GNU `file` output; types are signed by default (libmagic-compatible)
- **Operators**: `=` (equal), `!=` (not equal), `<` (less than), `>` (greater than), `<=` (less equal), `>=` (greater equal), `&` (bitwise AND with optional mask), `^` (bitwise XOR), `~` (bitwise NOT), `x` (any value)
- **Nested Rules**: Hierarchical rule evaluation with proper indentation
- **String Matching**: Exact string matching with null-termination
- **String Matching**: Exact string matching with null-termination and Pascal string (length-prefixed) support

### Planned Features (v1.0+)

- Regex type: Pattern matching with binary-safe regex support
- Additional types: pascal strings
- Search type: Multi-pattern string searching

### Future Enhancement: Binary-Safe Regex Handling
@@ -240,7 +239,8 @@ impl BinaryRegex for regex::bytes::Regex {

- No regex/search pattern matching
- 64-bit integer types: `quad`/`uquad`, `bequad`/`ubequad`, `lequad`/`ulequad` are implemented; `qquad` (128-bit) is not yet supported
- String evaluation reads until first NUL or end-of-buffer by default; `max_length: Some(_)` is supported internally but no dedicated fixed-length string parser syntax exists yet
- String evaluation reads until first NUL or end-of-buffer by default; `pstring` reads a length-prefixed Pascal string; `max_length: Some(_)` is supported internally but no dedicated fixed-length string parser syntax exists yet
- `pstring` only supports the default 1-byte length prefix (`/B`); multi-byte length prefix variants (`pstring/H` for 2-byte, `pstring/L` for 4-byte) are not yet implemented

### Operators

@@ -321,7 +321,7 @@ sample.bin: ELF 64-bit LSB executable, x86-64, version 1 (SYSV)

### Adding New Type Support

> **Note:** Currently implemented types are `Byte`, `Short`, `Long`, `Quad`, and `String`. Regex and other advanced types are planned for future releases.
> **Note:** Currently implemented types are `Byte`, `Short`, `Long`, `Quad`, `Float`, `Double`, `Date`, `QDate`, `String`, and `PString`. Regex and search types are planned for future releases.

1. Extend `TypeKind` enum in `src/parser/ast.rs`
2. Add keyword parsing in `src/parser/types.rs` (`parse_type_keyword` and `type_keyword_to_kind`)
15 changes: 13 additions & 2 deletions docs/MAGIC_FORMAT.md
@@ -170,7 +170,7 @@ Examples:
8 uquad >0x8000000000000000 (unsigned 64-bit check)
```

### String Type
### String Types

Match literal string data:

@@ -186,6 +186,16 @@ String escape sequences:
- `\t` - tab
- `\\` - backslash

**Pascal String (pstring)**

Length-prefixed string type where the first byte contains the string length (0-255), followed by that many bytes of string data. Unlike C strings, Pascal strings are not null-terminated.

```text
0 pstring =JPEG JPEG image (Pascal string)
```

The length byte value determines how many bytes to read for the string data. If `max_length` is specified in the magic file (not shown in the basic syntax), it caps the length byte value to prevent reading excessive data.
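
The reading rule described above can be sketched as follows. `read_pstring` is a hypothetical free function written for this example, not the crate's evaluator code. It also assumes that an out-of-bounds length byte or payload simply yields `None`; the real implementation may report an error or truncate instead.

```rust
/// Reads a Pascal string at `offset`: one length byte, then that many bytes.
/// Returns `None` when the length byte or the payload falls outside `buf`.
/// `max_length`, when present, caps the value taken from the length byte.
fn read_pstring(buf: &[u8], offset: usize, max_length: Option<usize>) -> Option<&[u8]> {
    // Bounds-checked read of the length byte.
    let declared = usize::from(*buf.get(offset)?);

    // Apply the optional cap to the declared length.
    let len = match max_length {
        Some(cap) => declared.min(cap),
        None => declared,
    };

    // Checked arithmetic avoids overflow for offsets near `usize::MAX`.
    let start = offset.checked_add(1)?;
    let end = start.checked_add(len)?;

    // Bounds-checked read of the payload; the length byte is not included.
    buf.get(start..end)
}
```

For the bytes `\x04JPEG` at offset 0, the function returns the four payload bytes `JPEG`. With `max_length` of `Some(2)` it returns only `JP`.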

### String Flags

| Flag | Description |
@@ -525,7 +535,7 @@ Consider:
- Relative offsets
- Indirect offsets (basic)
- Byte, short, long, quad types (8-bit, 16-bit, 32-bit, 64-bit integers)
- String type
- String types (`string`, `pstring`)
- Date and timestamp types (32-bit and 64-bit Unix timestamps)
- Comparison operators (`=`, `!`, `<`, `>`, `<=`, `>=`)
- Bitwise AND operator
@@ -542,6 +552,7 @@ Consider:

### Recently Added

- **Pascal string type**: `pstring` for length-prefixed strings
- **Date/timestamp types**: `date` (32-bit) and `qdate` (64-bit) Unix timestamp types
- **Comparison operators**: Full support for `<`, `>`, `<=`, `>=` operators
- **Strength modifiers**: The `!:strength` directive for adjusting rule priority
1 change: 1 addition & 0 deletions docs/src/api-reference.md
@@ -229,6 +229,7 @@ use libmagic_rs::TypeKind;
| `Float { endian }` | 32-bit IEEE 754 floating-point (added in v0.5.0) |
| `Double { endian }` | 64-bit IEEE 754 double-precision floating-point (added in v0.5.0) |
| `String { max_length }` | String data (discriminant changed from 4 to 6 in v0.5.0) |
| `PString { max_length }` | Pascal string: a one-byte length prefix followed by string data (returns `Value::String`) |

### Operator

1 change: 1 addition & 0 deletions docs/src/architecture.md
@@ -95,6 +95,7 @@ pub enum TypeKind {
Long { endian: Endianness, signed: bool },
Quad { endian: Endianness, signed: bool },
String { max_length: Option<usize> },
PString { max_length: Option<usize> }, // Pascal string (length-prefixed)
}

pub enum Operator {
39 changes: 38 additions & 1 deletion docs/src/ast-structures.md
@@ -184,6 +184,9 @@ pub enum TypeKind {

/// String data
String { max_length: Option<usize> },

/// Pascal string (length-prefixed)
PString { max_length: Option<usize> },
}
```

@@ -226,6 +229,39 @@ let string_type = TypeKind::String {
};
```

### PString (Pascal String)

Pascal-style length-prefixed strings where the first byte contains the string length.

**Structure:**
- Length byte: 1 byte indicating string length (0-255)
- String data: The number of bytes specified by the length byte

**Example:**
```text
0 pstring JPEG
```
Reads one byte as length, then reads that many bytes as a string.

**Behavior:**
- Returns `Value::String` containing the string data (without the length prefix)
- Performs bounds checking on both the length byte and the string data
- Supports all string comparison operators

**Usage:**

```rust
// Pascal string with no length limit
let pstring_type = TypeKind::PString {
max_length: None
};

// Pascal string with maximum 64-byte limit
let limited_pstring = TypeKind::PString {
max_length: Some(64)
};
```

### Endianness Options

```rust
@@ -419,7 +455,8 @@ let script_rule = MagicRule {
1. **Use `Byte { signed }`** for single-byte values and flags, specifying signedness
2. **Use `Short/Long/Quad`** with explicit endianness and signedness for multi-byte integers
3. **Use `String`** with length limits for text patterns
4. **Use `Bytes`** for exact binary sequences
4. **Use `PString`** for Pascal-style length-prefixed strings
5. **Use `Bytes`** for exact binary sequences

### Performance Considerations

18 changes: 16 additions & 2 deletions docs/src/magic-format.md
@@ -209,7 +209,7 @@ Examples:
16 leqldate x \b, timestamp %s
```

### String Type
### String Types

Match literal string data:

@@ -225,6 +225,20 @@ String escape sequences:
- `\t` - tab
- `\\` - backslash

### Pascal String Type

Pascal string (pstring) is a length-prefixed string type. The first byte contains the string length (0-255), followed by that many bytes of string data. Unlike C strings, Pascal strings are not null-terminated.

```text
0 pstring =JPEG JPEG image (Pascal string)
```

The evaluator reads the length byte, then reads that many bytes as string data; an optional `max_length` caps the length byte value. The `x` operator matches any Pascal string and substitutes its contents into the message:

```text
0 pstring x \b, name: %s
```

### String Flags (Not Yet Implemented)

> **Note:** String flags are documented for libmagic compatibility reference but are not yet implemented in libmagic-rs.
@@ -535,7 +549,7 @@ Consider:
- Byte, short, long, quad types (8-bit, 16-bit, 32-bit, 64-bit integers)
- Float and double types (32-bit and 64-bit IEEE 754 floating-point)
- Date and qdate types (32-bit and 64-bit Unix timestamps)
- String type
- String and pstring types (null-terminated and length-prefixed strings)
- Comparison operators (equal, not-equal, less-than, greater-than, less-equal, greater-equal)
- Bitwise AND operator
- Nested rules
40 changes: 40 additions & 0 deletions docs/src/parser.md
@@ -184,6 +184,46 @@ Parsed literals are stored as `Value::Float(f64)` in the AST, regardless of whet

**Note:** Float and double types do **not** have signed/unsigned variants. IEEE 754 handles sign internally via the sign bit, so all float types use a single `TypeKind` variant with only an `endian` field (no `signed: bool` field).

### Pascal String (pstring) Type

The parser supports Pascal-style length-prefixed strings through the `pstring` keyword:

**Type Keyword:**

- `pstring` - Length-prefixed string (1-byte length + string data) → `TypeKind::PString { max_length: None }`

**Format:**

Pascal strings store the length as the first byte (0-255), followed by that many bytes of string data. Unlike C strings, they are not null-terminated.

**Parser Implementation:**

- Recognized by `parse_type_keyword()` in `src/parser/types.rs`
- Maps to `TypeKind::PString` in the AST
- Evaluator reads length prefix byte then that many bytes as string data
- Stored as `Value::String` for comparison with string operators
- Supports optional `max_length` field to cap the length byte value

**Usage in Magic Rules:**

```text
# Basic pstring matching
# Match if the pstring equals "Hello"
0 pstring =Hello
# Match any pstring value
0 pstring x

# With max_length constraint (parsed separately)
# Limit string read to 64 bytes
0 pstring/64 x
```

**Features:**

- ✅ Single type keyword `pstring`
- ✅ Length-prefixed format (1 byte length, 0-255 bytes data)
- ✅ Bounds checking for both length byte and string data
- ✅ UTF-8 validation with replacement character for invalid sequences
- ✅ Optional `max_length` parameter to limit string reads
- ✅ String comparison operators work with pstring values
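
The UTF-8 bullet above corresponds to the standard library's lossy conversion. The sketch below shows that behavior in isolation; `payload_to_string` is a hypothetical helper name, and whether the crate calls `String::from_utf8_lossy` directly is an assumption.

```rust
/// Converts raw pstring payload bytes to text. Invalid UTF-8 sequences are
/// replaced with U+FFFD (the Unicode replacement character) instead of
/// causing an error.
fn payload_to_string(payload: &[u8]) -> String {
    String::from_utf8_lossy(payload).into_owned()
}
```

Valid input passes through unchanged, so a payload of `JPEG` becomes the string `"JPEG"`, which can then be compared with the usual string operators.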

### Date and Timestamp Types

The parser supports date and timestamp types for parsing Unix timestamps (signed seconds since epoch). There are 12 type keywords: