Skip to content

feat(evaluator): regex and search types (closes #39)#214

Merged
unclesp1d3r merged 8 commits into
mainfrom
39-parser-implement-regex-and-search-types
Apr 11, 2026
Merged

feat(evaluator): regex and search types (closes #39)#214
unclesp1d3r merged 8 commits into
mainfrom
39-parser-implement-regex-and-search-types

Conversation

@unclesp1d3r
Copy link
Copy Markdown
Member

Summary

Implements TypeKind::Regex and TypeKind::Search for libmagic rule evaluation, closing the full acceptance criteria for #39. This is the complete implementation including the spec-compliance follow-up work that was flagged in pre-PR review.

Highlights:

  • Regex type: binary-safe matching via regex::bytes::Regex with the full /c, /s, /l flag set plus optional numeric count. Scan window capped at 8192 bytes (FILE_REGEX_MAX). Multi-line mode is unconditional (matching libmagic's REG_NEWLINE).
  • Search type: bounded literal scan via memchr::memmem::find with mandatory NonZeroUsize range (per GNU file magic(5)). Anchor advances to match-end, matching softmagic.c::FILE_SEARCH.
  • RegexFlags struct: case_insensitive, start_offset, line_based — designed as a struct so future flag additions don't cascade through 10+ exhaustive-match sites.
  • Engine dispatch: pattern-bearing types bypass apply_operator and drive Equal/NotEqual from a structured Option<Value> returned by read_pattern_match. Non-equality operators on pattern types are rejected as TypeReadError::UnsupportedType rather than silently producing lexicographic comparisons against the pattern literal.
  • Zero-width match correctness: regex matches of zero bytes (^, a*, lookaheads) are preserved as Some(Value::String("")) and distinguished from genuine misses (None).

Ground truth for every non-obvious semantic was verified against GNU file's src/softmagic.c before implementing (FILE_REGEX, FILE_SEARCH, moffset(), alloc_regex, parse_string_modifier, FILE_REGEX_MAX).

Issue #39 Acceptance Criteria

# Criterion Status
1 TypeKind extended with Regex/Search variants
2 Parser recognizes regex/search + flags (/c, /s, /l, count)
3 Evaluator uses regex::bytes::Regex for binary-safe matching
4 Search supports range-limited byte sequence searching ✅ mandatory NonZeroUsize
5 Common regex flags: case-insensitive + line-start/end /c + /l + /s
6 Unit tests for regex/search matching ✅ ~50 new tests
7 Integration tests with text file detection rules ✅ searchbug, json1, jsonlines1, cmd1, gedcom, regex-eol

Test Plan

  • mise exec -- just ci-check green (run locally: 1207/1207 tests pass, clippy clean, fmt clean, doctests pass)
  • CodeRabbit review clean
  • Manual smoke: echo '{"ok": true}' | cargo run -- --use-builtin - still detects JSON via built-ins (regex rules not added to built-in set, so this should still fall back to string match behavior)
  • Verify corpus tests run against third_party/tests fixtures on a fresh checkout
  • Confirm regex-eol.magic scenario passes end-to-end (exercised by tests/evaluator_tests.rs::test_regex_eol_corpus which was rewritten for the new flag semantics)
  • Spot-check that regex/1l and regex/l1 both parse identically (per GNU file "interleaved order, last range wins")
  • Confirm search/0 and bare search are parse errors (not silent acceptance)
  • Confirm regex/Ns-style non-equality operators are rejected (regex < "[0-9]+" should error, not match via lexicographic comparison)

Notes for Reviewers

  • The branch history includes an earlier "partial" foundation commit (cb02472) followed by the completion commit (7294fd6). The partial commit shipped a working shell with known spec gaps; the completion commit closes those gaps (the V1V5 items flagged in internal review).
  • GOTCHAS.md sections 2.4–2.9 document every non-obvious invariant added by this PR, with citations to the corresponding GNU file source locations. Start there when reviewing semantics questions.
  • The new compute_window helper in src/evaluator/types/regex.rs is the load-bearing piece of V1 — it's where the 8192-byte cap, line-based counting, and CRLF handling all converge. The unit tests at the bottom of that file cover each code path explicitly.
  • read_pattern_match is the engine-facing entry point for pattern types; read_typed_value_with_pattern is kept as a back-compat wrapper that collapses None to Value::String(""). Future pattern-type work should go through read_pattern_match.

Closes #39

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Implements pattern-based file matching via `TypeKind::Regex` and
`TypeKind::Search`, a foundation for text file detection (JSON, XML,
shell scripts, etc.).

Parser:
* Add `Regex { case_insensitive, start_of_line }` and `Search { range }`
  variants to `TypeKind` in `parser/ast.rs`. Extend `parse_type_keyword`,
  `type_keyword_to_kind`, grammar suffix parsing (`regex/c`, `regex/l`,
  `regex/cl`, `search/N`), `serialize_type_kind`, and strength modifiers.
* Update `arb_type_kind` in `tests/property_tests.rs` for the fuzz
  round-trip of the new variants.
* Grammar flag scanner uses `strip_prefix` instead of direct `&rest[1..]`
  slicing per AGENTS.md memory-safety rule.

Evaluator:
* `src/evaluator/types/regex.rs` wraps `regex::bytes::RegexBuilder` for
  binary-safe matching. `/c` enables case-insensitive, `/l` wraps the
  pattern in `^(?:...)` with multi-line mode (placeholder for the
  libmagic `/l` line-count semantics tracked for v0.3).
* `src/evaluator/types/search.rs` uses `memchr::memmem::find` for
  bounded literal scanning.
* Both readers return `Result<Option<Value>, TypeReadError>`: `None`
  is a genuine miss, `Some(v)` is a match (including legitimate
  zero-width regex matches like `^`, `a*`, lookaheads).
* Add `read_typed_value_with_pattern` and `read_pattern_match` dispatch
  functions in `types/mod.rs`. The engine calls `read_pattern_match`
  directly to preserve the `Option` discriminant; `read_typed_value`
  remains a 3-arg convenience wrapper forwarding `pattern: None`.
* `bytes_consumed_with_pattern` threads the pattern through to
  `regex_bytes_consumed` and `search_bytes_consumed`. The search arm
  re-runs `memchr::memmem::find` and returns `match_idx + pattern.len()`
  (match-end), matching GNU `file` softmagic.c FILE_SEARCH semantics.
  An earlier revision returned the full window size, which silently
  corrupted relative-offset children of every successful search match.
* Add `debug_assert!` guards on the silent-0 fallback paths in
  `regex_bytes_consumed`, `search_bytes_consumed`, and the Regex/Search
  arms of `bytes_consumed_with_pattern` so engine-invariant violations
  are caught loudly in dev/test builds.

Engine:
* `evaluate_single_rule_with_anchor` splits into two arms. Pattern-
  bearing types (`Regex`, `Search`) go through `read_pattern_match` and
  drive `Equal`/`NotEqual` from the `Option` discriminant directly.
  Non-equality operators on pattern types return
  `TypeReadError::UnsupportedType` instead of silently falling through
  to `apply_operator`, which would have produced nonsense lexicographic
  comparisons against the pattern source text.
* Non-pattern types continue through the existing
  `read_typed_value_with_pattern` + `coerce_value_to_type` +
  `apply_operator` path.

Dependencies:
* Move `regex = "1.12.3"` from dev-dependencies to dependencies. The
  `bytes` feature was dropped in v1.12; `regex::bytes` is unconditional
  so no feature flag is needed.

Note: This commit covers the core feature. It does NOT yet implement
the `/s` (start-offset) flag, the libmagic `/l` line-count semantics,
the `regex/Nl` numeric prefix, `NonZeroUsize` search ranges, or the
8192-byte default regex scan window. Those are tracked for a follow-up
against issue #39 before closing it.

Refs #39

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Regression guards for the three silent-failure classes flagged in code
review:

bytes_consumed dispatch (GOTCHAS 2.1):
* `test_bytes_consumed_regex_with_string_pattern` and
  `test_bytes_consumed_regex_no_match_returns_zero` cover the Regex arm
  of `bytes_consumed_with_pattern`. Before this test the only coverage
  of the Regex arm was transitive through the corpus test and did not
  assert numeric anchor positions.
* `test_bytes_consumed_regex_zero_width_match_returns_zero` guards the
  interaction between the C3 fix (zero-width matches distinguished from
  misses) and the anchor advance: a zero-width match returns `m.end() == 0`
  so the anchor does not move.
* `test_bytes_consumed_search_with_pattern_is_match_end` is the
  regression guard for the pre-fix behavior that returned the full
  search window size instead of `match_idx + pattern.len()`. Buffer
  `"abcWorld_xyz"` with `search/10 "World"` must advance the anchor by
  8, not by 10.
* `test_bytes_consumed_search_bytes_pattern_works` exercises the
  `Value::Bytes` pattern dispatch path that was previously accepted by
  the reader but untested.

Engine-level Relative(N) after pattern types:
* `test_regex_parent_advances_anchor_for_relative_child` and
  `test_search_parent_advances_anchor_to_match_end_not_window_end`
  construct parent/child rule trees and assert that a `Relative(0)`
  child reads the byte immediately after the parent's match. This is
  the end-to-end test that would have caught the search window-vs-
  match-end bug -- the existing `test_regex_eol_corpus` passes on
  substring assertions and would not have detected a wrong anchor
  position.
* `test_search_parent_relative_child_at_positive_offset` exercises
  `Relative(+1)` to confirm the match-end base is correct (not
  match-start, not window-end).

Non-equality operator rejection:
* `test_regex_rule_with_ordering_operator_is_rejected` and
  `test_search_rule_with_bitwise_operator_is_rejected` verify the
  engine returns `EvaluationError::UnsupportedType` when a
  pattern-bearing rule uses `<`, `>`, `&`, `^`, `~`, etc. Previously
  those fell through to `apply_operator` and silently produced
  lexicographic comparisons against the pattern literal -- a
  false-positive factory for any regex with metacharacters.

Zero-width regex match coverage (in `src/evaluator/types/regex.rs`):
* `test_read_regex_zero_width_start_anchor_matches` asserts `^` returns
  `Some(Value::String(""))`, not `None`. The pre-fix engine flattened
  both cases into a miss via `is_empty()`.
* `test_read_regex_zero_width_star_matches_empty` does the same for the
  `a*` pattern against a buffer with no `a` bytes.

Refs #39

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
AGENTS.md:
* Move regex and search from "Planned Features" / "Current Limitations:
  No regex/search" to "Currently Implemented". The prior state had six
  separate stale references that directly contradicted the
  implementation.
* Add an explicit "Current Limitations" entry for the `/l` flag
  (implemented as a multi-line `^` anchor, not the libmagic line-count
  semantic), the optional `search` range (should be mandatory per GNU
  file), and the missing 8192-byte default regex scan window. These
  are the remaining #39 acceptance-criteria gaps, tracked for v0.3.
* Remove the speculative `BinaryRegex` trait sketch -- the PR ships
  binary-safe matching via `regex::bytes` directly, not via a wrapper
  trait.
* Update "Key Dependencies" to reflect `regex` as a runtime dependency
  and add `memchr` to the list.
* Update the "Adding New Type Support" note to include Regex and
  Search in the implemented-types list.

GOTCHAS.md:
* Rewrite section 2.4 to describe the `read_pattern_match` ->
  `Option<Value>` contract and the hard rejection of non-equality
  operators on pattern-bearing types (previously the fallthrough to
  `apply_operator` was tolerated).
* Add section 2.5: zero-width regex match vs miss invariant. The
  `Option` in `read_regex`/`read_search` is load-bearing -- do not
  re-flatten to `Value::String("")` sentinels.
* Add section 2.6: search anchor advance is match-end (per GNU `file`
  softmagic.c FILE_SEARCH), not window-end. Documents the prior bug
  and links the ground-truth analysis.

docs/src/evaluator.md:
* Fix the `read_typed_value` signature documentation. The actual code
  has a 3-arg `read_typed_value` convenience wrapper alongside the
  4-arg `read_typed_value_with_pattern`; the doc previously showed
  only the 4-arg form with the wrong name.
* Add Regex and Search to the enumerated type list.
* Document the `read_pattern_match` entry point used by the engine for
  pattern-bearing types and explain why it returns `Option<Value>`.

docs/solutions/integration-issues/implementing-variable-width-typekind-variant.md:
* Correct the dispatch description: the implementation adds a sibling
  `read_typed_value_with_pattern` alongside the existing 3-arg
  `read_typed_value`, rather than extending the existing signature.
  The "Prevention" section previously prescribed the opposite of what
  the code did.
* Add the lesson learned about sibling functions vs signature
  extensions (narrow new concerns favor siblings to avoid touching
  ~30 existing call sites).
* Add the lesson learned about not overloading `Value::String("")` as
  a miss sentinel (zero-width matches are valid and distinct).
* Add the lesson learned about search advancing by match-end
  (match_idx + pattern.len()), not window-end, per GNU file.
* Add the lesson learned about rejecting non-equality operators on
  pattern-bearing types rather than falling through to apply_operator.

Refs #39

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Implements the full set of regex/search flags and semantics specified
in the issue #39 acceptance criteria and GNU `file` magic(5). This is
the acceptance-criteria completion pass on top of the earlier
foundation commit.

TypeKind shape redesign:

* Add `RegexFlags { case_insensitive, start_offset, line_based }`
  struct to `src/parser/ast.rs`. The previous bare-`bool` fields on
  `TypeKind::Regex` are replaced by the flags struct; this matches the
  design prescribed in the issue #39 plan and keeps the blast radius
  of future flag additions to a single field.
* `TypeKind::Regex` becomes `{ flags: RegexFlags, count: Option<NonZeroU32> }`.
  The count is the numeric prefix from `regex/N[flags]` (e.g., `1l`
  means count=1 with line_based=true). `None` means "use the default".
* `TypeKind::Search::range` changes from `Option<usize>` to
  `NonZeroUsize`. Per GNU `file` magic(5), the search range is
  mandatory; this lifts the constraint into the type system so bare
  `search` and `search/0` are unrepresentable. The parser now hard-
  errors on both.
* Add `REGEX_MAX_BYTES = 8192` constant to `ast.rs` matching GNU
  `file`'s `FILE_REGEX_MAX`.

Parser:

* `src/parser/grammar/mod.rs` accepts interleaved count/flag order
  (both `regex/1l` and `regex/l1` parse to the same value), matching
  libmagic's `parse_string_modifier` "last range wins" behavior.
* Add `/s` flag parsing (start_offset).
* Reject bare `search` and `search/0` at parse time via `NonZeroUsize`.
* Reject `regex/0` -- a zero count has no valid semantics in our model
  since `None` already signals "use the default".
* `strip_prefix` instead of direct `&rest[1..]` slicing per the
  AGENTS.md memory-safety rule.

Evaluator (src/evaluator/types/regex.rs):

* Rewrite `compute_window` to apply the 8192-byte cap unconditionally
  and implement line-based scanning. When `flags.line_based` is set,
  `count` is a line count; the scan walks line terminators (`\n` and
  `\r\n`, each counting as one) until the Nth terminator or the 8192
  cap, whichever comes first. This matches GNU `file`'s
  `softmagic.c::mget_count` / `FILE_REGEX` path.
* Remove the old `/l`-as-`^(?:...)`-anchor wrapping. `build_regex`
  now unconditionally sets `multi_line(true)` and
  `dot_matches_new_line(false)` for every pattern, matching libmagic's
  unconditional `REG_NEWLINE` in `alloc_regex` (softmagic.c:2123).
  `/l` no longer touches the compiled regex -- it only bounds the
  scan window. `^` and `$` match at line boundaries for every regex
  rule regardless of `/l`.
* `read_regex` now takes `(flags: RegexFlags, count: Option<NonZeroU32>)`
  and uses `compute_window` for the scan slice.
* `regex_bytes_consumed` implements the `/s` flag: when
  `flags.start_offset` is set, return `m.start()` (match-start anchor
  advance) instead of `m.end()` (match-end). This matches libmagic's
  `REGEX_OFFSET_START` / `moffset()` logic where `vlen = 0` zeros the
  match-length contribution.
* 8192-byte cap applies to all regex scans including explicit counts
  larger than 8192. This is a DoS mitigation -- without it, a crafted
  regex against a multi-GB buffer combined with
  `EvaluationConfig::default()` (no timeout) can hang the evaluator.

Evaluator (src/evaluator/types/search.rs):

* `read_search` and `search_bytes_consumed` take `NonZeroUsize`
  instead of `Option<usize>`.

Evaluator (src/evaluator/types/mod.rs):

* Dispatch updated for the new Regex/Search variant shapes in
  `read_typed_value_with_pattern`, `read_pattern_match`, and
  `bytes_consumed_with_pattern`. The regex arm passes `flags` and
  `count` through; the search arm unwraps the `NonZeroUsize`.

Strength calculation:

* Search unconditionally gets the "constrained match" bonus (25)
  since the range is always bounded.
* Regex gets the bonus when `count` is `Some` (user specified an
  explicit scan limit) and the base 20 otherwise.

Tests:

* `src/evaluator/types/regex.rs::tests`:
  - Line-based window (1-line cap, 3-line cap, CRLF terminators)
  - 8192-byte cap on default scans
  - 8192-byte cap applied to explicit counts larger than 8192
  - Small explicit counts honored
  - `/s` flag returns match-start anchor advance
  - Multi-line regex matching always on (the old "start of line
    anchor" tests are replaced by explicit multi-line/`.` behavior
    tests)
* `src/evaluator/types/search.rs::tests`: all tests updated for
  `NonZeroUsize` signature; unbounded-range tests removed.
* `src/parser/grammar/tests/mod.rs`:
  - `/s` flag accepted
  - `regex/cs`, `regex/csl` multi-flag combinations accepted
  - Interleaved count/flag order (`regex/1l` and `regex/l1`)
  - `search/0` rejected at parse time
  - `regex/0` rejected at parse time
* `src/evaluator/engine/tests.rs`, `tests/evaluator_tests.rs`,
  `tests/property_tests.rs`: all rule constructions updated for the
  new TypeKind shapes.

Refs #39

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
…iles

Adds end-to-end regex/search tests against the corpus files listed as
"blocked" in issue #39. These exercise the full evaluator pipeline on
real third_party/tests fixtures instead of hand-crafted buffers.

Tests added (tests/regex_search_corpus_tests.rs):

* `test_searchbug_corpus_search_with_relative_child` -- models the
  searchbug.magic scenario (which uses `use`/`name`/`offset`/`&0`
  features we do not yet parse) via programmatic rule construction.
  Verifies that a `search/12 "ABC"` scan followed by a `Relative(0)`
  byte child reads the character immediately after the matched
  pattern, exercising the match-end anchor advance end-to-end.
* `test_searchbug_search_anchor_advance_not_window_end` -- regression
  guard using the same corpus file to assert that the anchor advances
  by match-end (8 + 3 = 11) and NOT by window-end (8 + 12 = 20). A
  byte child expecting the window-end byte value must NOT match.
* `test_json1_corpus_detected_by_regex` -- JSON detection via
  `^\s*[\{\[]` regex against `json1.testfile`.
* `test_jsonlines1_corpus_detected_by_regex` -- same pattern against
  `jsonlines1.testfile`.
* `test_cmd1_corpus_detected_by_regex` -- shell script detection via
  `^#![ \t]*/\S+` against `cmd1.testfile`.
* `test_gedcom_corpus_detected_by_line_based_regex` -- exercises the
  `/l` line-based scan window (count = 1) with a `^0 HEAD` pattern
  against `gedcom.testfile`.
* `test_regex_eol_version_extraction` -- smoke test for the simpler
  non-hierarchical part of the regex-eol scenario.
* `test_corpus_files_exist` -- meta-guard so a broken third_party/
  checkout fails loudly with file-existence errors instead of
  confusing evaluation failures.

Where a corpus file depends on magic-file features we do not yet
support (`use`/`name`/`offset`/`&+N` parser), the test bypasses
`parse_text_magic_file` and builds the equivalent rule tree
programmatically via the AST, matching the pattern documented in
GOTCHAS 3.9. This gets issue #39's "blocked test corpus" coverage to
7 of 14 files (the remaining 7 are variants of the same formats
tested here and will pass the same regex/search code paths).

Refs #39

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
…eria

AGENTS.md:

* Rewrite the "Currently Implemented" bullet for Regex to list all
  three flags (`/c`, `/s`, `/l`) with their meanings, including the
  8192-byte hard cap, the unconditional multi-line regex compilation,
  and the `line_based` vs `start_offset` vs `case_insensitive`
  distinction.
* Rewrite the "Currently Implemented" bullet for Search to note the
  mandatory `NonZeroUsize` range and reference GNU `file` magic(5).
* Remove the "Current Limitations" entries for regex/search that no
  longer apply (the `/l`-as-anchor placeholder, the `search` range
  being optional, and the missing 8192-byte default). All of these
  were closed by the feature-completion commit.
* Remove the "Planned Features" entries for `/s`, `/Nl` line-count
  semantics (both are now implemented).
* Planned-features list now focuses on Aho-Corasick optimization,
  `!:mime`/`!:ext`/`!:apple` directives, and `use`/`name` named tests.

GOTCHAS.md:

* Section 2.7: regex `/l` is scan-window bounds only, not a multi-line
  toggle. Documents the removal of the old `^(?:...)` anchor wrapping
  and explains why `build_regex` now unconditionally sets
  `multi_line(true)` for every pattern.
* Section 2.8: regex scan window is always capped at 8192 bytes. Lists
  the three paths this applies to (default, explicit, line-based) and
  explains the DoS mitigation rationale. Calls out the interaction
  with GOTCHAS S13.1 (`EvaluationConfig::default()` no timeout).
* Section 2.9: regex `/s` flag affects anchor advance only. Documents
  the libmagic `REGEX_OFFSET_START` / `moffset()` correspondence and
  notes that tests for `/s` must check `regex_bytes_consumed` or
  downstream `Relative(N)` resolution -- checking `read_regex` alone
  won't detect a broken `/s` implementation.
* Section 2.6 touched up to clarify that search's `/s`-style start-
  offset flag is not currently supported (it only applies to regex).

Refs #39

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
@unclesp1d3r unclesp1d3r linked an issue Apr 11, 2026 that may be closed by this pull request
7 tasks
Copilot AI review requested due to automatic review settings April 11, 2026 05:54
@dosubot dosubot Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Apr 11, 2026
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Apr 11, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Require conventional commit format per https://www.conventionalcommits.org/en/v1.0.0/. Skipped for bots.

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?!?:

🟢 Full CI must pass

Wonderful, this rule succeeded.

All CI checks must pass. Release-plz PRs are exempt because they only bump versions and changelogs (code was already tested on main), and GITHUB_TOKEN-triggered force-pushes suppress CI.

  • check-success = coverage
  • check-success = quality
  • check-success = test
  • check-success = test-cross-platform (macos-latest, macOS)
  • check-success = test-cross-platform (ubuntu-22.04, Linux)
  • check-success = test-cross-platform (ubuntu-latest, Linux)
  • check-success = test-cross-platform (windows-latest, Windows)

🟢 Do not merge outdated PRs

Wonderful, this rule succeeded.

Make sure PRs are within 10 commits of the base branch before merging

  • #commits-behind <= 10

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 11, 2026

Summary by CodeRabbit

  • New Features

    • Implemented Regex and Search pattern types: Regex supports /c, /s, /l flags and optional numeric counts; both enforce bounded scan windows (Regex capped at 8192 bytes). Search requires a /N range and both advance anchors to match end.
  • Bug Fixes

    • Rejects invalid suffixes (e.g., bare search or search/0, empty regex suffix) and limits comparison to Equal/NotEqual for pattern types.
  • Documentation

    • Updated docs and compatibility status for Regex/Search.
  • Tests

    • Added extensive tests for matching, anchors, parsing, and operator validation.

Walkthrough

This PR implements pattern-bearing magic types: adds TypeKind::Regex and TypeKind::Search, parser support for regex/... flags and search/N ranges, evaluator changes to perform binary-safe regex/search matching and to advance anchors by match boundaries, updates strength scoring and tests, and promotes regex to a runtime dependency.

Changes

Cohort / File(s) Summary
AST & Parser Types
src/parser/ast.rs, src/parser/types.rs
Added TypeKind::Regex { flags, count } and TypeKind::Search { range }; introduced RegexFlags and REGEX_MAX_BYTES = 8192; extended keyword parsing to accept regex/search.
Grammar Parsing
src/parser/grammar/mod.rs, src/parser/grammar/tests/mod.rs
Parse regex suffix (/c, /s, /l, optional non-zero decimal count) and mandatory search/N (NonZeroUsize); added unit tests for valid/invalid suffix forms and rule round-trips.
Codegen
src/parser/codegen.rs
serialize_type_kind extended to emit RegexFlags and Search range literals.
Evaluator Types API
src/evaluator/types/mod.rs, src/evaluator/types/regex.rs, src/evaluator/types/search.rs
Introduced read_typed_value_with_pattern(...), read_pattern_match(...), and bytes_consumed_with_pattern(...); new regex and search modules implement GNU-style binary-safe regex (8192-byte cap, flags) and bounded literal search (memchr::memmem), including consumption semantics (advance to match end / special /s start-offset).
Evaluator Engine & Tests
src/evaluator/engine/mod.rs, src/evaluator/engine/tests.rs, src/evaluator/types/tests.rs
Engine dispatches pattern kinds to read_pattern_match, enforces only =/!= for these types, threads patterns into consumption, and includes unit tests for operator rejection, anchor advancement, zero-width matches, and search window behavior.
Strength Scoring
src/evaluator/strength.rs
Extended default strength scoring to account for Regex { count } and Search { .. }.
Docs & Gotchas
AGENTS.md, GOTCHAS.md, docs/src/* (evaluator.md, ast-structures.md, parser.md, compatibility.md)
Documented regex/search semantics, flags (/c, /s, /l), 8192-byte cap, API read_typed_value_with_pattern/read_pattern_match, changed implementation status and gotchas.
Cargo
Cargo.toml
Promoted regex = "1.12.3" to normal dependency and bumped clap_complete to 4.6.1.

Sequence Diagram(s)

sequenceDiagram
    participant Engine as Evaluator Engine
    participant Types as Types Dispatch
    participant Pattern as Regex/Search Module
    participant Anchor as Anchor Computation

    Engine->>Types: evaluate_single_rule_with_anchor(rule, buf, off)
    alt rule.typ == Regex or Search
        Types->>Pattern: read_pattern_match(buf, off, rule.typ, Some(&rule.value))
        Pattern-->>Types: Ok(Some(Value)) / Ok(None) / Err
        Types-->>Engine: matched? (value.is_some())
        Engine->>Engine: allow only '=' or '!=' else Err
        Engine->>Anchor: bytes_consumed_with_pattern(buf, off, rule.typ, Some(&rule.value))
        Anchor-->>Engine: consumed = match_end or match_start (if /s)
    else
        Types->>Types: read_typed_value_with_pattern(buf, off, type_kind, Some/None)
        Types-->>Engine: Value
        Engine->>Engine: apply operators normally
    end
    Engine-->>Engine: advance relative children by consumed
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

✨ Patterns wake and scanners sing,
Flags and ranges, match and bring,
Bytes confined within a bound,
Anchors move where hits are found,
Rules now chase the patterned ring.

🚥 Pre-merge checks | ✅ 9 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Libmagic-Compatibility-Check ⚠️ Warning Pull request contains documentation inaccuracies regarding search anchor advancement behavior and unsupported regex lookahead features that contradict implementation. Update ast-structures.md to clarify match-end semantics, remove lookahead references from parser.md and regex.rs, and document unsupported regex features.
✅ Passed checks (9 passed)
Check name Status Explanation
Title check ✅ Passed Title follows Conventional Commits: feat scope is 'evaluator', description clearly indicates regex and search type implementation, includes issue reference.
Description check ✅ Passed Description thoroughly details the implementation of Regex and Search types, acceptance criteria, test plan, and semantic correctness against GNU file reference source.
Linked Issues check ✅ Passed PR comprehensively addresses all #39 acceptance criteria: TypeKind extended [ast.rs], parser recognizes regex/search with flags [grammar.rs], evaluator uses regex::bytes::Regex [types/regex.rs, engine/mod.rs], search supports range-limited searching [types/search.rs], common flags supported [RegexFlags], ~50 unit tests and corpus tests added.
Out of Scope Changes check ✅ Passed All changes are scoped to implementing Regex and Search types: parser AST/grammar, evaluator types/engine/strength, documentation, Cargo dependency updates. No unrelated refactoring or feature creep detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 85.00%.
Memory-Safety-Check ✅ Passed No unsafe blocks, transmute operations, or raw pointer arithmetic introduced. All buffer access properly bounds-checked before use with offset validation returning BufferOverrun errors when out of bounds.
Performance-Regression-Check ✅ Passed Regex pattern double compilation is a deliberately documented trade-off aligned with the project's Safety over performance philosophy, with comprehensive error handling and impact limited to matched rules.
Test-Coverage-Check ✅ Passed Pull request provides 59 comprehensive tests covering regex/search functionality with unit, integration, and property-based tests validating core logic, error handling, and boundary conditions.
Error-Handling-Check ✅ Passed Pull request demonstrates excellent error handling following idiomatic Rust patterns with all public APIs returning Result types and no unsafe panicking in library code.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@dosubot dosubot Bot added evaluator Rule evaluation engine and logic parser Magic file parsing components and grammar testing Test infrastructure and coverage labels Apr 11, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements full parser + evaluator support for libmagic’s pattern-bearing regex and search rule types, enabling GNU file-compatible text-format detection and relative-offset chaining needed by the #39 corpus fixtures.

Changes:

  • Added TypeKind::Regex and TypeKind::Search (plus RegexFlags / REGEX_MAX_BYTES) and parser support for regex/[csl] + optional count and mandatory search/N.
  • Implemented evaluator readers for regex (binary-safe via regex::bytes) and search (bounded scan via memchr::memmem), including correct anchor advancement for relative offsets.
  • Added extensive unit + integration + corpus tests and updated project docs/manifests to reflect the new feature set.

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/regex_search_corpus_tests.rs New end-to-end corpus integration tests for regex/search behavior and anchor advancement.
tests/property_tests.rs Extends TypeKind generators to include Regex/Search for property-based coverage.
tests/evaluator_tests.rs Adds an integration test for the upstream regex-eol corpus scenario via programmatic rule trees.
src/parser/types.rs Recognizes regex/search keywords and provides placeholder TypeKind mapping for round-trip tests.
src/parser/grammar/tests/mod.rs Adds parser tests covering regex/search suffix forms, errors, and operator adjacency.
src/parser/grammar/mod.rs Implements parsing of regex/... modifiers (flags + count) and mandatory search/N range.
src/parser/codegen.rs Adds codegen serialization for TypeKind::Regex and TypeKind::Search.
src/parser/ast.rs Adds TypeKind::Regex/Search, RegexFlags, and REGEX_MAX_BYTES definitions + rustdoc.
src/evaluator/types/tests.rs Updates anchor-consumption tests to use the new pattern-aware consumption path and adds regex/search consumption cases.
src/evaluator/types/search.rs New bounded literal-search reader and anchor-advance computation.
src/evaluator/types/regex.rs New binary-safe regex reader with scan-window computation and anchor-advance computation.
src/evaluator/types/mod.rs Adds pattern-aware read/dispatch (read_typed_value_with_pattern, read_pattern_match) and bytes_consumed_with_pattern.
src/evaluator/strength.rs Assigns default strength for Regex/Search types.
src/evaluator/engine/tests.rs Adds engine-level tests for regex/search matching semantics and operator restrictions.
src/evaluator/engine/mod.rs Updates engine evaluation to special-case pattern-bearing types and thread patterns into anchor advancement.
GOTCHAS.md Documents new invariants/semantics for pattern-bearing types and corpus-testing guidance.
docs/src/evaluator.md Documents Regex/Search in evaluator type-reading section and the pattern-aware API.
docs/solutions/integration-issues/implementing-variable-width-typekind-variant.md New solution doc describing the integration work and pitfalls encountered.
Cargo.toml Adds regex to main dependencies and bumps clap_complete.
Cargo.lock Updates lockfile for clap_complete version bump.
AGENTS.md Updates project capability docs to mark Regex/Search as implemented and describes semantics/limits.
.serena/project.yml Documentation/formatting updates to Serena project configuration.

Comment thread docs/src/evaluator.md
- **QDate**: 64-bit Unix timestamps (signed seconds since epoch) with configurable endianness and UTC/local time formatting
- **String**: Byte sequences with length limits
- **PString**: Pascal-style length-prefixed strings with 1-byte (`/B`), 2-byte (`/H` or `/h`), or 4-byte (`/L` or `/l`) length prefixes, supporting big-endian and little-endian byte order
- **Regex**: Binary-safe regex matching via `regex::bytes::Regex`; the `/c` flag enables case-insensitive matching and `/l` enables multi-line start-of-line anchoring
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs claim the /l flag “enables multi-line start-of-line anchoring”, but the implementation compiles regexes with multi-line mode unconditionally and /l only changes how the scan window is bounded (line-count vs byte-count). Please update this bullet to reflect the actual /l semantics (and that ^/$ line-boundary behavior is always enabled).

Suggested change
- **Regex**: Binary-safe regex matching via `regex::bytes::Regex`; the `/c` flag enables case-insensitive matching and `/l` enables multi-line start-of-line anchoring
- **Regex**: Binary-safe regex matching via `regex::bytes::Regex`; the `/c` flag enables case-insensitive matching, `^`/`$` line-boundary behavior is always enabled, and `/l` switches scan-window bounding to line-count semantics rather than byte-count semantics

Copilot uses AI. Check for mistakes.
Comment thread src/parser/ast.rs
Comment on lines +386 to +394
/// * `None`: use the 8192-byte default scan window.
/// * `Some(n)` with `flags.line_based == false`: scan at most `n`
/// bytes, capped at 8192.
/// * `Some(n)` with `flags.line_based == true`: scan at most `n`
/// lines, with an effective byte cap of `min(n * 80, 8192)`.
///
/// The 8192-byte hard cap matches GNU `file`'s `FILE_REGEX_MAX` and
/// prevents runaway regex scans against large buffers.
count: Option<NonZeroU32>,
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TypeKind::Regex rustdoc says line-based mode uses an effective byte cap of min(n * 80, 8192), but the evaluator’s compute_window implementation bounds the window by scanning for line terminators up to the Nth line (and always caps at 8192 bytes) — there is no n * 80 behavior. Please adjust the doc comment to match the implemented windowing semantics so consumers don’t rely on the wrong contract.

Copilot uses AI. Check for mistakes.
Comment on lines +375 to +383
/// For `TypeKind::Regex`, the pattern is required to re-run the match and
/// compute the consumed bytes. When the pattern is unavailable (or not a
/// string), the function returns `0` -- the anchor will then stay put and
/// the next relative offset resolves against the previous anchor position,
/// which is the same graceful-degradation behavior used by the other
/// defensive paths in this module. For `TypeKind::Search`, the pattern is
/// not needed because the consumed distance is the entire search window
/// regardless of where the match was found. Non-pattern types should pass
/// `pattern: None`.
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc comment says TypeKind::Search doesn’t need the pattern because “the consumed distance is the entire search window regardless of where the match was found”, but the implementation re-scans and advances by match_idx + pattern.len() (and therefore requires the pattern). Please update the comment to reflect match-end semantics and the need for the pattern operand.

Suggested change
/// For `TypeKind::Regex`, the pattern is required to re-run the match and
/// compute the consumed bytes. When the pattern is unavailable (or not a
/// string), the function returns `0` -- the anchor will then stay put and
/// the next relative offset resolves against the previous anchor position,
/// which is the same graceful-degradation behavior used by the other
/// defensive paths in this module. For `TypeKind::Search`, the pattern is
/// not needed because the consumed distance is the entire search window
/// regardless of where the match was found. Non-pattern types should pass
/// `pattern: None`.
/// For `TypeKind::Regex` and `TypeKind::Search`, the pattern is required
/// to re-run the match and compute the consumed bytes. For search, the
/// consumed distance is measured to the end of the matched occurrence
/// (`match_idx + pattern.len()`), not the full search window. When the
/// pattern is unavailable (or not a string), the function returns `0` --
/// the anchor will then stay put and the next relative offset resolves
/// against the previous anchor position, which is the same
/// graceful-degradation behavior used by the other defensive paths in
/// this module. Non-pattern types should pass `pattern: None`.

Copilot uses AI. Check for mistakes.
//! pattern within a bounded window. Unlike `TypeKind::String`, which only
//! matches at the exact offset, `search` advances through the buffer looking
//! for the first occurrence of the pattern anywhere in the window. The
//! search window is `buffer[offset..]` capped by the optional `range`.
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Module-level docs describe the search window as being capped by an “optional range”, but TypeKind::Search stores range as NonZeroUsize and the grammar rejects bare search/search/0, so the range is mandatory. Please update the docs to avoid implying that an unbounded/optional range is supported.

Suggested change
//! search window is `buffer[offset..]` capped by the optional `range`.
//! search window is `buffer[offset..]` capped by the mandatory `range`
//! (or the remaining buffer length, if smaller).

Copilot uses AI. Check for mistakes.
Comment on lines +35 to +55
**Regex reader** (`src/evaluator/types/regex.rs`) — uses a `build_regex` helper that wraps the pattern in `^(?:...)` when `/l` is set so bare, unanchored patterns cannot match mid-line:

```rust
fn build_regex(
pattern: &str,
case_insensitive: bool,
start_of_line: bool,
) -> Result<Regex, regex::Error> {
let owned;
let effective_pattern: &str = if start_of_line {
owned = format!("^(?:{pattern})");
&owned
} else {
pattern
};
RegexBuilder::new(effective_pattern)
.case_insensitive(case_insensitive)
.multi_line(start_of_line)
.build()
}

Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This solution doc’s regex semantics (wrapping patterns with ^(?:...) when /l is set and toggling multi_line(start_of_line)) conflict with the current implementation and GOTCHAS: regex multi-line mode is unconditional, and /l controls only scan-window bounding (line-based window), not anchoring. The read_search signature shown here also uses an optional range, which no longer matches the mandatory NonZeroUsize range. Please update this doc to match the merged behavior so it doesn’t mislead future contributors.

Copilot uses AI. Check for mistakes.
@dosubot
Copy link
Copy Markdown
Contributor

dosubot Bot commented Apr 11, 2026

Related Documentation

4 document(s) may need updating based on files changed in this PR:

libMagic-rs

ARCHITECTURE /libmagic-rs/blob/main/docs/ARCHITECTURE.md — ⏳ Awaiting Merge
ast-structures /libmagic-rs/blob/main/docs/src/ast-structures.md — ⏳ Awaiting Merge
compatibility /libmagic-rs/blob/main/docs/src/compatibility.md — ⏳ Awaiting Merge
parser /libmagic-rs/blob/main/docs/src/parser.md — ⏳ Awaiting Merge

Note: You must be authenticated to accept/decline updates.

How did I do? Any feedback?

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 11, 2026

Codecov Report

❌ Patch coverage is 89.28571% with 60 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/evaluator/types/mod.rs 61.64% 28 Missing ⚠️
src/parser/codegen.rs 0.00% 10 Missing ⚠️
src/evaluator/types/regex.rs 96.52% 8 Missing ⚠️
src/evaluator/strength.rs 0.00% 5 Missing ⚠️
src/evaluator/types/search.rs 95.41% 5 Missing ⚠️
src/parser/grammar/mod.rs 95.00% 4 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/parser/grammar/mod.rs (1)

394-563: 🛠️ Refactor suggestion | 🟠 Major

Extract the regex/search suffix parsers instead of growing parse_type_and_operator further.

This function now owns pstring suffixes, regex modifiers, mandatory search ranges, attached &mask, and fallback operator parsing. The behavior is fine, but the new branches make the operator-boundary rules much harder to audit and push an already-oversized parser file further past the repo’s size target. Pulling the regex/search pieces into parse_regex_suffix(...) / parse_search_suffix(...) helpers would make this logic easier to test and maintain.

As per coding guidelines, "Keep source files under 500-600 lines; split larger files into focused modules" and "Module structure: Use src/parser/ for parsing system with ast.rs, grammar/, types.rs, hierarchy.rs, loader.rs, and codegen.rs".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/parser/grammar/mod.rs` around lines 394 - 563, parse_type_and_operator is
too large and contains inline parsing for regex and search suffixes which should
be extracted; create two helper functions parse_regex_suffix(input: &str) ->
IResult<&str, (RegexFlags, Option<u32>)> and parse_search_suffix(input: &str) ->
IResult<&str, ::std::num::NonZeroUsize> (reuse existing parse_pstring_suffix for
pstring) and replace the large inline blocks in parse_type_and_operator with
calls to these helpers, ensuring they return the remaining rest string and
parsed values (flags/count or range) and preserve the same error semantics and
operator-boundary behavior so later logic (attached_op parsing and
type_keyword_to_kind handling) continues to work unchanged.
docs/src/evaluator.md (1)

595-596: ⚠️ Potential issue | 🟡 Minor

Implementation status is stale after this PR.

The checklist still shows regex support as unimplemented even though this change adds TypeKind::Regex/TypeKind::Search, parser support, evaluator wiring, and tests. That will send readers to the wrong source of truth immediately after merge.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/src/evaluator.md` around lines 595 - 596, The docs checklist is out of
date—mark Regex support as implemented: update the evaluator.md checklist entry
for "Regex type support" to checked or remove it so it reflects that
TypeKind::Regex and TypeKind::Search, parser support, evaluator wiring, and
tests are present; ensure the doc text mentions the implemented symbols
(TypeKind::Regex, TypeKind::Search, parser/evaluator wiring) so readers aren't
directed to the wrong source of truth after merge.
src/evaluator/types/mod.rs (1)

183-239: ⚠️ Potential issue | 🟠 Major

Keep a lossless public API for pattern reads.

Lines 219-225 and 237-238 collapse Ok(None) into Value::String(""), while Lines 261-295 keep the only lossless matcher crate-private. That makes the new public read_typed_value_with_pattern ambiguous for external callers: a regex miss is indistinguishable from a zero-width hit, and a search miss can masquerade as an empty-string match. Please either expose a lossless Option/enum-based API publicly or keep this helper internal until that contract exists.

Also applies to: 261-295

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/evaluator/types/mod.rs` around lines 183 - 239, The public helper
read_typed_value_with_pattern currently collapses None results from
read_regex/read_search into Value::String("") for TypeKind::Regex and
TypeKind::Search, losing the distinction between a miss and an empty/zero-width
match; either change the public API to return an Option/enum that preserves None
(e.g., Result<Option<Value>, TypeReadError> or a small enum MatchResult) or make
read_typed_value_with_pattern crate-private/internal until you add that lossless
return type; update callers accordingly and keep references to TypeKind::Regex,
TypeKind::Search, read_regex, and read_search to locate the affected branches to
implement the lossless return or visibility change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@AGENTS.md`:
- Line 307: Update the AGENTS.md text that claims regex/search coverage is
incomplete: locate the sentence "Note: Currently implemented types are `Byte`,
`Short`, `Long`, `Quad`, `Float`, `Double`, `Date`, `QDate`, `String`,
`PString`, `Regex`, and `Search`" and the later lines mentioning regex flag
completeness/search range enforcement (the "Current Limitations"/future-phase
notes) and revise them to reflect that `Regex` and `Search` support has been
implemented by this PR—remove or reword the TODO language so the document
consistently states that regex/search flags and range enforcement are
implemented, and update the "Current Limitations" section to only list remaining
true gaps.

In `@docs/src/evaluator.md`:
- Around line 135-157: Update the docs to reflect the current pattern behavior:
clarify that the /l flag no longer enables multiline anchoring but only sets a
line-based window size, that multiline matching is now unconditional (not gated
by /l), and that the engine does not always call read_typed_value_with_pattern
for pattern-bearing rules—evaluate_single_rule_with_anchor now calls
read_pattern_match to preserve zero-width-match vs miss semantics
(read_typed_value remains the None-pattern wrapper). Adjust the paragraphs
describing /l and the engine call path accordingly and reference
read_typed_value_with_pattern, read_pattern_match,
evaluate_single_rule_with_anchor, and read_typed_value when explaining the
corrected behavior.

In `@src/evaluator/engine/tests.rs`:
- Around line 2597-2605: Update the test comment in
test_search_parent_relative_child_at_positive_offset to reflect that this is a
positive-offset sanity check: change phrasing from “Sanity check the negative”
and any mention of `Relative(-N)` to indicate a positive relative offset (e.g.,
`OffsetSpec::Relative(1)`), and ensure the comment correctly describes that a
`Relative(1)` child resolves against the match-end anchor for `MagicRule` in
this test.

In `@src/evaluator/strength.rs`:
- Around line 80-93: The test_test_calculate_default_strength_table currently
doesn't cover the new scoring branches for TypeKind::Regex and TypeKind::Search;
update the test_test_calculate_default_strength_table to explicitly construct
TypeKind::Regex variants (one with count = Some(...) expecting strength 25 and
one with count = None expecting strength 20) and a TypeKind::Search (expecting
strength 25), call calculate_default_strength_table (or the library function
that maps TypeKind -> strength) and assert those expected strength values so
future changes to scoring will fail the test if regressed; reference
TypeKind::Regex, TypeKind::Search, calculate_default_strength_table, and
test_calculate_default_strength_table when adding the assertions.

In `@src/evaluator/types/mod.rs`:
- Around line 375-383: Update the rustdoc for the TypeKind handling: change the
incorrect statement that "TypeKind::Search does not need the pattern and
consumes the full window" to accurately say that TypeKind::Search uses the
provided pattern/needle to find the match end and computes the consumed distance
from the anchor to the end of that match (i.e., callers pass the needle into
search_bytes_consumed to determine the consumed bytes), and mention the fallback
behavior when the pattern is missing or not a string; also update the similar
comment near the search_bytes_consumed usage (the lines that pass the needle
into search_bytes_consumed) so both places describe the current match-end
behavior consistently.

In `@src/evaluator/types/regex.rs`:
- Around line 82-121: The window walker in compute_window (the logic using
buffer.get(offset..), byte_cap, capped and the while loop over capped[idx]) must
be rewritten to use the SafeBufferAccess/.get() pattern everywhere instead of
direct slicing or indexing; replace direct slices like &capped[..count_bytes]
and &capped[..idx] and indexed access capped[idx] and capped[idx+1] with calls
to buffer.get(...) or capped.get(...) and handle Option results (early return or
break) to maintain the same CR/CRLF and line-count behavior, keeping the same
REGEX_MAX_BYTES cap, the flags.line_based branch, and the
target_lines/lines_seen logic but using safe .get() lookups and saturation
semantics.

In `@src/evaluator/types/search.rs`:
- Around line 53-55: Both read_search and search_bytes_consumed create a window
using direct slicing (let remaining = &buffer[offset..]; let window =
&remaining[..window_len];) which violates the SafeBufferAccess pattern; replace
these raw slices with SafeBufferAccess .get() calls to perform bounds-checked
access (e.g. use buffer.get(offset..offset+window_len) or
buffer.get(offset..).and_then(|r| r.get(..window_len))) and handle the Option
result consistently. Update the code paths that compute window_len from
range.get() so they call .get(range_start..range_start+window_len) (or chain
.get on remaining) and return/propagate a safe empty/nonexistent slice when
.get() yields None; apply the same change to the other occurrence around lines
~90-100 so both read_search and search_bytes_consumed use .get() for all buffer
windows.

In `@src/parser/ast.rs`:
- Around line 383-394: The doc for the AST field `count: Option<NonZeroU32>` is
incorrect about line-based behavior; update the rustdoc on `count` (and nearby
comment for `/l`) to reflect the real semantics implemented in
`evaluator::types::regex.rs` (see `compute_window`): when `flags.line_based ==
true` the code advances to the Nth line terminator and only afterward enforces
the 8192-byte hard cap, not an a priori byte estimate of `min(n * 80, 8192)`;
add a short example showing `regex/Nl` walking to the Nth newline then trimming
to 8192 bytes and a brief note about the performance/design tradeoff to justify
walking lines before the cap.

In `@src/parser/codegen.rs`:
- Around line 235-253: The generated code emits
libmagic_rs::parser::ast::RegexFlags which doesn't match in-crate imports; in
the TypeKind::Regex branch of src/parser/codegen.rs (the match arm constructing
the format! string for Regex), replace the path
libmagic_rs::parser::ast::RegexFlags with crate::parser::ast::RegexFlags so the
produced string uses crate::parser::ast::RegexFlags {{ case_insensitive: ...,
start_offset: ..., line_based: ... }} and keep the existing flags.field
interpolations and count handling unchanged.

In `@src/parser/grammar/tests/mod.rs`:
- Around line 2279-2446: Extract the regex/search tests (functions
test_parse_type_and_operator_regex_and_search_suffixes,
test_parse_type_and_operator_search_requires_range,
test_parse_type_and_operator_regex_invalid_suffix,
test_parse_type_and_operator_regex_operator_adjacent,
test_parse_magic_rule_regex_and_search) into a new test file named
src/parser/grammar/tests/regex_search.rs, preserving all use/imports (e.g.
crate::parser::ast::{RegexFlags, TypeKind}, OffsetSpec, Operator, Value) and
helper functions rx and sr; remove those tests from the large mod.rs and add mod
regex_search; to the tests module so the new file is compiled, ensuring any
relative paths remain crate::... so nothing else needs adjusting, then run cargo
test to verify.

In `@src/parser/types.rs`:
- Around line 319-321: Replace the panic-based construction in the
TypeKind::Search mapping by removing the expect() call and using a safe,
explicit non-panicking construction; specifically, change
::std::num::NonZeroUsize::new(1).expect("1 is nonzero") to an explicit unsafe
construction like unsafe { ::std::num::NonZeroUsize::new_unchecked(1) } (with a
brief comment that 1 is statically non-zero) so the library code doesn't call
expect()/panic.

---

Outside diff comments:
In `@docs/src/evaluator.md`:
- Around line 595-596: The docs checklist is out of date—mark Regex support as
implemented: update the evaluator.md checklist entry for "Regex type support" to
checked or remove it so it reflects that TypeKind::Regex and TypeKind::Search,
parser support, evaluator wiring, and tests are present; ensure the doc text
mentions the implemented symbols (TypeKind::Regex, TypeKind::Search,
parser/evaluator wiring) so readers aren't directed to the wrong source of truth
after merge.

In `@src/evaluator/types/mod.rs`:
- Around line 183-239: The public helper read_typed_value_with_pattern currently
collapses None results from read_regex/read_search into Value::String("") for
TypeKind::Regex and TypeKind::Search, losing the distinction between a miss and
an empty/zero-width match; either change the public API to return an Option/enum
that preserves None (e.g., Result<Option<Value>, TypeReadError> or a small enum
MatchResult) or make read_typed_value_with_pattern crate-private/internal until
you add that lossless return type; update callers accordingly and keep
references to TypeKind::Regex, TypeKind::Search, read_regex, and read_search to
locate the affected branches to implement the lossless return or visibility
change.

In `@src/parser/grammar/mod.rs`:
- Around line 394-563: parse_type_and_operator is too large and contains inline
parsing for regex and search suffixes which should be extracted; create two
helper functions parse_regex_suffix(input: &str) -> IResult<&str, (RegexFlags,
Option<u32>)> and parse_search_suffix(input: &str) -> IResult<&str,
::std::num::NonZeroUsize> (reuse existing parse_pstring_suffix for pstring) and
replace the large inline blocks in parse_type_and_operator with calls to these
helpers, ensuring they return the remaining rest string and parsed values
(flags/count or range) and preserve the same error semantics and
operator-boundary behavior so later logic (attached_op parsing and
type_keyword_to_kind handling) continues to work unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ad7f0fa2-ae76-4ad8-a150-7da5774a11fa

📥 Commits

Reviewing files that changed from the base of the PR and between 4c050d6 and 2756f8c.

⛔ Files ignored due to path filters (6)
  • .serena/project.yml is excluded by none and included by none
  • Cargo.lock is excluded by !**/*.lock and included by none
  • docs/solutions/integration-issues/implementing-variable-width-typekind-variant.md is excluded by none and included by none
  • tests/evaluator_tests.rs is excluded by none and included by none
  • tests/property_tests.rs is excluded by none and included by none
  • tests/regex_search_corpus_tests.rs is excluded by none and included by none
📒 Files selected for processing (16)
  • AGENTS.md
  • Cargo.toml
  • GOTCHAS.md
  • docs/src/evaluator.md
  • src/evaluator/engine/mod.rs
  • src/evaluator/engine/tests.rs
  • src/evaluator/strength.rs
  • src/evaluator/types/mod.rs
  • src/evaluator/types/regex.rs
  • src/evaluator/types/search.rs
  • src/evaluator/types/tests.rs
  • src/parser/ast.rs
  • src/parser/codegen.rs
  • src/parser/grammar/mod.rs
  • src/parser/grammar/tests/mod.rs
  • src/parser/types.rs

Comment thread AGENTS.md
### Adding New Type Support

> **Note:** Currently implemented types are `Byte`, `Short`, `Long`, `Quad`, `Float`, `Double`, `Date`, `QDate`, `String`, and `PString`. Regex and search types are planned for future releases.
> **Note:** Currently implemented types are `Byte`, `Short`, `Long`, `Quad`, `Float`, `Double`, `Date`, `QDate`, `String`, `PString`, `Regex`, and `Search`. See "Current Limitations" for the remaining gaps in regex/search flag coverage.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

AGENTS still presents regex/search work as incomplete.

Line 307 refers readers to “remaining gaps in regex/search flag coverage”, and Lines 462-463 still put regex flag completeness/search range enforcement in a future phase. That now conflicts with the implementation this PR adds, so future contributors will get mixed guidance from the same document.

Also applies to: 462-463

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@AGENTS.md` at line 307, Update the AGENTS.md text that claims regex/search
coverage is incomplete: locate the sentence "Note: Currently implemented types
are `Byte`, `Short`, `Long`, `Quad`, `Float`, `Double`, `Date`, `QDate`,
`String`, `PString`, `Regex`, and `Search`" and the later lines mentioning regex
flag completeness/search range enforcement (the "Current
Limitations"/future-phase notes) and revise them to reflect that `Regex` and
`Search` support has been implemented by this PR—remove or reword the TODO
language so the document consistently states that regex/search flags and range
enforcement are implemented, and update the "Current Limitations" section to
only list remaining true gaps.

Comment thread docs/src/evaluator.md
Comment on lines +135 to +157
- **Regex**: Binary-safe regex matching via `regex::bytes::Regex`; the `/c` flag enables case-insensitive matching and `/l` enables multi-line start-of-line anchoring
- **Search**: Bounded literal pattern scan via `memchr::memmem::find`; `search/N` caps the scan window to `N` bytes from the offset
- **Bounds checking**: Prevents buffer overruns

```rust
// Non-pattern types use the 3-arg convenience wrapper:
pub fn read_typed_value(
buffer: &[u8],
offset: usize,
type_kind: &TypeKind,
) -> Result<Value, TypeReadError>

// Pattern-bearing types (Regex, Search) thread the rule's value operand
// through as the match pattern:
pub fn read_typed_value_with_pattern(
buffer: &[u8],
offset: usize,
type_kind: &TypeKind,
pattern: Option<&Value>,
) -> Result<Value, TypeReadError>
```

The engine uses `read_typed_value_with_pattern` uniformly and passes `Some(&rule.value)` for every rule; the convenience `read_typed_value` is a thin wrapper that forwards `pattern: None`. For pattern-bearing types a genuine "no match" is collapsed to `Value::String(String::new())` in the `read_typed_value_with_pattern` return so the back-compat `Value` shape is preserved; the engine instead calls `read_pattern_match` directly, which returns `Result<Option<Value>, _>` so zero-width matches (e.g. `^`, `a*`) can be distinguished from genuine misses.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

These docs still describe the pre-fix regex/search behavior.

Line 135 says /l enables multiline anchoring, but the implementation now treats /l only as line-based window sizing; multiline matching is unconditional. The paragraph at Line 157 also says the engine “uses read_typed_value_with_pattern uniformly”, but evaluate_single_rule_with_anchor now special-cases pattern-bearing types through read_pattern_match specifically to preserve zero-width-match vs miss semantics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/src/evaluator.md` around lines 135 - 157, Update the docs to reflect the
current pattern behavior: clarify that the /l flag no longer enables multiline
anchoring but only sets a line-based window size, that multiline matching is now
unconditional (not gated by /l), and that the engine does not always call
read_typed_value_with_pattern for pattern-bearing
rules—evaluate_single_rule_with_anchor now calls read_pattern_match to preserve
zero-width-match vs miss semantics (read_typed_value remains the None-pattern
wrapper). Adjust the paragraphs describing /l and the engine call path
accordingly and reference read_typed_value_with_pattern, read_pattern_match,
evaluate_single_rule_with_anchor, and read_typed_value when explaining the
corrected behavior.

Comment on lines +2597 to +2605
/// Sanity check the negative: when the parent search finds the pattern
/// early in the window, a Relative(-N) child should still resolve against
/// the match-end anchor. This catches a class of bugs where the anchor
/// update uses the wrong base offset.
#[test]
fn test_search_parent_relative_child_at_positive_offset() {
// Buffer: "prefix_NEEDLE_after_stuff" -- "NEEDLE" is at index 7, len
// 6, match-end = 13. A Relative(1) child should read byte 14 = 'a'.
let child = MagicRule {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Correct the test comment wording (“negative” vs positive offset).

The comment says “Sanity check the negative” and references Relative(-N), but the test uses OffsetSpec::Relative(1). Update wording to avoid future confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/evaluator/engine/tests.rs` around lines 2597 - 2605, Update the test
comment in test_search_parent_relative_child_at_positive_offset to reflect that
this is a positive-offset sanity check: change phrasing from “Sanity check the
negative” and any mention of `Relative(-N)` to indicate a positive relative
offset (e.g., `OffsetSpec::Relative(1)`), and ensure the comment correctly
describes that a `Relative(1)` child resolves against the match-end anchor for
`MagicRule` in this test.

Comment thread src/evaluator/strength.rs
Comment on lines +80 to +93
// Regex matches a pattern -- treat similarly to an unbounded string.
// A rule with an explicit `count` is more constrained (narrower scan
// window) and therefore more specific.
TypeKind::Regex { count, .. } => {
if count.is_some() {
25
} else {
20
}
}
// Search is always a bounded scan (the range is mandatory), so it
// gets the "constrained match" bonus unconditionally. This matches
// the max_length bonus used for String and PString.
TypeKind::Search { .. } => 25,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Add explicit strength tests for Regex and Search branches.

This introduces new scoring paths but test_calculate_default_strength_table doesn’t assert them yet. Add explicit cases so future tuning doesn’t regress silently.

As per coding guidelines: src/**/*.rs: Every PR must include tests for new functionality.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/evaluator/strength.rs` around lines 80 - 93, The
test_test_calculate_default_strength_table currently doesn't cover the new
scoring branches for TypeKind::Regex and TypeKind::Search; update the
test_test_calculate_default_strength_table to explicitly construct
TypeKind::Regex variants (one with count = Some(...) expecting strength 25 and
one with count = None expecting strength 20) and a TypeKind::Search (expecting
strength 25), call calculate_default_strength_table (or the library function
that maps TypeKind -> strength) and assert those expected strength values so
future changes to scoring will fail the test if regressed; reference
TypeKind::Regex, TypeKind::Search, calculate_default_strength_table, and
test_calculate_default_strength_table when adding the assertions.

Comment on lines +375 to +383
/// For `TypeKind::Regex`, the pattern is required to re-run the match and
/// compute the consumed bytes. When the pattern is unavailable (or not a
/// string), the function returns `0` -- the anchor will then stay put and
/// the next relative offset resolves against the previous anchor position,
/// which is the same graceful-degradation behavior used by the other
/// defensive paths in this module. For `TypeKind::Search`, the pattern is
/// not needed because the consumed distance is the entire search window
/// regardless of where the match was found. Non-pattern types should pass
/// `pattern: None`.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix the search anchor rustdoc.

Lines 380-382 say TypeKind::Search does not need the pattern and consumes the full window, but Lines 459-465 pass the needle into search_bytes_consumed. Please update the comment to describe the current match-end behavior; right now the docs encode the wrong invariant for future refactors.

Also applies to: 459-465

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/evaluator/types/mod.rs` around lines 375 - 383, Update the rustdoc for
the TypeKind handling: change the incorrect statement that "TypeKind::Search
does not need the pattern and consumes the full window" to accurately say that
TypeKind::Search uses the provided pattern/needle to find the match end and
computes the consumed distance from the anchor to the end of that match (i.e.,
callers pass the needle into search_bytes_consumed to determine the consumed
bytes), and mention the fallback behavior when the pattern is missing or not a
string; also update the similar comment near the search_bytes_consumed usage
(the lines that pass the needle into search_bytes_consumed) so both places
describe the current match-end behavior consistently.

Comment on lines +53 to +55
let remaining = &buffer[offset..];
let window_len = range.get().min(remaining.len());
let window = &remaining[..window_len];
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Replace direct slices with .get()-based window construction.

Both read_search and search_bytes_consumed build their scan windows with direct slicing. That works because window_len is pre-clamped, but evaluator code in this repo is supposed to stay on the .get()/SafeBufferAccess path consistently.

Proposed direction
-    let window = &remaining[..window_len];
+    let Some(window) = remaining.get(..window_len) else {
+        return Ok(None);
+    };
-    let window = &remaining[..window_len];
+    let Some(window) = remaining.get(..window_len) else {
+        return 0;
+    };
As per coding guidelines, "Use SafeBufferAccess trait pattern from src/io/mod.rs for bounds checking" and "All buffer access must use bounds checking with `.get()` methods. No raw pointer arithmetic or transmute operations."

Also applies to: 90-100

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/evaluator/types/search.rs` around lines 53 - 55, Both read_search and
search_bytes_consumed create a window using direct slicing (let remaining =
&buffer[offset..]; let window = &remaining[..window_len];) which violates the
SafeBufferAccess pattern; replace these raw slices with SafeBufferAccess .get()
calls to perform bounds-checked access (e.g. use
buffer.get(offset..offset+window_len) or buffer.get(offset..).and_then(|r|
r.get(..window_len))) and handle the Option result consistently. Update the code
paths that compute window_len from range.get() so they call
.get(range_start..range_start+window_len) (or chain .get on remaining) and
return/propagate a safe empty/nonexistent slice when .get() yields None; apply
the same change to the other occurrence around lines ~90-100 so both read_search
and search_bytes_consumed use .get() for all buffer windows.

Comment thread src/parser/ast.rs
Comment on lines +383 to +394
/// Optional numeric count from `regex/N[flags]`. Interpretation
/// depends on `flags.line_based`:
///
/// * `None`: use the 8192-byte default scan window.
/// * `Some(n)` with `flags.line_based == false`: scan at most `n`
/// bytes, capped at 8192.
/// * `Some(n)` with `flags.line_based == true`: scan at most `n`
/// lines, with an effective byte cap of `min(n * 80, 8192)`.
///
/// The 8192-byte hard cap matches GNU `file`'s `FILE_REGEX_MAX` and
/// prevents runaway regex scans against large buffers.
count: Option<NonZeroU32>,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Document /l the way compute_window actually behaves.

These field docs say line-based mode has an effective byte cap of min(n * 80, 8192), but the implementation in src/evaluator/types/regex.rs walks forward to the Nth line terminator and only then applies the 8192-byte hard cap. As written, the public AST docs give the wrong mental model for regex/Nl.

Proposed doc fix
-        /// * `Some(n)` with `flags.line_based == true`: scan at most `n`
-        ///   lines, with an effective byte cap of `min(n * 80, 8192)`.
+        /// * `Some(n)` with `flags.line_based == true`: scan through the
+        ///   end of the Nth line terminator, capped at 8192 bytes.
As per coding guidelines, "Public APIs require rustdoc with examples" and "Explain design decisions and trade-offs in code comments".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/parser/ast.rs` around lines 383 - 394, The doc for the AST field `count:
Option<NonZeroU32>` is incorrect about line-based behavior; update the rustdoc
on `count` (and nearby comment for `/l`) to reflect the real semantics
implemented in `evaluator::types::regex.rs` (see `compute_window`): when
`flags.line_based == true` the code advances to the Nth line terminator and only
afterward enforces the 8192-byte hard cap, not an a priori byte estimate of
`min(n * 80, 8192)`; add a short example showing `regex/Nl` walking to the Nth
newline then trimming to 8192 bytes and a brief note about the
performance/design tradeoff to justify walking lines before the cap.

Comment thread src/parser/codegen.rs
Comment on lines +235 to +253
TypeKind::Regex { flags, count } => {
let count_lit = match count {
Some(n) => format!("::std::num::NonZeroU32::new({}).unwrap()", n.get()),
None => String::new(),
};
let count_expr = if count.is_some() {
format!("Some({count_lit})")
} else {
"None".to_string()
};
format!(
"TypeKind::Regex {{ flags: libmagic_rs::parser::ast::RegexFlags {{ case_insensitive: {}, start_offset: {}, line_based: {} }}, count: {count_expr} }}",
flags.case_insensitive, flags.start_offset, flags.line_based
)
}
TypeKind::Search { range } => format!(
"TypeKind::Search {{ range: ::std::num::NonZeroUsize::new({}).unwrap() }}",
range.get()
),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "1) Check whether crate self-alias exists:"
rg -n --type=rust 'extern crate self as libmagic_rs' src build.rs || true

echo
echo "2) Find generated-source import style and emitted RegexFlags path:"
rg -n --type=rust 'use crate::parser::ast::|libmagic_rs::parser::ast::RegexFlags' src/parser/codegen.rs

echo
echo "3) Find where generated builtin rules are included/consumed:"
rg -n --type=rust 'generate_builtin_rules|BUILTIN_RULES|include!\(' src build.rs

Repository: EvilBit-Labs/libmagic-rs

Length of output: 2533


Fix generated code path for RegexFlags to use crate:: prefix.

The codegen emits libmagic_rs::parser::ast::RegexFlags on line 246, but generated code imports from crate::parser::ast::... (line 32) with no crate self-alias. This causes a compilation error when Regex rules are serialized. Change to crate::parser::ast::RegexFlags to match the generated import style and ensure in-crate resolution.

Suggested patch
-            format!(
-                "TypeKind::Regex {{ flags: libmagic_rs::parser::ast::RegexFlags {{ case_insensitive: {}, start_offset: {}, line_based: {} }}, count: {count_expr} }}",
-                flags.case_insensitive, flags.start_offset, flags.line_based
-            )
+            format!(
+                "TypeKind::Regex {{ flags: crate::parser::ast::RegexFlags {{ case_insensitive: {}, start_offset: {}, line_based: {} }}, count: {count_expr} }}",
+                flags.case_insensitive, flags.start_offset, flags.line_based
+            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
TypeKind::Regex { flags, count } => {
let count_lit = match count {
Some(n) => format!("::std::num::NonZeroU32::new({}).unwrap()", n.get()),
None => String::new(),
};
let count_expr = if count.is_some() {
format!("Some({count_lit})")
} else {
"None".to_string()
};
format!(
"TypeKind::Regex {{ flags: libmagic_rs::parser::ast::RegexFlags {{ case_insensitive: {}, start_offset: {}, line_based: {} }}, count: {count_expr} }}",
flags.case_insensitive, flags.start_offset, flags.line_based
)
}
TypeKind::Search { range } => format!(
"TypeKind::Search {{ range: ::std::num::NonZeroUsize::new({}).unwrap() }}",
range.get()
),
TypeKind::Regex { flags, count } => {
let count_lit = match count {
Some(n) => format!("::std::num::NonZeroU32::new({}).unwrap()", n.get()),
None => String::new(),
};
let count_expr = if count.is_some() {
format!("Some({count_lit})")
} else {
"None".to_string()
};
format!(
"TypeKind::Regex {{ flags: crate::parser::ast::RegexFlags {{ case_insensitive: {}, start_offset: {}, line_based: {} }}, count: {count_expr} }}",
flags.case_insensitive, flags.start_offset, flags.line_based
)
}
TypeKind::Search { range } => format!(
"TypeKind::Search {{ range: ::std::num::NonZeroUsize::new({}).unwrap() }}",
range.get()
),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/parser/codegen.rs` around lines 235 - 253, The generated code emits
libmagic_rs::parser::ast::RegexFlags which doesn't match in-crate imports; in
the TypeKind::Regex branch of src/parser/codegen.rs (the match arm constructing
the format! string for Regex), replace the path
libmagic_rs::parser::ast::RegexFlags with crate::parser::ast::RegexFlags so the
produced string uses crate::parser::ast::RegexFlags {{ case_insensitive: ...,
start_offset: ..., line_based: ... }} and keep the existing flags.field
interpolations and count handling unchanged.

Comment on lines +2279 to +2446
#[test]
fn test_parse_type_and_operator_regex_and_search_suffixes() {
use crate::parser::ast::{RegexFlags, TypeKind};
use std::num::{NonZeroU32, NonZeroUsize};

fn rx(case: bool, start: bool, line: bool, count: Option<u32>) -> TypeKind {
TypeKind::Regex {
flags: RegexFlags {
case_insensitive: case,
start_offset: start,
line_based: line,
},
count: count.and_then(NonZeroU32::new),
}
}
fn sr(n: usize) -> TypeKind {
TypeKind::Search {
range: NonZeroUsize::new(n).unwrap(),
}
}

let cases: &[(&str, TypeKind, &str)] = &[
("regex", rx(false, false, false, None), ""),
("regex/c", rx(true, false, false, None), ""),
("regex/l", rx(false, false, true, None), ""),
("regex/s", rx(false, true, false, None), ""),
("regex/cl", rx(true, false, true, None), ""),
("regex/lc", rx(true, false, true, None), ""),
("regex/cs", rx(true, true, false, None), ""),
("regex/csl", rx(true, true, true, None), ""),
("regex/1l", rx(false, false, true, Some(1)), ""),
("regex/l1", rx(false, false, true, Some(1)), ""),
("regex/1c", rx(true, false, false, Some(1)), ""),
("regex/256", rx(false, false, false, Some(256)), ""),
("regex/c =", rx(true, false, false, None), "="),
("search/256", sr(256), ""),
("search/1", sr(1), ""),
("search/256 =", sr(256), "="),
];
for &(input, ref expected_kind, expected_rest) in cases {
let (rest, (kind, op)) = parse_type_and_operator(input).expect(input);
assert_eq!(rest, expected_rest, "rest for input: {input}");
assert!(op.is_none(), "operator for input: {input}");
assert_eq!(&kind, expected_kind, "kind for input: {input}");
}
}

#[test]
fn test_parse_type_and_operator_search_requires_range() {
// Bare `search` (no /N suffix) is a hard parse error per GNU `file`.
assert!(parse_type_and_operator("search").is_err());
// `search/0` is also rejected -- `NonZeroUsize` makes a zero-width
// scan unrepresentable.
assert!(parse_type_and_operator("search/0").is_err());
}

#[test]
fn test_parse_type_and_operator_regex_invalid_suffix() {
// Bare slash with no flags or count
assert!(parse_type_and_operator("regex/").is_err());
// Unrecognized flag letter
assert!(parse_type_and_operator("regex/z").is_err());
// Non-operator trailing character is still rejected
assert!(parse_type_and_operator("regex/cz").is_err());
// regex/0 is rejected because a zero count has no valid semantics
// (our parser uses NonZeroU32 to express "user specified a count").
assert!(parse_type_and_operator("regex/0").is_err());
}

#[test]
fn test_parse_type_and_operator_regex_operator_adjacent() {
use crate::parser::ast::{Operator, RegexFlags, TypeKind};

// `regex/c=` should leave `=` for parse_operator, matching the `regex/c =`
// (space-separated) behavior and mirroring `search/256=`.
let (rest, (kind, op)) = parse_type_and_operator("regex/c=").expect("regex/c=");
assert_eq!(rest, "=");
assert!(op.is_none());
assert_eq!(
kind,
TypeKind::Regex {
flags: RegexFlags {
case_insensitive: true,
..RegexFlags::default()
},
count: None,
}
);

// `regex/l!=` should leave `!=` for parse_operator.
let (rest, (kind, op)) = parse_type_and_operator("regex/l!=").expect("regex/l!=");
assert_eq!(rest, "!=");
assert!(op.is_none());
assert_eq!(
kind,
TypeKind::Regex {
flags: RegexFlags {
line_based: true,
..RegexFlags::default()
},
count: None,
}
);

// Confirm the full pipeline parses the operator correctly through
// parse_type_and_operator + parse_operator chaining.
let (rest, (_, _)) = parse_type_and_operator("regex/c=foo").expect("regex/c=foo");
let (rest_after_op, op) = crate::parser::grammar::parse_operator(rest).expect("operator");
assert_eq!(op, Operator::Equal);
assert_eq!(rest_after_op, "foo");
}

#[test]
fn test_parse_magic_rule_regex_and_search() {
use crate::parser::ast::RegexFlags;
use std::num::{NonZeroU32, NonZeroUsize};

// regex/c: case-insensitive flag
let input = r#"0 regex/c "hello" case-insensitive match"#;
let (remaining, rule) = parse_magic_rule(input).unwrap();
assert_eq!(remaining, "");
assert_eq!(rule.offset, OffsetSpec::Absolute(0));
assert_eq!(
rule.typ,
TypeKind::Regex {
flags: RegexFlags {
case_insensitive: true,
..RegexFlags::default()
},
count: None,
}
);
assert_eq!(rule.op, Operator::Equal);
assert_eq!(rule.value, Value::String("hello".to_string()));
assert_eq!(rule.message, "case-insensitive match");

// search/256
let input = r#"0 search/256 "MZ" DOS executable"#;
let (remaining, rule) = parse_magic_rule(input).unwrap();
assert_eq!(remaining, "");
assert_eq!(
rule.typ,
TypeKind::Search {
range: NonZeroUsize::new(256).unwrap(),
}
);
assert_eq!(rule.op, Operator::Equal);
assert_eq!(rule.value, Value::String("MZ".to_string()));
assert_eq!(rule.message, "DOS executable");

// regex/1l: line-based with a count of 1 (mirrors regex-eol.magic
// syntax). The count is now preserved, not discarded.
let input = r#">1 regex/1l "[0-9]+" version line"#;
let (remaining, rule) = parse_magic_rule(input).unwrap();
assert_eq!(remaining, "");
assert_eq!(rule.level, 1);
assert_eq!(
rule.typ,
TypeKind::Regex {
flags: RegexFlags {
line_based: true,
..RegexFlags::default()
},
count: NonZeroU32::new(1),
}
);
assert_eq!(rule.message, "version line");
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Split new regex/search tests into a dedicated test module.

Coverage is solid, but this file is already far beyond the repository size guideline. Moving this block to src/parser/grammar/tests/regex_search.rs will keep future maintenance manageable.

As per coding guidelines: src/**/*.rs: Keep source files under 500-600 lines; split larger files into focused modules.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/parser/grammar/tests/mod.rs` around lines 2279 - 2446, Extract the
regex/search tests (functions
test_parse_type_and_operator_regex_and_search_suffixes,
test_parse_type_and_operator_search_requires_range,
test_parse_type_and_operator_regex_invalid_suffix,
test_parse_type_and_operator_regex_operator_adjacent,
test_parse_magic_rule_regex_and_search) into a new test file named
src/parser/grammar/tests/regex_search.rs, preserving all use/imports (e.g.
crate::parser::ast::{RegexFlags, TypeKind}, OffsetSpec, Operator, Value) and
helper functions rx and sr; remove those tests from the large mod.rs and add mod
regex_search; to the tests module so the new file is compiled, ensuring any
relative paths remain crate::... so nothing else needs adjusting, then run cargo
test to verify.

Comment thread src/parser/types.rs
Comment on lines +319 to +321
"search" => TypeKind::Search {
range: ::std::num::NonZeroUsize::new(1).expect("1 is nonzero"),
},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Remove expect() from library mapping path.

This is a constant path; avoid panic-based construction here.

Suggested patch
         "search" => TypeKind::Search {
-            range: ::std::num::NonZeroUsize::new(1).expect("1 is nonzero"),
+            range: ::std::num::NonZeroUsize::MIN,
         },

As per coding guidelines: src/**/*.rs: No unwrap() or expect() in library code.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"search" => TypeKind::Search {
range: ::std::num::NonZeroUsize::new(1).expect("1 is nonzero"),
},
"search" => TypeKind::Search {
range: ::std::num::NonZeroUsize::MIN,
},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/parser/types.rs` around lines 319 - 321, Replace the panic-based
construction in the TypeKind::Search mapping by removing the expect() call and
using a safe, explicit non-panicking construction; specifically, change
::std::num::NonZeroUsize::new(1).expect("1 is nonzero") to an explicit unsafe
construction like unsafe { ::std::num::NonZeroUsize::new_unchecked(1) } (with a
brief comment that 1 is statically non-zero) so the library code doesn't call
expect()/panic.

@coderabbitai coderabbitai Bot removed parser Magic file parsing components and grammar evaluator Rule evaluation engine and logic testing Test infrastructure and coverage labels Apr 11, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/src/ast-structures.md`:
- Around line 454-456: Update the documentation entry describing search anchor
behavior so it accurately states that the anchor advances to the end of the
matched pattern (match-end) rather than by the full search window; specifically
modify the bullet that currently reads "Anchor advances by the entire search
window" to "Anchor advances to the end of the matched pattern (match-end)".
Ensure the description still notes the return type `Value::String` for matched
bytes and that only `Equal` and `NotEqual` operators are supported so consumers
of the Search/anchor semantics have correct expectations.

In `@docs/src/parser.md`:
- Line 353: The docs sentence incorrectly lists "lookaheads" as an example of
zero-width matches; update the wording near the `Value::String("")` example so
it only mentions supported constructs (e.g., `^`, `$`, `a*`) and remove any
reference to look-around syntax, since `regex::bytes::Regex` does not support
lookaheads; ensure the sentence still explains that zero-width matches are
preserved as `Value::String("")` and distinguish them from genuine misses.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 947f1be6-454e-4b11-9a53-23e230972195

📥 Commits

Reviewing files that changed from the base of the PR and between 2756f8c and a13dd13.

⛔ Files ignored due to path filters (1)
  • docs/ARCHITECTURE.md is excluded by none and included by none
📒 Files selected for processing (3)
  • docs/src/ast-structures.md
  • docs/src/compatibility.md
  • docs/src/parser.md

Comment on lines +454 to +456
- Returns `Value::String` containing the matched bytes if found within range
- Anchor advances by the entire search window regardless of where the match was found
- Only supports `Equal` and `NotEqual` operators
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fix Search anchor advancement semantics.

The docs currently say the anchor advances by the full search window. This should state it advances to the end of the matched pattern (match-end), which is the implemented behavior.

As per coding guidelines: "docs/**: Review documentation quality, completeness, and accuracy. Ensure comprehensive API documentation, usage examples, and migration guides. Focus on user experience and developer onboarding."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/src/ast-structures.md` around lines 454 - 456, Update the documentation
entry describing search anchor behavior so it accurately states that the anchor
advances to the end of the matched pattern (match-end) rather than by the full
search window; specifically modify the bullet that currently reads "Anchor
advances by the entire search window" to "Anchor advances to the end of the
matched pattern (match-end)". Ensure the description still notes the return type
`Value::String` for matched bytes and that only `Equal` and `NotEqual` operators
are supported so consumers of the Search/anchor semantics have correct
expectations.

Comment thread docs/src/parser.md

- Patterns are compiled with multi-line mode always enabled (matching libmagic's unconditional `REG_NEWLINE`), so `^` and `$` match at line boundaries and `.` does not match `\n`.
- The scan window is always capped at 8192 bytes regardless of the `count` value.
- Zero-width matches (`^`, `a*`, lookaheads) are preserved as `Value::String("")` and distinguished from genuine misses.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Remove unsupported “lookaheads” from regex semantics examples.

regex::bytes::Regex does not support look-around syntax, so documenting lookaheads here is inaccurate and can mislead rule authors. Keep zero-width examples to supported constructs (e.g., ^, $, a*).

As per coding guidelines: "docs/**: Review documentation quality, completeness, and accuracy. Ensure comprehensive API documentation, usage examples, and migration guides. Focus on user experience and developer onboarding."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/src/parser.md` at line 353, The docs sentence incorrectly lists
"lookaheads" as an example of zero-width matches; update the wording near the
`Value::String("")` example so it only mentions supported constructs (e.g., `^`,
`$`, `a*`) and remove any reference to look-around syntax, since
`regex::bytes::Regex` does not support lookaheads; ensure the sentence still
explains that zero-width matches are preserved as `Value::String("")` and
distinguish them from genuine misses.

@unclesp1d3r unclesp1d3r merged commit c05dfdc into main Apr 11, 2026
30 checks passed
@unclesp1d3r unclesp1d3r deleted the 39-parser-implement-regex-and-search-types branch April 11, 2026 06:30
@unclesp1d3r unclesp1d3r restored the 39-parser-implement-regex-and-search-types branch April 11, 2026 06:31
@unclesp1d3r unclesp1d3r deleted the 39-parser-implement-regex-and-search-types branch April 11, 2026 06:31
unclesp1d3r added a commit that referenced this pull request Apr 11, 2026
Post-review fixes from PR #214 internal review round 3.

Grammar:

* `src/parser/grammar/mod.rs`: reject a second numeric count in a regex
  suffix as a hard parse error (e.g., `regex/1l2l`, `regex/1c2l`,
  `regex/l1l2`). Libmagic accepts these with a "multiple ranges"
  stderr warning; we prefer a hard error so magic-file bugs surface at
  parse time rather than silently using the last-seen count.

Tests:

* `src/parser/grammar/tests/mod.rs`: add
  `test_parse_type_and_operator_regex_rejects_duplicate_count` covering
  the three duplicate-count forms above plus a `regex/l0` rejection.
* `src/evaluator/types/regex.rs`: add 8192-byte boundary regression
  tests:
  - `test_read_regex_pattern_ending_exactly_at_byte_8192_matches` --
    pattern at indices 8186..8192 must match (last byte is 8191,
    inside the cap).
  - `test_read_regex_pattern_starting_at_byte_8192_misses` -- pattern
    at index 8192 (one byte past cap) must miss.
  - `test_read_regex_pattern_straddling_8192_boundary_misses` --
    pattern at indices 8190..8196 (first 2 bytes inside cap, last 4
    past) must miss.
  - `test_read_regex_line_based_respects_8192_cap` -- line mode with
    count=None on a 9000-byte buffer must cap at 8192.
* `src/evaluator/types/regex.rs`: add classic-Mac CR terminator
  regression tests:
  - `test_read_regex_line_based_bare_cr_terminator` -- bare `\r` alone
    counts as a single line terminator (matches GNU `file`'s
    memchr2('\n', '\r') loop).
  - `test_read_regex_line_based_mixed_terminators` -- buffer with LF,
    CRLF, and CR each counting as one line; count=2 stops after the
    second terminator (a CRLF) so line 3 is not scanned.

These close the "8192 boundary untested" and "bare-CR untested"
coverage gaps flagged in round 3 review. The duplicate-count rejection
closes the "silent last-wins" divergence from libmagic that was also
flagged.

Refs #39

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
unclesp1d3r added a commit that referenced this pull request Apr 11, 2026
Post-merge cleanup of doc drift found in PR #214 review round 3.

src/parser/ast.rs:

* Remove the "effective byte cap of `min(n * 80, 8192)`" clause from
  the `TypeKind::Regex::count` rustdoc. The `n * 80` heuristic does
  not exist in `compute_window`; the code applies only the flat 8192
  cap and walks line terminators. Reword to describe the actual
  behavior (8192 cap always applies regardless of count).

src/evaluator/types/mod.rs:

* Rewrite the `bytes_consumed_with_pattern` docstring section on
  pattern-bearing types. The old doc claimed Search's consumed
  distance "is the entire search window regardless of where the match
  was found" -- this is the pre-fix window-end bug, not the current
  match-end behavior. Rewrite to describe GNU `file` semantics: regex
  advances by `m.end()` (or `m.start()` with `/s`), search advances
  by `match_idx + pattern.len()`. Both arms fire `debug_assert!` on
  engine-invariant violations.
* Rewrite the `TypeReadError::UnsupportedType` variant doc. The old
  text listed regex/date/timestamp as "reserved for future types not
  yet evaluatable" -- all three are now implemented, and the variant
  is actively used for regex pattern compile errors, missing pattern
  operands, and non-equality operators on pattern-bearing types.
  Rewrite to describe the current reporting contract.

No code changes; this commit only corrects documentation that
contradicts the shipped implementation.

Refs #39

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
unclesp1d3r added a commit that referenced this pull request Apr 11, 2026
Post-merge coverage fix from PR #214 review round 3. All 8 corpus
tests in tests/regex_search_corpus_tests.rs previously used
programmatic rule construction (`regex_rule`/`search_rule`), so the
grammar-to-evaluator pipeline for the new regex/search flag syntax
was never tested end-to-end via `parse_text_magic_file`.

New tests:

* `test_json1_corpus_parser_roundtrip` — parses a plain
  `0 regex "^\\s*[\\{\\[]" JSON text data` magic rule via
  `parse_text_magic_file` and evaluates it against `json1.testfile`.
  Exercises the default-flags, no-count regex path from text through
  to the evaluator.
* `test_regex_flag_parser_roundtrip_case_insensitive` — same pattern
  with `regex/c` instead of `regex`. Confirms the `/c` flag threads
  through `parse_type_and_operator` into `RegexFlags::case_insensitive`.
* `test_search_parser_roundtrip_with_range` — `search/32 "ABC"` parsed
  from text and evaluated against `searchbug.testfile`. Exercises
  the `NonZeroUsize` range path through the grammar layer.
* `test_search_parser_rejects_bare_search` — magic-file text
  `search "ABC"` (no /N) must be a parse error, confirming the
  `NonZeroUsize` constraint is enforced at parse time, not just at
  programmatic construction time.

These close the "parser round-trip untested for new flag syntax"
coverage gap flagged in review. Existing `regex_rule`/`search_rule`
programmatic tests are kept for the searchbug hierarchical scan which
still depends on features the parser doesn't yet support
(`use`/`name`/`offset` directives).

Refs #39

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
unclesp1d3r added a commit that referenced this pull request Apr 11, 2026
…sign

Post-merge doc cleanup from PR #214 review round 3.

docs/solutions/.../implementing-variable-width-typekind-variant.md:

* Rewrite the `## Solution` section so the code blocks describe the
  final (V5) design instead of the obsolete V0 signatures. The old
  snippets showed `read_regex` with `start_of_line: bool`, `build_regex`
  with `^(?:{pattern})` wrapping, and `read_search` with
  `Option<usize>` — all three were removed in the completion pass but
  the solution doc was never updated. The `# Prevention` section was
  already correct; only the code blocks in `## Solution` drifted.
* New code blocks describe `build_regex` (unconditional multi-line +
  no dot-newline), `compute_window` (8192 cap + line-terminator walk),
  `read_regex`/`read_search` with their current signatures
  (`RegexFlags`, `Option<NonZeroU32>`, `NonZeroUsize`), and the
  `/s`-flag anchor-advance in `regex_bytes_consumed`.
* Rewrite the `## Testing` section to list the actual test coverage:
  line-based window tests, 8192 boundary guards, zero-width match
  distinction, `/s` flag tests, non-equality operator rejection,
  parser last-wins rejection, corpus integration tests.

AGENTS.md:

* Development Phases v0.3 entry used to list `/s`, `/l` line-count,
  `regex/Nl`, search range enforcement, and 8192 default as planned
  work. All are now implemented. Replace with the actual v0.3 scope:
  Aho-Corasick multi-pattern optimization, compiled-regex caching,
  and `!:mime`/`!:ext` directive evaluation.

docs/src/evaluator.md:

* The Type Reading section described `/l` as "multi-line start-of-line
  anchoring" — this is the pre-fix semantic, not the current one.
  Rewrite to list all three regex flags (`/c`, `/s`, `/l`) with
  correct meanings, note that multi-line mode is unconditional, and
  describe the 8192-byte hard cap. Also update the Search bullet to
  mention the mandatory `NonZeroUsize` range.

.github/copilot-instructions.md:

* Remove "regex type is planned for future releases and is not yet
  implemented (#39)" note — regex and search are both implemented.
* Remove the `BinaryRegex` trait sketch (binary-safe matching is done
  via `regex::bytes::Regex` directly, not through a wrapper trait).
* Update the Currently Implemented list to mention Regex and Search
  with their flag sets.
* Update the v1.0+ Planned Features list to drop regex/search items
  and add Aho-Corasick + named tests.

No code changes; this commit only brings documentation in line with
the shipped implementation.

Refs #39

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
unclesp1d3r added a commit that referenced this pull request Apr 11, 2026
Closes deferred review suggestion #22 from PR #214 round-3.

Before: `type_keyword_to_kind` returned a placeholder
`TypeKind::Search { range: NonZeroUsize::new(1).unwrap() }` for the
`search` keyword and a bare `TypeKind::Regex { .. }` for `regex`, both
of which were "safe" only because grammar/mod.rs always overwrites
them via direct construction before the value escapes. That's an
invariant-by-convention, which is exactly the anti-pattern this PR
has been trying to remove via `NonZeroUsize` and `NonZeroU32` in the
AST.

After: the function returns `Option<TypeKind>` and yields `None` for
both `regex` and `search`. Grammar's `_ => type_keyword_to_kind(type_name)`
branch now uses `.expect(...)` with an explanatory message that
documents the "regex/search handled directly in grammar arms above"
invariant — so a future caller who adds a new suffix-required type
without wiring the grammar arm first will fail loudly at the
`expect`, rather than silently producing a placeholder that gets
eaten by downstream code.

Tests updated:

- `test_type_keyword_to_kind_*` tests wrap each expected `TypeKind`
  in `Some(...)`.
- `test_type_keyword_to_kind_regex_and_search_return_none` is new —
  it pins both halves of the split invariant (parse_type_keyword
  recognizes the keyword, type_keyword_to_kind returns None).
- `test_pstring_keyword_defaults_to_one_byte_width` uses `.expect()`
  on the Option return.
- `test_roundtrip_all_keywords` is split into two loops: one for
  convertible keywords (must produce `Some`) and one for
  regex/search (must produce `None`), so both sides of the
  contract are asserted explicitly.

Refs #39

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
unclesp1d3r added a commit that referenced this pull request Apr 11, 2026
Closes deferred review suggestions #19 and #20 from PR #214 round-3.

## #19: RegexCount enum replaces line_based bool + count Option

The previous design encoded the three regex scan-window modes across
two separate fields:

  RegexFlags { line_based: bool, .. }
  TypeKind::Regex { count: Option<NonZeroU32>, .. }

This admitted four `(line_based, count)` combinations while libmagic
only has three semantic modes, leaving a degenerate
`line_based: true, count: None` state that in practice collapsed to
the same behavior as `line_based: false, count: None`. It also
forced readers to remember that `count` means bytes in one branch and
lines in another.

Replace both fields with a single `RegexCount` enum:

```rust
pub enum RegexCount {
    Default,                          // plain `regex` (8192-byte default)
    Bytes(NonZeroU32),                // `regex/N` (byte-bounded)
    Lines(Option<NonZeroU32>),        // `regex/l` or `regex/Nl`
}
```

- The three libmagic scan modes map 1:1 to the three variants.
- The degenerate `line_based: true, count: None` state is now
  unrepresentable: line-mode is exclusively the `Lines(_)` variant.
- The byte-vs-line distinction is encoded at the type level, so match
  arms at every call site (`compute_window`, strength calculation,
  codegen, property tests) read exhaustively rather than relying on
  a sibling `bool`.
- `Lines(None)` preserves the `regex/l` shorthand (no explicit line
  count), which walks terminators to the end of the 8192-byte capped
  window.

`RegexFlags` is now just `{ case_insensitive, start_offset }`. The
`/l` modifier moves from being a flag into being the `Lines` variant
of `RegexCount`.

Ripple:

- `compute_window` (`src/evaluator/types/regex.rs`): rewrite the
  branch dispatch to match on `RegexCount` instead of `flags.line_based`.
- `read_regex` and `regex_bytes_consumed`: take `RegexCount` instead of
  `(flags, Option<NonZeroU32>)`.
- `bytes_consumed_with_pattern`: pass `*count` through via `Copy`.
- `strength.rs`: match on `RegexCount` for the "explicit count = more
  specific = +5 bonus" logic.
- `parser/codegen.rs::serialize_type_kind`: emit the three variants
  explicitly with `NonZeroU32::new(n).expect("nonzero")`.
- `parser/grammar/mod.rs`: accumulate `line_based: bool` and
  `count_value: Option<NonZeroU32>` in the suffix loop, then collapse
  to the appropriate `RegexCount` variant at the end.
- `tests/property_tests.rs::arb_type_kind`: generator now produces all
  three variants with weighted odds.
- All test constructions updated across engine tests, types tests,
  grammar tests, corpus tests, and the evaluator_tests hierarchical
  rule fixture.

## #20: REGEX_MAX_BYTES relocated to evaluator::types::regex

The constant is runtime evaluation policy, not AST shape — it lives
where `compute_window` reads it. Moving it out of `parser::ast` also
removes a latent coupling to the build-script compilation unit
(`ast.rs` is shared with `build.rs` per GOTCHAS S1.1).

The constant is now `pub(crate)` in `src/evaluator/types/regex.rs`
with an in-comment explanation of the rationale. The rustdoc on
`TypeKind::Regex::count` used to link to `[REGEX_MAX_BYTES]` via
intra-doc link; rewritten to cite the "8192-byte cap" inline so the
AST module has no dependency on the evaluator constant.

Refs #39

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
unclesp1d3r added a commit that referenced this pull request Apr 11, 2026
Closes deferred review suggestion #23 from PR #214 round-3.

Before: grammar/mod.rs was 951 lines, past the 800-line hard cap
documented in AGENTS.md coding-style.md. The bulk of the growth was
the `parse_type_and_operator` function (over 250 lines) which handled
pstring, regex, search, and the attached bitwise-AND operator inline.

After: grammar/mod.rs is 793 lines. The extracted suffix parsers
live in a new `grammar/type_suffix.rs` submodule (232 lines) with
four functions:

- `parse_pstring_suffix(input)` — the `/[BHhLl][J]?` or bare `/J`
  pstring length-width parser. Previously inline as a standalone fn;
  now lives next to its peers.
- `parse_regex_suffix(input, suffix_rest)` — the "any order, last
  count wins but duplicates rejected" regex modifier parser. Returns
  `(RegexFlags, RegexCount)` with the tri-state collapse inlined.
- `parse_search_suffix(input, suffix_rest)` — the mandatory
  `NonZeroUsize` range parser for `search/N`.
- `parse_attached_operator(input)` — the post-type `&<mask>` /
  `&` / `&&` handler. This is the new extraction; `parse_type_and_operator`
  previously had ~20 lines of inline match-on-`strip_prefix('&')` logic.

`parse_type_and_operator` in grammar/mod.rs is now a thin
orchestrator: call `parse_type_keyword`, dispatch to the appropriate
suffix parser based on `type_name`, then `parse_attached_operator`,
then build the final `TypeKind` in a short match.

No behavior change. The old `parse_type_and_operator` had
`#[allow(clippy::too_many_lines)]`; that attribute is now removed
since the function fits under the default threshold.

Refs #39

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
unclesp1d3r added a commit that referenced this pull request Apr 11, 2026
## Summary

Follow-up cleanup for [PR
#214](#214) (which
closed #39). The review-round-3 findings on #214 landed after it merged,
so this PR carries the fixes as a separate unit of work.

All fixes are internal review findings — no user-visible behavior change
beyond the grammar now hard-rejecting duplicate regex counts.

## Changes

### Parser: reject duplicate regex counts (`5ca9a6f`)
- `src/parser/grammar/mod.rs`: `regex/1l2l`, `regex/1c2l`, `regex/l1l2`
are now hard parse errors. Libmagic accepts them with a `"multiple
ranges"` stderr warning; we prefer a hard error so magic-file bugs
surface at parse time rather than silently using the last-seen count.
- Test: `test_parse_type_and_operator_regex_rejects_duplicate_count`.

### Tests: 8192-byte boundary + bare-CR line terminator (`5ca9a6f`)
- Exact-boundary regression tests for the `FILE_REGEX_MAX` cap: pattern
ending at byte 8191 (must match), starting at 8192 (must miss),
straddling the boundary (must miss), line-based scan respecting the cap.
- Bare-CR (classic Mac) line terminator tests: `\r` alone counts as a
single terminator, mixed LF/CRLF/CR buffer terminator walking.

### Rustdoc: fix `RegexFlags::count` and `bytes_consumed_with_pattern`
drift (`5b9df07`)
- `src/parser/ast.rs`: remove the "effective byte cap of `min(n * 80,
8192)`" clause from `TypeKind::Regex::count` rustdoc. No such heuristic
exists in `compute_window`.
- `src/evaluator/types/mod.rs`: rewrite the
`bytes_consumed_with_pattern` docstring section on pattern-bearing
types. The old wording claimed Search advances by the window size — that
was the pre-fix bug, not current behavior.
- `src/evaluator/types/mod.rs`: rewrite the
`TypeReadError::UnsupportedType` variant doc. Old text listed
regex/date/timestamp as "reserved for future types not yet evaluatable";
all three are now implemented.

### Tests: parser round-trip corpus tests (`648f1a8`)
- `tests/regex_search_corpus_tests.rs`: 4 new tests that exercise the
grammar → evaluator pipeline for regex/search flag syntax via
`parse_text_magic_file`. Previously all corpus tests used programmatic
rule construction, so the parser-side of the new flag syntax was never
tested end-to-end.
- Covers: plain `regex` rule, `regex/c` case-insensitive flag,
`search/N` range, and bare `search` parse rejection.

### Docs: rewrite solution doc and clean up stale references (`aeb61d6`)
-
`docs/solutions/integration-issues/implementing-variable-width-typekind-variant.md`:
rewrite the `## Solution` section so the code blocks describe the final
(V5) design. Old snippets showed `read_regex` with `start_of_line:
bool`, `build_regex` with `^(?:{pattern})` wrapping, and `read_search`
with `Option<usize>` — all three were removed in the completion pass but
the doc was never updated.
- `AGENTS.md`: Development Phases v0.3 entry listed items that are
already implemented. Replaced with actual v0.3 scope (Aho-Corasick,
regex caching, `!:mime`/`!:ext` directives).
- `docs/src/evaluator.md`: Type Reading section described `/l` as
"multi-line start-of-line anchoring". Rewritten to list all three regex
flags correctly and note the 8192-byte cap.
- `.github/copilot-instructions.md`: remove "regex type is planned for
future releases" note; remove `BinaryRegex` trait sketch; update
Currently Implemented and Planned Features lists.

## Test Plan

- [x] `mise exec -- cargo nextest run` — 1218/1218 pass (+11 new
regression tests)
- [x] `mise exec -- cargo clippy --all-targets -- -D warnings` — clean
- [x] `mise exec -- cargo test --doc` — 167/167 doctests pass
- [x] `mise exec -- cargo fmt --check` — clean
- [x] `mise exec -- just ci-check` — full CI suite green locally
- [ ] CodeRabbit review clean
- [ ] Manual smoke: verify `regex/1l2l` returns a parse error (hit by
`test_parse_type_and_operator_regex_rejects_duplicate_count`)
- [ ] Manual smoke: verify a regex pattern past byte 8192 is NOT matched
by an un-counted `regex` rule (hit by the three 8192 boundary tests)

## Review Findings Addressed

From the round-3 review on PR #214 (after merge):

| # | Finding | Status |
|---|---|---|
| C1 | bare-CR line terminator doc + test | ✅ docstring updated, tests
added (`5ca9a6f`) |
| C2 | `n * 80` claim in `TypeKind::Regex::count` rustdoc | ✅ removed
(`5b9df07`) |
| C3 | `bytes_consumed_with_pattern` docstring drift | ✅ rewritten
(`5b9df07`) |
| C4 | Solution doc `## Solution` section drift | ✅ rewritten
(`aeb61d6`) |
| C5 | Silent "last count wins" on `regex/1l2l` | ✅ hard parse error
(`5ca9a6f`) |
| I7 | Missing 8192-byte boundary tests | ✅ 4 new tests (`5ca9a6f`) |
| I12 | Parser→evaluator round-trip untested for new flag syntax | ✅ 4
round-trip tests (`648f1a8`) |
| I13 | `/l` = "multi-line start-of-line anchoring" in
docs/src/evaluator.md | ✅ fixed (`aeb61d6`) |
| I14 | AGENTS.md v0.3 roadmap stale | ✅ fixed (`aeb61d6`) |
| I15 | `TypeReadError::UnsupportedType` variant doc | ✅ rewritten
(`5b9df07`) |
| I16 | `.github/copilot-instructions.md` stale regex reference | ✅
fixed (`aeb61d6`) |

Deferred (design-level, warrant their own discussion):
- Suggestion #19: `RegexCount { Default, Bytes, Lines }` enum refactor
- Suggestion #20: move `REGEX_MAX_BYTES` from `ast.rs` to
`evaluator::types::regex`
- Suggestion #21: rename `line_based` → `line_counted`
- Suggestion #22: `type_keyword_to_kind` signature change
- Suggestion #23: split `src/parser/grammar/mod.rs` (now 914 lines)

Refs #39

---------

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Co-authored-by: dosubot[bot] <131922026+dosubot[bot]@users.noreply.github.com>
@github-actions github-actions Bot mentioned this pull request Apr 25, 2026
unclesp1d3r pushed a commit that referenced this pull request Apr 25, 2026
## 🤖 New release

* `libmagic-rs`: 0.5.0 -> 0.6.0 (⚠ API breaking changes)

### ⚠ `libmagic-rs` breaking changes

```text
--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field MagicRule.value_transform in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:1189
  field MagicRule.value_transform in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:1189
  field MagicRule.value_transform in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:1189

--- failure copy_impl_added: type now implements Copy ---

Description:
A public type now implements Copy, causing non-move closures to capture it by reference instead of moving it.
        ref: rust-lang/rust#100905
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/copy_impl_added.ron

Failed in:
  libmagic_rs::mime::MimeMapper in /tmp/.tmpwFvgw1/libmagic-rs/src/mime.rs:98

--- failure enum_marked_non_exhaustive: enum marked #[non_exhaustive] ---

Description:
A public enum has been marked #[non_exhaustive]. Pattern-matching on it outside of its crate must now include a wildcard pattern like `_`, or it will fail to compile.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#attr-adding-non-exhaustive
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/enum_marked_non_exhaustive.ron

Failed in:
  enum OffsetSpec in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:198
  enum OffsetSpec in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:198
  enum OffsetSpec in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:198
  enum LibmagicError in /tmp/.tmpwFvgw1/libmagic-rs/src/error.rs:15
  enum LibmagicError in /tmp/.tmpwFvgw1/libmagic-rs/src/error.rs:15
  enum IoError in /tmp/.tmpwFvgw1/libmagic-rs/src/io/mod.rs:26
  enum Operator in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:838
  enum Operator in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:838
  enum Operator in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:838
  enum TypeReadError in /tmp/.tmpwFvgw1/libmagic-rs/src/evaluator/types/mod.rs:56
  enum ParseError in /tmp/.tmpwFvgw1/libmagic-rs/src/error.rs:74
  enum ParseError in /tmp/.tmpwFvgw1/libmagic-rs/src/error.rs:74
  enum Value in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:965
  enum Value in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:965
  enum Value in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:965
  enum TypeKind in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:398
  enum TypeKind in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:398
  enum TypeKind in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:398
  enum EvaluationError in /tmp/.tmpwFvgw1/libmagic-rs/src/error.rs:148
  enum EvaluationError in /tmp/.tmpwFvgw1/libmagic-rs/src/error.rs:148

--- failure enum_struct_variant_field_added: pub enum struct variant field added ---

Description:
An enum's exhaustive struct variant has a new field, which has to be included when constructing or matching on this variant.
        ref: https://doc.rust-lang.org/reference/attributes/type_system.html#the-non_exhaustive-attribute
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/enum_struct_variant_field_added.ron

Failed in:
  field base_relative of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:251
  field adjustment_op of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:266
  field result_relative of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:272
  field base_relative of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:251
  field adjustment_op of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:266
  field result_relative of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:272
  field base_relative of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:251
  field adjustment_op of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:266
  field result_relative of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:272

--- failure function_missing: pub fn removed or renamed ---

Description:
A publicly-visible function cannot be imported by its prior path. A `pub use` may have been removed, or the function itself may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/function_missing.ron

Failed in:
  function libmagic_rs::parser::grammar::is_empty_line, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:1025
  function libmagic_rs::parser::grammar::parse_strength_directive, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:846
  function libmagic_rs::parser::grammar::parse_type_and_operator, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:683
  function libmagic_rs::parser::grammar::parse_offset, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:179
  function libmagic_rs::parser::parse_offset, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:179
  function libmagic_rs::parser::grammar::parse_comment, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:1004
  function libmagic_rs::parser::grammar::parse_message, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:810
  function libmagic_rs::parser::grammar::parse_value, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:633
  function libmagic_rs::parser::grammar::parse_number, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:133
  function libmagic_rs::parser::parse_number, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:133
  function libmagic_rs::parser::grammar::has_continuation, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:1060
  function libmagic_rs::parser::grammar::parse_magic_rule, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:946
  function libmagic_rs::parser::grammar::parse_rule_offset, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:779
  function libmagic_rs::parser::grammar::is_comment_line, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:1042
  function libmagic_rs::parser::grammar::is_strength_directive, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:902
  function libmagic_rs::parser::grammar::parse_type, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:749
  function libmagic_rs::parser::grammar::parse_operator, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:227

--- failure function_parameter_count_changed: pub fn parameter count changed ---

Description:
A publicly-visible function now takes a different number of parameters.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#fn-change-arity
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/function_parameter_count_changed.ron

Failed in:
  libmagic_rs::evaluator::evaluate_single_rule now takes 3 parameters instead of 2, in /tmp/.tmpwFvgw1/libmagic-rs/src/evaluator/engine/mod.rs:196

--- failure inherent_method_missing: pub method removed or renamed ---

Description:
A publicly-visible method or associated fn is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/inherent_method_missing.ron

Failed in:
  FileBuffer::create_symlink, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/io/mod.rs:326
  EvaluationContext::increment_recursion_depth, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/evaluator/mod.rs:114
  EvaluationContext::decrement_recursion_depth, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/evaluator/mod.rs:130
  EvaluationContext::increment_recursion_depth, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/evaluator/mod.rs:114
  EvaluationContext::decrement_recursion_depth, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/evaluator/mod.rs:130

--- failure module_missing: pub module removed or renamed ---

Description:
A publicly-visible module cannot be imported by its prior path. A `pub use` may have been removed, or the module may have been renamed, removed, or made non-public.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/module_missing.ron

Failed in:
  mod libmagic_rs::parser::grammar, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:4

--- failure struct_marked_non_exhaustive: struct marked #[non_exhaustive] ---

Description:
A public struct has been marked #[non_exhaustive], which will prevent it from being constructed using a struct literal outside of its crate. It previously had no private fields, so a struct literal could be used to construct it outside its crate.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#attr-adding-non-exhaustive
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/struct_marked_non_exhaustive.ron

Failed in:
  struct EvaluationConfig in /tmp/.tmpwFvgw1/libmagic-rs/src/config.rs:42
```

<details><summary><i><b>Changelog</b></i></summary><p>

<blockquote>

## [0.6.0] - 2026-04-25

### Features

- **parser**: Add Date and QDate types with serialization support
([#165](#165))
- **parser**: Implement pstring (Pascal string) type
([#170](#170))
- **parser**: Implement pstring multi-byte length prefix variants (/B,
/H, /h, /L, /l, /J)
([#183](#183))
- **evaluator**: Add debug-level tracing for skipped rules
([#184](#184))
- **evaluator**: Implement indirect offset resolution
([#37](#37))
([#199](#199))
- **evaluator**: Implement relative offset resolution
([#38](#38))
([#211](#211))
- **deps**: Add new skills to actionbook/rust-skills and
trailofbits/skills
- **evaluator**: Regex and search types (closes #39)
([#214](#214))
- Implement libmagic meta-type directives and format substitution
([#42](#42))
([#230](#230))

### Bug Fixes

- **regex**: PR #214 follow-up review findings
([#215](#215))
- Load and correctly evaluate /usr/share/file/magic/filesystems and
adjacent magic files
([#233](#233))

### Documentation

- **gotchas**: Clarify requirements for adding TypeKind variants

### Miscellaneous Tasks

- Rename .coderabbitai.yaml to .coderabbit.yaml
- **Mergify**: Configuration update
([#173](#173))
- Update .gitignore to exclude local AI assistant files
- **mergify**: Upgrade configuration to current format
([#205](#205))
- Resolve all pending TODO items
([#212](#212))
- **mergify**: Upgrade configuration to current format
([#231](#231))
<!-- generated by git-cliff -->

### Security

- **io**: Close TOCTOU race in `FileBuffer::new` metadata validation
(CWE-367). `validate_file_metadata` now uses `File::metadata()` on the
open descriptor instead of re-canonicalizing the path, so an attacker
cannot swap the path between `open_file` and validation. Error paths now
report the caller-supplied path rather than the canonicalized variant.
- **cli**: Remove relative-path fallbacks from `default_magic_file_path`
(CWE-426). `./missing.magic`, `./third_party/magic.mgc`, and the
`CI`/`GITHUB_ACTIONS` env-var branch no longer resolve against the
process cwd. CI pipelines must pass `--magic <path>` explicitly.
- **evaluator**: `build_regex` now bounds `size_limit` and
`dfa_size_limit` to 1 MiB (`REGEX_COMPILE_SIZE_LIMIT`) to reject
compile-time DoS patterns (CWE-1333) from adversarial magic files.

### Features

- **parser**: Implement meta-type directives: `name`/`use` subroutines,
`default`/`clear` per-level fallback, and `indirect` re-evaluation.
`parse_text_magic_file` now returns `ParsedMagic { rules, name_table }`
(breaking change from `Vec<MagicRule>`). Named subroutines are hoisted
into `NameTable` at load time and dispatched via `RuleEnvironment` in
the evaluator. Recursion is bounded by
`EvaluationConfig::max_recursion_depth`. Resolves
[#42](#42).
- **evaluator**: Thread-local regex compile cache eliminates the
double-compile paid by every successful regex match.
`regex_bytes_consumed` now reuses the compiled `Regex` from `read_regex`
instead of recompiling the pattern to derive the anchor advance. The
cache is reset at the start of every `evaluate_rules_with_config` call,
bounding memory to one evaluation.
- **config**: `EvaluationConfig` is now `#[non_exhaustive]`; new
builder-style setters (`with_max_recursion_depth`,
`with_max_string_length`, `with_stop_at_first_match`, `with_mime_types`,
`with_timeout_ms`) let external crates construct configurations without
struct literals.
- **parser**: `MagicRule::new()` smart constructor with
`::with_children()`, `::with_strength_modifier()`, `::with_level()`
builder methods and a `::validate()` method enforcing structural
invariants (non-empty message, `level <= MAX_LEVEL`, children nested
strictly deeper than parent). New `MagicRuleValidationError` error type.
- **parser**: `RegexFlags::with_case_insensitive()` and
`::with_start_offset()` builder methods.

### Refactor

- **engine**: Extract `evaluate_pattern_rule()` and
`evaluate_value_rule()` helpers from
`evaluate_single_rule_with_anchor`'s 90-line body. Dispatch is now a
two-arm type-category split; each helper has focused rustdoc on
semantics and invariants.
- **types**: Replace the `_ =>` catch-all in
`bytes_consumed_with_pattern` with an explicit listing of the
fixed-width `TypeKind` variants. Adding a new variable-width variant
without updating this match is now a compile error instead of a silent
relative-offset anchor corruption in release builds.
- **parser**: Split the 185-line `type_keyword_to_kind` match into
per-family helpers (`byte_family`, `short_family`, `long_family`,
`quad_family`, `float_family`, `double_family`, `date_family`,
`qdate_family`, `string_family`). Drops the
`#[allow(clippy::too_many_lines)]` attribute.
- **main**: `main()` returns `std::process::ExitCode` instead of calling
`process::exit`, so destructors run on the happy path. Ctrl-C
`AtomicBool` flag uses `Ordering::Relaxed` instead of `SeqCst`.
- **grammar**: `parse_strength_directive` uses nom 8's `preceded` +
`Parser::map` instead of the legacy `map(pair(char(...), parse_number),
|(_, n)| ...)` pattern.
- **output**: Add `#[serde(skip_serializing_if = "Option::is_none",
default)]` to public `Option<T>` fields so JSON output no longer emits
`"field": null` for unset optional values.

### Documentation

- **lib**: Add `# Security` sections to
`MagicDatabase::with_builtin_rules`, `::with_builtin_rules_and_config`,
`::load_from_file`, and `::load_from_file_with_config` warning about the
unbounded default timeout and recommending
`EvaluationConfig::performance()` for untrusted input.
- **lib**: Document `MagicDatabase: Send + Sync` for parallel scanning.
- **README**: Update `TypeKind` enum example to match the current AST,
add `regex` and `search/N` to the supported types table, add pre-1.0 API
stability warning, correct the roadmap to mark v0.2-v0.4 as shipped.
- **AGENTS.md**: Relabel "Currently Implemented (v0.1.0)" and "Current
Limitations (v0.1.0)" to v0.5.0 and rewrite the Development Phases
section to reflect actual shipped scope.

### Testing

- Security regression tests for S-H1 (planted-magic-file in cwd), S-H2
(TOCTOU path-swap contract), S-M2 (pathological regex bounded runtime),
S-L2 (codegen message escape round-trip), and GOTCHAS S13.1
(`EvaluationConfig::default()` unbounded timeout invariant).
- Backspace message concatenation regression tests for first-match,
consecutive, and empty-rest edge cases.
- `MagicRule::validate()` tests covering empty message, child level
invariant, and max-depth rejection.
- `RegexCache` population/clear/reuse tests.

### Breaking Changes

- **parser**: `parse_text_magic_file` return type changed from
`Result<Vec<MagicRule>, ParseError>` to `Result<ParsedMagic,
ParseError>`. Callers must destructure `ParsedMagic { rules, name_table
}`. Low-level callers that only need the rule list can use
`parsed.rules`. `load_magic_file` and `load_magic_directory` return the
same new type.
</blockquote>


</p></details>

---
This PR was generated with
[release-plz](https://github.com/release-plz/release-plz/).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parser: implement regex and search types

2 participants