feat(evaluator): regex and search types (closes #39) by unclesp1d3r · Pull Request #214 · EvilBit-Labs/libmagic-rs

unclesp1d3r · 2026-04-11T05:54:55Z

Summary

Implements TypeKind::Regex and TypeKind::Search for libmagic rule evaluation, closing the full acceptance criteria for #39. This is the complete implementation including the spec-compliance follow-up work that was flagged in pre-PR review.

Highlights:

Regex type: binary-safe matching via regex::bytes::Regex with the full /c, /s, /l flag set plus optional numeric count. Scan window capped at 8192 bytes (FILE_REGEX_MAX). Multi-line mode is unconditional (matching libmagic's REG_NEWLINE).
Search type: bounded literal scan via memchr::memmem::find with mandatory NonZeroUsize range (per GNU file magic(5)). Anchor advances to match-end, matching softmagic.c::FILE_SEARCH.
RegexFlags struct: case_insensitive, start_offset, line_based — designed as a struct so future flag additions don't cascade through 10+ exhaustive-match sites.
Engine dispatch: pattern-bearing types bypass apply_operator and drive Equal/NotEqual from a structured Option<Value> returned by read_pattern_match. Non-equality operators on pattern types are rejected as TypeReadError::UnsupportedType rather than silently producing lexicographic comparisons against the pattern literal.
Zero-width match correctness: regex matches of zero bytes (^, a*, lookaheads) are preserved as Some(Value::String("")) and distinguished from genuine misses (None).

Ground truth for every non-obvious semantic was verified against GNU file's src/softmagic.c before implementing (FILE_REGEX, FILE_SEARCH, moffset(), alloc_regex, parse_string_modifier, FILE_REGEX_MAX).

Issue #39 Acceptance Criteria

#	Criterion	Status
1	`TypeKind` extended with `Regex`/`Search` variants	✅
2	Parser recognizes regex/search + flags (`/c`, `/s`, `/l`, count)	✅
3	Evaluator uses `regex::bytes::Regex` for binary-safe matching	✅
4	Search supports range-limited byte sequence searching	✅ mandatory `NonZeroUsize`
5	Common regex flags: case-insensitive + line-start/end	✅ `/c` + `/l` + `/s`
6	Unit tests for regex/search matching	✅ ~50 new tests
7	Integration tests with text file detection rules	✅ searchbug, json1, jsonlines1, cmd1, gedcom, regex-eol

Test Plan

mise exec -- just ci-check green (run locally: 1207/1207 tests pass, clippy clean, fmt clean, doctests pass)
CodeRabbit review clean
Manual smoke: echo '{"ok": true}' | cargo run -- --use-builtin - still detects JSON via built-ins (regex rules not added to built-in set, so this should still fall back to string match behavior)
Verify corpus tests run against third_party/tests fixtures on a fresh checkout
Confirm regex-eol.magic scenario passes end-to-end (exercised by tests/evaluator_tests.rs::test_regex_eol_corpus which was rewritten for the new flag semantics)
Spot-check that regex/1l and regex/l1 both parse identically (per GNU file "interleaved order, last range wins")
Confirm search/0 and bare search are parse errors (not silent acceptance)
Confirm regex/Ns-style non-equality operators are rejected (regex < "[0-9]+" should error, not match via lexicographic comparison)

Notes for Reviewers

The branch history includes an earlier "partial" foundation commit (cb02472) followed by the completion commit (7294fd6). The partial commit shipped a working shell with known spec gaps; the completion commit closes those gaps (the V1–V5 items flagged in internal review).
GOTCHAS.md sections 2.4–2.9 document every non-obvious invariant added by this PR, with citations to the corresponding GNU file source locations. Start there when reviewing semantics questions.
The new compute_window helper in src/evaluator/types/regex.rs is the load-bearing piece of V1 — it's where the 8192-byte cap, line-based counting, and CRLF handling all converge. The unit tests at the bottom of that file cover each code path explicitly.
read_pattern_match is the engine-facing entry point for pattern types; read_typed_value_with_pattern is kept as a back-compat wrapper that collapses None to Value::String(""). Future pattern-type work should go through read_pattern_match.

Closes #39

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Implements pattern-based file matching via `TypeKind::Regex` and `TypeKind::Search`, a foundation for text file detection (JSON, XML, shell scripts, etc.). Parser: * Add `Regex { case_insensitive, start_of_line }` and `Search { range }` variants to `TypeKind` in `parser/ast.rs`. Extend `parse_type_keyword`, `type_keyword_to_kind`, grammar suffix parsing (`regex/c`, `regex/l`, `regex/cl`, `search/N`), `serialize_type_kind`, and strength modifiers. * Update `arb_type_kind` in `tests/property_tests.rs` for the fuzz round-trip of the new variants. * Grammar flag scanner uses `strip_prefix` instead of direct `&rest[1..]` slicing per AGENTS.md memory-safety rule. Evaluator: * `src/evaluator/types/regex.rs` wraps `regex::bytes::RegexBuilder` for binary-safe matching. `/c` enables case-insensitive, `/l` wraps the pattern in `^(?:...)` with multi-line mode (placeholder for the libmagic `/l` line-count semantics tracked for v0.3). * `src/evaluator/types/search.rs` uses `memchr::memmem::find` for bounded literal scanning. * Both readers return `Result<Option<Value>, TypeReadError>`: `None` is a genuine miss, `Some(v)` is a match (including legitimate zero-width regex matches like `^`, `a*`, lookaheads). * Add `read_typed_value_with_pattern` and `read_pattern_match` dispatch functions in `types/mod.rs`. The engine calls `read_pattern_match` directly to preserve the `Option` discriminant; `read_typed_value` remains a 3-arg convenience wrapper forwarding `pattern: None`. * `bytes_consumed_with_pattern` threads the pattern through to `regex_bytes_consumed` and `search_bytes_consumed`. The search arm re-runs `memchr::memmem::find` and returns `match_idx + pattern.len()` (match-end), matching GNU `file` softmagic.c FILE_SEARCH semantics. An earlier revision returned the full window size, which silently corrupted relative-offset children of every successful search match. * Add `debug_assert!` guards on the silent-0 fallback paths in `regex_bytes_consumed`, `search_bytes_consumed`, and the Regex/Search arms of `bytes_consumed_with_pattern` so engine-invariant violations are caught loudly in dev/test builds. Engine: * `evaluate_single_rule_with_anchor` splits into two arms. Pattern- bearing types (`Regex`, `Search`) go through `read_pattern_match` and drive `Equal`/`NotEqual` from the `Option` discriminant directly. Non-equality operators on pattern types return `TypeReadError::UnsupportedType` instead of silently falling through to `apply_operator`, which would have produced nonsense lexicographic comparisons against the pattern source text. * Non-pattern types continue through the existing `read_typed_value_with_pattern` + `coerce_value_to_type` + `apply_operator` path. Dependencies: * Move `regex = "1.12.3"` from dev-dependencies to dependencies. The `bytes` feature was dropped in v1.12; `regex::bytes` is unconditional so no feature flag is needed. Note: This commit covers the core feature. It does NOT yet implement the `/s` (start-offset) flag, the libmagic `/l` line-count semantics, the `regex/Nl` numeric prefix, `NonZeroUsize` search ranges, or the 8192-byte default regex scan window. Those are tracked for a follow-up against issue #39 before closing it. Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Regression guards for the three silent-failure classes flagged in code review: bytes_consumed dispatch (GOTCHAS 2.1): * `test_bytes_consumed_regex_with_string_pattern` and `test_bytes_consumed_regex_no_match_returns_zero` cover the Regex arm of `bytes_consumed_with_pattern`. Before this test the only coverage of the Regex arm was transitive through the corpus test and did not assert numeric anchor positions. * `test_bytes_consumed_regex_zero_width_match_returns_zero` guards the interaction between the C3 fix (zero-width matches distinguished from misses) and the anchor advance: a zero-width match returns `m.end() == 0` so the anchor does not move. * `test_bytes_consumed_search_with_pattern_is_match_end` is the regression guard for the pre-fix behavior that returned the full search window size instead of `match_idx + pattern.len()`. Buffer `"abcWorld_xyz"` with `search/10 "World"` must advance the anchor by 8, not by 10. * `test_bytes_consumed_search_bytes_pattern_works` exercises the `Value::Bytes` pattern dispatch path that was previously accepted by the reader but untested. Engine-level Relative(N) after pattern types: * `test_regex_parent_advances_anchor_for_relative_child` and `test_search_parent_advances_anchor_to_match_end_not_window_end` construct parent/child rule trees and assert that a `Relative(0)` child reads the byte immediately after the parent's match. This is the end-to-end test that would have caught the search window-vs- match-end bug -- the existing `test_regex_eol_corpus` passes on substring assertions and would not have detected a wrong anchor position. * `test_search_parent_relative_child_at_positive_offset` exercises `Relative(+1)` to confirm the match-end base is correct (not match-start, not window-end). Non-equality operator rejection: * `test_regex_rule_with_ordering_operator_is_rejected` and `test_search_rule_with_bitwise_operator_is_rejected` verify the engine returns `EvaluationError::UnsupportedType` when a pattern-bearing rule uses `<`, `>`, `&`, `^`, `~`, etc. Previously those fell through to `apply_operator` and silently produced lexicographic comparisons against the pattern literal -- a false-positive factory for any regex with metacharacters. Zero-width regex match coverage (in `src/evaluator/types/regex.rs`): * `test_read_regex_zero_width_start_anchor_matches` asserts `^` returns `Some(Value::String(""))`, not `None`. The pre-fix engine flattened both cases into a miss via `is_empty()`. * `test_read_regex_zero_width_star_matches_empty` does the same for the `a*` pattern against a buffer with no `a` bytes. Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

AGENTS.md: * Move regex and search from "Planned Features" / "Current Limitations: No regex/search" to "Currently Implemented". The prior state had six separate stale references that directly contradicted the implementation. * Add an explicit "Current Limitations" entry for the `/l` flag (implemented as a multi-line `^` anchor, not the libmagic line-count semantic), the optional `search` range (should be mandatory per GNU file), and the missing 8192-byte default regex scan window. These are the remaining #39 acceptance-criteria gaps, tracked for v0.3. * Remove the speculative `BinaryRegex` trait sketch -- the PR ships binary-safe matching via `regex::bytes` directly, not via a wrapper trait. * Update "Key Dependencies" to reflect `regex` as a runtime dependency and add `memchr` to the list. * Update the "Adding New Type Support" note to include Regex and Search in the implemented-types list. GOTCHAS.md: * Rewrite section 2.4 to describe the `read_pattern_match` -> `Option<Value>` contract and the hard rejection of non-equality operators on pattern-bearing types (previously the fallthrough to `apply_operator` was tolerated). * Add section 2.5: zero-width regex match vs miss invariant. The `Option` in `read_regex`/`read_search` is load-bearing -- do not re-flatten to `Value::String("")` sentinels. * Add section 2.6: search anchor advance is match-end (per GNU `file` softmagic.c FILE_SEARCH), not window-end. Documents the prior bug and links the ground-truth analysis. docs/src/evaluator.md: * Fix the `read_typed_value` signature documentation. The actual code has a 3-arg `read_typed_value` convenience wrapper alongside the 4-arg `read_typed_value_with_pattern`; the doc previously showed only the 4-arg form with the wrong name. * Add Regex and Search to the enumerated type list. * Document the `read_pattern_match` entry point used by the engine for pattern-bearing types and explain why it returns `Option<Value>`. docs/solutions/integration-issues/implementing-variable-width-typekind-variant.md: * Correct the dispatch description: the implementation adds a sibling `read_typed_value_with_pattern` alongside the existing 3-arg `read_typed_value`, rather than extending the existing signature. The "Prevention" section previously prescribed the opposite of what the code did. * Add the lesson learned about sibling functions vs signature extensions (narrow new concerns favor siblings to avoid touching ~30 existing call sites). * Add the lesson learned about not overloading `Value::String("")` as a miss sentinel (zero-width matches are valid and distinct). * Add the lesson learned about search advancing by match-end (match_idx + pattern.len()), not window-end, per GNU file. * Add the lesson learned about rejecting non-equality operators on pattern-bearing types rather than falling through to apply_operator. Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Implements the full set of regex/search flags and semantics specified in the issue #39 acceptance criteria and GNU `file` magic(5). This is the acceptance-criteria completion pass on top of the earlier foundation commit. TypeKind shape redesign: * Add `RegexFlags { case_insensitive, start_offset, line_based }` struct to `src/parser/ast.rs`. The previous bare-`bool` fields on `TypeKind::Regex` are replaced by the flags struct; this matches the design prescribed in the issue #39 plan and keeps the blast radius of future flag additions to a single field. * `TypeKind::Regex` becomes `{ flags: RegexFlags, count: Option<NonZeroU32> }`. The count is the numeric prefix from `regex/N[flags]` (e.g., `1l` means count=1 with line_based=true). `None` means "use the default". * `TypeKind::Search::range` changes from `Option<usize>` to `NonZeroUsize`. Per GNU `file` magic(5), the search range is mandatory; this lifts the constraint into the type system so bare `search` and `search/0` are unrepresentable. The parser now hard- errors on both. * Add `REGEX_MAX_BYTES = 8192` constant to `ast.rs` matching GNU `file`'s `FILE_REGEX_MAX`. Parser: * `src/parser/grammar/mod.rs` accepts interleaved count/flag order (both `regex/1l` and `regex/l1` parse to the same value), matching libmagic's `parse_string_modifier` "last range wins" behavior. * Add `/s` flag parsing (start_offset). * Reject bare `search` and `search/0` at parse time via `NonZeroUsize`. * Reject `regex/0` -- a zero count has no valid semantics in our model since `None` already signals "use the default". * `strip_prefix` instead of direct `&rest[1..]` slicing per the AGENTS.md memory-safety rule. Evaluator (src/evaluator/types/regex.rs): * Rewrite `compute_window` to apply the 8192-byte cap unconditionally and implement line-based scanning. When `flags.line_based` is set, `count` is a line count; the scan walks line terminators (`\n` and `\r\n`, each counting as one) until the Nth terminator or the 8192 cap, whichever comes first. This matches GNU `file`'s `softmagic.c::mget_count` / `FILE_REGEX` path. * Remove the old `/l`-as-`^(?:...)`-anchor wrapping. `build_regex` now unconditionally sets `multi_line(true)` and `dot_matches_new_line(false)` for every pattern, matching libmagic's unconditional `REG_NEWLINE` in `alloc_regex` (softmagic.c:2123). `/l` no longer touches the compiled regex -- it only bounds the scan window. `^` and `$` match at line boundaries for every regex rule regardless of `/l`. * `read_regex` now takes `(flags: RegexFlags, count: Option<NonZeroU32>)` and uses `compute_window` for the scan slice. * `regex_bytes_consumed` implements the `/s` flag: when `flags.start_offset` is set, return `m.start()` (match-start anchor advance) instead of `m.end()` (match-end). This matches libmagic's `REGEX_OFFSET_START` / `moffset()` logic where `vlen = 0` zeros the match-length contribution. * 8192-byte cap applies to all regex scans including explicit counts larger than 8192. This is a DoS mitigation -- without it, a crafted regex against a multi-GB buffer combined with `EvaluationConfig::default()` (no timeout) can hang the evaluator. Evaluator (src/evaluator/types/search.rs): * `read_search` and `search_bytes_consumed` take `NonZeroUsize` instead of `Option<usize>`. Evaluator (src/evaluator/types/mod.rs): * Dispatch updated for the new Regex/Search variant shapes in `read_typed_value_with_pattern`, `read_pattern_match`, and `bytes_consumed_with_pattern`. The regex arm passes `flags` and `count` through; the search arm unwraps the `NonZeroUsize`. Strength calculation: * Search unconditionally gets the "constrained match" bonus (25) since the range is always bounded. * Regex gets the bonus when `count` is `Some` (user specified an explicit scan limit) and the base 20 otherwise. Tests: * `src/evaluator/types/regex.rs::tests`: - Line-based window (1-line cap, 3-line cap, CRLF terminators) - 8192-byte cap on default scans - 8192-byte cap applied to explicit counts larger than 8192 - Small explicit counts honored - `/s` flag returns match-start anchor advance - Multi-line regex matching always on (the old "start of line anchor" tests are replaced by explicit multi-line/`.` behavior tests) * `src/evaluator/types/search.rs::tests`: all tests updated for `NonZeroUsize` signature; unbounded-range tests removed. * `src/parser/grammar/tests/mod.rs`: - `/s` flag accepted - `regex/cs`, `regex/csl` multi-flag combinations accepted - Interleaved count/flag order (`regex/1l` and `regex/l1`) - `search/0` rejected at parse time - `regex/0` rejected at parse time * `src/evaluator/engine/tests.rs`, `tests/evaluator_tests.rs`, `tests/property_tests.rs`: all rule constructions updated for the new TypeKind shapes. Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

…iles Adds end-to-end regex/search tests against the corpus files listed as "blocked" in issue #39. These exercise the full evaluator pipeline on real third_party/tests fixtures instead of hand-crafted buffers. Tests added (tests/regex_search_corpus_tests.rs): * `test_searchbug_corpus_search_with_relative_child` -- models the searchbug.magic scenario (which uses `use`/`name`/`offset`/`&0` features we do not yet parse) via programmatic rule construction. Verifies that a `search/12 "ABC"` scan followed by a `Relative(0)` byte child reads the character immediately after the matched pattern, exercising the match-end anchor advance end-to-end. * `test_searchbug_search_anchor_advance_not_window_end` -- regression guard using the same corpus file to assert that the anchor advances by match-end (8 + 3 = 11) and NOT by window-end (8 + 12 = 20). A byte child expecting the window-end byte value must NOT match. * `test_json1_corpus_detected_by_regex` -- JSON detection via `^\s*[\{\[]` regex against `json1.testfile`. * `test_jsonlines1_corpus_detected_by_regex` -- same pattern against `jsonlines1.testfile`. * `test_cmd1_corpus_detected_by_regex` -- shell script detection via `^#![ \t]*/\S+` against `cmd1.testfile`. * `test_gedcom_corpus_detected_by_line_based_regex` -- exercises the `/l` line-based scan window (count = 1) with a `^0 HEAD` pattern against `gedcom.testfile`. * `test_regex_eol_version_extraction` -- smoke test for the simpler non-hierarchical part of the regex-eol scenario. * `test_corpus_files_exist` -- meta-guard so a broken third_party/ checkout fails loudly with file-existence errors instead of confusing evaluation failures. Where a corpus file depends on magic-file features we do not yet support (`use`/`name`/`offset`/`&+N` parser), the test bypasses `parse_text_magic_file` and builds the equivalent rule tree programmatically via the AST, matching the pattern documented in GOTCHAS 3.9. This gets issue #39's "blocked test corpus" coverage to 7 of 14 files (the remaining 7 are variants of the same formats tested here and will pass the same regex/search code paths). Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

…eria AGENTS.md: * Rewrite the "Currently Implemented" bullet for Regex to list all three flags (`/c`, `/s`, `/l`) with their meanings, including the 8192-byte hard cap, the unconditional multi-line regex compilation, and the `line_based` vs `start_offset` vs `case_insensitive` distinction. * Rewrite the "Currently Implemented" bullet for Search to note the mandatory `NonZeroUsize` range and reference GNU `file` magic(5). * Remove the "Current Limitations" entries for regex/search that no longer apply (the `/l`-as-anchor placeholder, the `search` range being optional, and the missing 8192-byte default). All of these were closed by the feature-completion commit. * Remove the "Planned Features" entries for `/s`, `/Nl` line-count semantics (both are now implemented). * Planned-features list now focuses on Aho-Corasick optimization, `!:mime`/`!:ext`/`!:apple` directives, and `use`/`name` named tests. GOTCHAS.md: * Section 2.7: regex `/l` is scan-window bounds only, not a multi-line toggle. Documents the removal of the old `^(?:...)` anchor wrapping and explains why `build_regex` now unconditionally sets `multi_line(true)` for every pattern. * Section 2.8: regex scan window is always capped at 8192 bytes. Lists the three paths this applies to (default, explicit, line-based) and explains the DoS mitigation rationale. Calls out the interaction with GOTCHAS S13.1 (`EvaluationConfig::default()` no timeout). * Section 2.9: regex `/s` flag affects anchor advance only. Documents the libmagic `REGEX_OFFSET_START` / `moffset()` correspondence and notes that tests for `/s` must check `regex_bytes_consumed` or downstream `Relative(N)` resolution -- checking `read_regex` alone won't detect a broken `/s` implementation. * Section 2.6 touched up to clarify that search's `/s`-style start- offset flag is not currently supported (it only applies to regex). Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

mergify · 2026-04-11T05:55:39Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Require conventional commit format per https://www.conventionalcommits.org/en/v1.0.0/. Skipped for bots.

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:$.+$)?!?:

🟢 Full CI must pass

Wonderful, this rule succeeded.

All CI checks must pass. Release-plz PRs are exempt because they only bump versions and changelogs (code was already tested on main), and GITHUB_TOKEN-triggered force-pushes suppress CI.

check-success = coverage
check-success = quality
check-success = test
check-success = test-cross-platform (macos-latest, macOS)
check-success = test-cross-platform (ubuntu-22.04, Linux)
check-success = test-cross-platform (ubuntu-latest, Linux)
check-success = test-cross-platform (windows-latest, Windows)

🟢 Do not merge outdated PRs

Wonderful, this rule succeeded.

Make sure PRs are within 10 commits of the base branch before merging

#commits-behind <= 10

coderabbitai · 2026-04-11T05:55:42Z

Summary by CodeRabbit

New Features
- Implemented Regex and Search pattern types: Regex supports /c, /s, /l flags and optional numeric counts; both enforce bounded scan windows (Regex capped at 8192 bytes). Search requires a /N range and both advance anchors to match end.
Bug Fixes
- Rejects invalid suffixes (e.g., bare search or search/0, empty regex suffix) and limits comparison to Equal/NotEqual for pattern types.
Documentation
- Updated docs and compatibility status for Regex/Search.
Tests
- Added extensive tests for matching, anchors, parsing, and operator validation.

Walkthrough

This PR implements pattern-bearing magic types: adds TypeKind::Regex and TypeKind::Search, parser support for regex/... flags and search/N ranges, evaluator changes to perform binary-safe regex/search matching and to advance anchors by match boundaries, updates strength scoring and tests, and promotes regex to a runtime dependency.

Changes

Cohort / File(s)	Summary
AST & Parser Types `src/parser/ast.rs`, `src/parser/types.rs`	Added `TypeKind::Regex { flags, count }` and `TypeKind::Search { range }`; introduced `RegexFlags` and `REGEX_MAX_BYTES = 8192`; extended keyword parsing to accept `regex`/`search`.
Grammar Parsing `src/parser/grammar/mod.rs`, `src/parser/grammar/tests/mod.rs`	Parse `regex` suffix (`/c`, `/s`, `/l`, optional non-zero decimal count) and mandatory `search/N` (`NonZeroUsize`); added unit tests for valid/invalid suffix forms and rule round-trips.
Codegen `src/parser/codegen.rs`	serialize_type_kind extended to emit `RegexFlags` and `Search` range literals.
Evaluator Types API `src/evaluator/types/mod.rs`, `src/evaluator/types/regex.rs`, `src/evaluator/types/search.rs`	Introduced `read_typed_value_with_pattern(...)`, `read_pattern_match(...)`, and `bytes_consumed_with_pattern(...)`; new `regex` and `search` modules implement GNU-style binary-safe regex (8192-byte cap, flags) and bounded literal search (memchr::memmem), including consumption semantics (advance to match end / special `/s` start-offset).
Evaluator Engine & Tests `src/evaluator/engine/mod.rs`, `src/evaluator/engine/tests.rs`, `src/evaluator/types/tests.rs`	Engine dispatches pattern kinds to `read_pattern_match`, enforces only `=`/`!=` for these types, threads patterns into consumption, and includes unit tests for operator rejection, anchor advancement, zero-width matches, and search window behavior.
Strength Scoring `src/evaluator/strength.rs`	Extended default strength scoring to account for `Regex { count }` and `Search { .. }`.
Docs & Gotchas `AGENTS.md`, `GOTCHAS.md`, `docs/src/*` (evaluator.md, ast-structures.md, parser.md, compatibility.md)	Documented regex/search semantics, flags (`/c`, `/s`, `/l`), 8192-byte cap, API `read_typed_value_with_pattern`/`read_pattern_match`, changed implementation status and gotchas.
Cargo `Cargo.toml`	Promoted `regex = "1.12.3"` to normal dependency and bumped `clap_complete` to `4.6.1`.

Sequence Diagram(s)

sequenceDiagram
    participant Engine as Evaluator Engine
    participant Types as Types Dispatch
    participant Pattern as Regex/Search Module
    participant Anchor as Anchor Computation

    Engine->>Types: evaluate_single_rule_with_anchor(rule, buf, off)
    alt rule.typ == Regex or Search
        Types->>Pattern: read_pattern_match(buf, off, rule.typ, Some(&rule.value))
        Pattern-->>Types: Ok(Some(Value)) / Ok(None) / Err
        Types-->>Engine: matched? (value.is_some())
        Engine->>Engine: allow only '=' or '!=' else Err
        Engine->>Anchor: bytes_consumed_with_pattern(buf, off, rule.typ, Some(&rule.value))
        Anchor-->>Engine: consumed = match_end or match_start (if /s)
    else
        Types->>Types: read_typed_value_with_pattern(buf, off, type_kind, Some/None)
        Types-->>Engine: Value
        Engine->>Engine: apply operators normally
    end
    Engine-->>Engine: advance relative children by consumed

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

chore: resolve all pending TODO items #212 — touches evaluator/engine and evaluator/types dispatch; likely overlaps on pattern-aware evaluation and anchor logic.
feat(parser): implement quad 64-bit integer type with endian variants #133 — modified TypeKind and calculate_default_strength, intersecting with this PR's AST and strength changes.
feat(parser): implement pstring multi-byte length prefix variants (/B, /H, /h, /L, /l, /J) #183 — earlier work on read_typed_value overloads and pattern/threading in evaluator paths; implementation-level overlap.

Poem

✨ Patterns wake and scanners sing,
Flags and ranges, match and bring,
Bytes confined within a bound,
Anchors move where hits are found,
Rules now chase the patterned ring.

🚥 Pre-merge checks | ✅ 9 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Libmagic-Compatibility-Check	⚠️ Warning	Pull request contains documentation inaccuracies regarding search anchor advancement behavior and unsupported regex lookahead features that contradict implementation.	Update ast-structures.md to clarify match-end semantics, remove lookahead references from parser.md and regex.rs, and document unsupported regex features.

✅ Passed checks (9 passed)

Check name	Status	Explanation
Title check	✅ Passed	Title follows Conventional Commits: feat scope is 'evaluator', description clearly indicates regex and search type implementation, includes issue reference.
Description check	✅ Passed	Description thoroughly details the implementation of Regex and Search types, acceptance criteria, test plan, and semantic correctness against GNU file reference source.
Linked Issues check	✅ Passed	PR comprehensively addresses all `#39` acceptance criteria: TypeKind extended [ast.rs], parser recognizes regex/search with flags [grammar.rs], evaluator uses regex::bytes::Regex [types/regex.rs, engine/mod.rs], search supports range-limited searching [types/search.rs], common flags supported [RegexFlags], ~50 unit tests and corpus tests added.
Out of Scope Changes check	✅ Passed	All changes are scoped to implementing Regex and Search types: parser AST/grammar, evaluator types/engine/strength, documentation, Cargo dependency updates. No unrelated refactoring or feature creep detected.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 85.00%.
Memory-Safety-Check	✅ Passed	No unsafe blocks, transmute operations, or raw pointer arithmetic introduced. All buffer access properly bounds-checked before use with offset validation returning BufferOverrun errors when out of bounds.
Performance-Regression-Check	✅ Passed	Regex pattern double compilation is a deliberately documented trade-off aligned with the project's Safety over performance philosophy, with comprehensive error handling and impact limited to matched rules.
Test-Coverage-Check	✅ Passed	Pull request provides 59 comprehensive tests covering regex/search functionality with unit, integration, and property-based tests validating core logic, error handling, and boundary conditions.
Error-Handling-Check	✅ Passed	Pull request demonstrates excellent error handling following idiomatic Rust patterns with all public APIs returning Result types and no unsafe panicking in library code.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

Implements full parser + evaluator support for libmagic’s pattern-bearing regex and search rule types, enabling GNU file-compatible text-format detection and relative-offset chaining needed by the #39 corpus fixtures.

Changes:

Added TypeKind::Regex and TypeKind::Search (plus RegexFlags / REGEX_MAX_BYTES) and parser support for regex/[csl] + optional count and mandatory search/N.
Implemented evaluator readers for regex (binary-safe via regex::bytes) and search (bounded scan via memchr::memmem), including correct anchor advancement for relative offsets.
Added extensive unit + integration + corpus tests and updated project docs/manifests to reflect the new feature set.

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
tests/regex_search_corpus_tests.rs	New end-to-end corpus integration tests for regex/search behavior and anchor advancement.
tests/property_tests.rs	Extends `TypeKind` generators to include Regex/Search for property-based coverage.
tests/evaluator_tests.rs	Adds an integration test for the upstream `regex-eol` corpus scenario via programmatic rule trees.
src/parser/types.rs	Recognizes `regex`/`search` keywords and provides placeholder `TypeKind` mapping for round-trip tests.
src/parser/grammar/tests/mod.rs	Adds parser tests covering regex/search suffix forms, errors, and operator adjacency.
src/parser/grammar/mod.rs	Implements parsing of `regex/...` modifiers (flags + count) and mandatory `search/N` range.
src/parser/codegen.rs	Adds codegen serialization for `TypeKind::Regex` and `TypeKind::Search`.
src/parser/ast.rs	Adds `TypeKind::Regex`/`Search`, `RegexFlags`, and `REGEX_MAX_BYTES` definitions + rustdoc.
src/evaluator/types/tests.rs	Updates anchor-consumption tests to use the new pattern-aware consumption path and adds regex/search consumption cases.
src/evaluator/types/search.rs	New bounded literal-search reader and anchor-advance computation.
src/evaluator/types/regex.rs	New binary-safe regex reader with scan-window computation and anchor-advance computation.
src/evaluator/types/mod.rs	Adds pattern-aware read/dispatch (`read_typed_value_with_pattern`, `read_pattern_match`) and `bytes_consumed_with_pattern`.
src/evaluator/strength.rs	Assigns default strength for Regex/Search types.
src/evaluator/engine/tests.rs	Adds engine-level tests for regex/search matching semantics and operator restrictions.
src/evaluator/engine/mod.rs	Updates engine evaluation to special-case pattern-bearing types and thread patterns into anchor advancement.
GOTCHAS.md	Documents new invariants/semantics for pattern-bearing types and corpus-testing guidance.
docs/src/evaluator.md	Documents Regex/Search in evaluator type-reading section and the pattern-aware API.
docs/solutions/integration-issues/implementing-variable-width-typekind-variant.md	New solution doc describing the integration work and pitfalls encountered.
Cargo.toml	Adds `regex` to main dependencies and bumps `clap_complete`.
Cargo.lock	Updates lockfile for `clap_complete` version bump.
AGENTS.md	Updates project capability docs to mark Regex/Search as implemented and describes semantics/limits.
.serena/project.yml	Documentation/formatting updates to Serena project configuration.

Copilot · 2026-04-11T05:58:00Z

 - **QDate**: 64-bit Unix timestamps (signed seconds since epoch) with configurable endianness and UTC/local time formatting
 - **String**: Byte sequences with length limits
 - **PString**: Pascal-style length-prefixed strings with 1-byte (`/B`), 2-byte (`/H` or `/h`), or 4-byte (`/L` or `/l`) length prefixes, supporting big-endian and little-endian byte order
+- **Regex**: Binary-safe regex matching via `regex::bytes::Regex`; the `/c` flag enables case-insensitive matching and `/l` enables multi-line start-of-line anchoring


The docs claim the /l flag “enables multi-line start-of-line anchoring”, but the implementation compiles regexes with multi-line mode unconditionally and /l only changes how the scan window is bounded (line-count vs byte-count). Please update this bullet to reflect the actual /l semantics (and that ^/$ line-boundary behavior is always enabled).

Suggested change

- **Regex**: Binary-safe regex matching via `regex::bytes::Regex`; the `/c` flag enables case-insensitive matching and `/l` enables multi-line start-of-line anchoring

- **Regex**: Binary-safe regex matching via `regex::bytes::Regex`; the `/c` flag enables case-insensitive matching, `^`/`$` line-boundary behavior is always enabled, and `/l` switches scan-window bounding to line-count semantics rather than byte-count semantics

Copilot · 2026-04-11T05:58:00Z

+        /// * `None`: use the 8192-byte default scan window.
+        /// * `Some(n)` with `flags.line_based == false`: scan at most `n`
+        ///   bytes, capped at 8192.
+        /// * `Some(n)` with `flags.line_based == true`: scan at most `n`
+        ///   lines, with an effective byte cap of `min(n * 80, 8192)`.
+        ///
+        /// The 8192-byte hard cap matches GNU `file`'s `FILE_REGEX_MAX` and
+        /// prevents runaway regex scans against large buffers.
+        count: Option<NonZeroU32>,


TypeKind::Regex rustdoc says line-based mode uses an effective byte cap of min(n * 80, 8192), but the evaluator’s compute_window implementation bounds the window by scanning for line terminators up to the Nth line (and always caps at 8192 bytes) — there is no n * 80 behavior. Please adjust the doc comment to match the implemented windowing semantics so consumers don’t rely on the wrong contract.

Copilot · 2026-04-11T05:58:00Z

+/// For `TypeKind::Regex`, the pattern is required to re-run the match and
+/// compute the consumed bytes. When the pattern is unavailable (or not a
+/// string), the function returns `0` -- the anchor will then stay put and
+/// the next relative offset resolves against the previous anchor position,
+/// which is the same graceful-degradation behavior used by the other
+/// defensive paths in this module. For `TypeKind::Search`, the pattern is
+/// not needed because the consumed distance is the entire search window
+/// regardless of where the match was found. Non-pattern types should pass
+/// `pattern: None`.


This doc comment says TypeKind::Search doesn’t need the pattern because “the consumed distance is the entire search window regardless of where the match was found”, but the implementation re-scans and advances by match_idx + pattern.len() (and therefore requires the pattern). Please update the comment to reflect match-end semantics and the need for the pattern operand.

Suggested change

/// For `TypeKind::Regex`, the pattern is required to re-run the match and

/// compute the consumed bytes. When the pattern is unavailable (or not a

/// string), the function returns `0` -- the anchor will then stay put and

/// the next relative offset resolves against the previous anchor position,

/// which is the same graceful-degradation behavior used by the other

/// defensive paths in this module. For `TypeKind::Search`, the pattern is

/// not needed because the consumed distance is the entire search window

/// regardless of where the match was found. Non-pattern types should pass

/// `pattern: None`.

/// For `TypeKind::Regex` and `TypeKind::Search`, the pattern is required

/// to re-run the match and compute the consumed bytes. For search, the

/// consumed distance is measured to the end of the matched occurrence

/// (`match_idx + pattern.len()`), not the full search window. When the

/// pattern is unavailable (or not a string), the function returns `0` --

/// the anchor will then stay put and the next relative offset resolves

/// against the previous anchor position, which is the same

/// graceful-degradation behavior used by the other defensive paths in

/// this module. Non-pattern types should pass `pattern: None`.

Copilot · 2026-04-11T05:58:01Z

+//! pattern within a bounded window. Unlike `TypeKind::String`, which only
+//! matches at the exact offset, `search` advances through the buffer looking
+//! for the first occurrence of the pattern anywhere in the window. The
+//! search window is `buffer[offset..]` capped by the optional `range`.


Module-level docs describe the search window as being capped by an “optional range”, but TypeKind::Search stores range as NonZeroUsize and the grammar rejects bare search/search/0, so the range is mandatory. Please update the docs to avoid implying that an unbounded/optional range is supported.

Suggested change

//! search window is `buffer[offset..]` capped by the optional `range`.

//! search window is `buffer[offset..]` capped by the mandatory `range`

//! (or the remaining buffer length, if smaller).

Copilot · 2026-04-11T05:58:01Z

+**Regex reader** (`src/evaluator/types/regex.rs`) — uses a `build_regex` helper that wraps the pattern in `^(?:...)` when `/l` is set so bare, unanchored patterns cannot match mid-line:
+
+```rust
+fn build_regex(
+    pattern: &str,
+    case_insensitive: bool,
+    start_of_line: bool,
+) -> Result<Regex, regex::Error> {
+    let owned;
+    let effective_pattern: &str = if start_of_line {
+        owned = format!("^(?:{pattern})");
+        &owned
+    } else {
+        pattern
+    };
+    RegexBuilder::new(effective_pattern)
+        .case_insensitive(case_insensitive)
+        .multi_line(start_of_line)
+        .build()
+}
+


This solution doc’s regex semantics (wrapping patterns with ^(?:...) when /l is set and toggling multi_line(start_of_line)) conflict with the current implementation and GOTCHAS: regex multi-line mode is unconditional, and /l controls only scan-window bounding (line-based window), not anchoring. The read_search signature shown here also uses an optional range, which no longer matches the mandatory NonZeroUsize range. Please update this doc to match the merged behavior so it doesn’t mislead future contributors.

dosubot · 2026-04-11T05:59:27Z

Related Documentation

4 document(s) may need updating based on files changed in this PR:

libMagic-rs

ARCHITECTURE `/libmagic-rs/blob/main/docs/ARCHITECTURE.md` — ⏳ Awaiting Merge

ast-structures `/libmagic-rs/blob/main/docs/src/ast-structures.md` — ⏳ Awaiting Merge

compatibility `/libmagic-rs/blob/main/docs/src/compatibility.md` — ⏳ Awaiting Merge

parser `/libmagic-rs/blob/main/docs/src/parser.md` — ⏳ Awaiting Merge

Note: You must be authenticated to accept/decline updates.

^{How did I do? Any feedback?}

codecov · 2026-04-11T06:07:24Z

Codecov Report

❌ Patch coverage is 89.28571% with 60 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/evaluator/types/mod.rs	61.64%	28 Missing ⚠️
src/parser/codegen.rs	0.00%	10 Missing ⚠️
src/evaluator/types/regex.rs	96.52%	8 Missing ⚠️
src/evaluator/strength.rs	0.00%	5 Missing ⚠️
src/evaluator/types/search.rs	95.41%	5 Missing ⚠️
src/parser/grammar/mod.rs	95.00%	4 Missing ⚠️

📢 Thoughts on this report? Let us know!

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

src/parser/grammar/mod.rs (1)
394-563: 🛠️ Refactor suggestion | 🟠 Major

Extract the regex/search suffix parsers instead of growing parse_type_and_operator further.

This function now owns pstring suffixes, regex modifiers, mandatory search ranges, attached &mask, and fallback operator parsing. The behavior is fine, but the new branches make the operator-boundary rules much harder to audit and push an already-oversized parser file further past the repo’s size target. Pulling the regex/search pieces into parse_regex_suffix(...) / parse_search_suffix(...) helpers would make this logic easier to test and maintain.

As per coding guidelines, "Keep source files under 500-600 lines; split larger files into focused modules" and "Module structure: Use src/parser/ for parsing system with ast.rs, grammar/, types.rs, hierarchy.rs, loader.rs, and codegen.rs".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/parser/grammar/mod.rs` around lines 394 - 563, parse_type_and_operator is
too large and contains inline parsing for regex and search suffixes which should
be extracted; create two helper functions parse_regex_suffix(input: &str) ->
IResult<&str, (RegexFlags, Option<u32>)> and parse_search_suffix(input: &str) ->
IResult<&str, ::std::num::NonZeroUsize> (reuse existing parse_pstring_suffix for
pstring) and replace the large inline blocks in parse_type_and_operator with
calls to these helpers, ensuring they return the remaining rest string and
parsed values (flags/count or range) and preserve the same error semantics and
operator-boundary behavior so later logic (attached_op parsing and
type_keyword_to_kind handling) continues to work unchanged.
docs/src/evaluator.md (1)
595-596: ⚠️ Potential issue | 🟡 Minor

Implementation status is stale after this PR.

The checklist still shows regex support as unimplemented even though this change adds TypeKind::Regex/TypeKind::Search, parser support, evaluator wiring, and tests. That will send readers to the wrong source of truth immediately after merge.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/src/evaluator.md` around lines 595 - 596, The docs checklist is out of
date—mark Regex support as implemented: update the evaluator.md checklist entry
for "Regex type support" to checked or remove it so it reflects that
TypeKind::Regex and TypeKind::Search, parser support, evaluator wiring, and
tests are present; ensure the doc text mentions the implemented symbols
(TypeKind::Regex, TypeKind::Search, parser/evaluator wiring) so readers aren't
directed to the wrong source of truth after merge.
src/evaluator/types/mod.rs (1)
183-239: ⚠️ Potential issue | 🟠 Major

Keep a lossless public API for pattern reads.

Lines 219-225 and 237-238 collapse Ok(None) into Value::String(""), while Lines 261-295 keep the only lossless matcher crate-private. That makes the new public read_typed_value_with_pattern ambiguous for external callers: a regex miss is indistinguishable from a zero-width hit, and a search miss can masquerade as an empty-string match. Please either expose a lossless Option/enum-based API publicly or keep this helper internal until that contract exists.

Also applies to: 261-295
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/evaluator/types/mod.rs` around lines 183 - 239, The public helper
read_typed_value_with_pattern currently collapses None results from
read_regex/read_search into Value::String("") for TypeKind::Regex and
TypeKind::Search, losing the distinction between a miss and an empty/zero-width
match; either change the public API to return an Option/enum that preserves None
(e.g., Result<Option<Value>, TypeReadError> or a small enum MatchResult) or make
read_typed_value_with_pattern crate-private/internal until you add that lossless
return type; update callers accordingly and keep references to TypeKind::Regex,
TypeKind::Search, read_regex, and read_search to locate the affected branches to
implement the lossless return or visibility change.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@AGENTS.md`:
- Line 307: Update the AGENTS.md text that claims regex/search coverage is
incomplete: locate the sentence "Note: Currently implemented types are `Byte`,
`Short`, `Long`, `Quad`, `Float`, `Double`, `Date`, `QDate`, `String`,
`PString`, `Regex`, and `Search`" and the later lines mentioning regex flag
completeness/search range enforcement (the "Current Limitations"/future-phase
notes) and revise them to reflect that `Regex` and `Search` support has been
implemented by this PR—remove or reword the TODO language so the document
consistently states that regex/search flags and range enforcement are
implemented, and update the "Current Limitations" section to only list remaining
true gaps.

In `@docs/src/evaluator.md`:
- Around line 135-157: Update the docs to reflect the current pattern behavior:
clarify that the /l flag no longer enables multiline anchoring but only sets a
line-based window size, that multiline matching is now unconditional (not gated
by /l), and that the engine does not always call read_typed_value_with_pattern
for pattern-bearing rules—evaluate_single_rule_with_anchor now calls
read_pattern_match to preserve zero-width-match vs miss semantics
(read_typed_value remains the None-pattern wrapper). Adjust the paragraphs
describing /l and the engine call path accordingly and reference
read_typed_value_with_pattern, read_pattern_match,
evaluate_single_rule_with_anchor, and read_typed_value when explaining the
corrected behavior.

In `@src/evaluator/engine/tests.rs`:
- Around line 2597-2605: Update the test comment in
test_search_parent_relative_child_at_positive_offset to reflect that this is a
positive-offset sanity check: change phrasing from “Sanity check the negative”
and any mention of `Relative(-N)` to indicate a positive relative offset (e.g.,
`OffsetSpec::Relative(1)`), and ensure the comment correctly describes that a
`Relative(1)` child resolves against the match-end anchor for `MagicRule` in
this test.

In `@src/evaluator/strength.rs`:
- Around line 80-93: The test_test_calculate_default_strength_table currently
doesn't cover the new scoring branches for TypeKind::Regex and TypeKind::Search;
update the test_test_calculate_default_strength_table to explicitly construct
TypeKind::Regex variants (one with count = Some(...) expecting strength 25 and
one with count = None expecting strength 20) and a TypeKind::Search (expecting
strength 25), call calculate_default_strength_table (or the library function
that maps TypeKind -> strength) and assert those expected strength values so
future changes to scoring will fail the test if regressed; reference
TypeKind::Regex, TypeKind::Search, calculate_default_strength_table, and
test_calculate_default_strength_table when adding the assertions.

In `@src/evaluator/types/mod.rs`:
- Around line 375-383: Update the rustdoc for the TypeKind handling: change the
incorrect statement that "TypeKind::Search does not need the pattern and
consumes the full window" to accurately say that TypeKind::Search uses the
provided pattern/needle to find the match end and computes the consumed distance
from the anchor to the end of that match (i.e., callers pass the needle into
search_bytes_consumed to determine the consumed bytes), and mention the fallback
behavior when the pattern is missing or not a string; also update the similar
comment near the search_bytes_consumed usage (the lines that pass the needle
into search_bytes_consumed) so both places describe the current match-end
behavior consistently.

In `@src/evaluator/types/regex.rs`:
- Around line 82-121: The window walker in compute_window (the logic using
buffer.get(offset..), byte_cap, capped and the while loop over capped[idx]) must
be rewritten to use the SafeBufferAccess/.get() pattern everywhere instead of
direct slicing or indexing; replace direct slices like &capped[..count_bytes]
and &capped[..idx] and indexed access capped[idx] and capped[idx+1] with calls
to buffer.get(...) or capped.get(...) and handle Option results (early return or
break) to maintain the same CR/CRLF and line-count behavior, keeping the same
REGEX_MAX_BYTES cap, the flags.line_based branch, and the
target_lines/lines_seen logic but using safe .get() lookups and saturation
semantics.

In `@src/evaluator/types/search.rs`:
- Around line 53-55: Both read_search and search_bytes_consumed create a window
using direct slicing (let remaining = &buffer[offset..]; let window =
&remaining[..window_len];) which violates the SafeBufferAccess pattern; replace
these raw slices with SafeBufferAccess .get() calls to perform bounds-checked
access (e.g. use buffer.get(offset..offset+window_len) or
buffer.get(offset..).and_then(|r| r.get(..window_len))) and handle the Option
result consistently. Update the code paths that compute window_len from
range.get() so they call .get(range_start..range_start+window_len) (or chain
.get on remaining) and return/propagate a safe empty/nonexistent slice when
.get() yields None; apply the same change to the other occurrence around lines
~90-100 so both read_search and search_bytes_consumed use .get() for all buffer
windows.

In `@src/parser/ast.rs`:
- Around line 383-394: The doc for the AST field `count: Option<NonZeroU32>` is
incorrect about line-based behavior; update the rustdoc on `count` (and nearby
comment for `/l`) to reflect the real semantics implemented in
`evaluator::types::regex.rs` (see `compute_window`): when `flags.line_based ==
true` the code advances to the Nth line terminator and only afterward enforces
the 8192-byte hard cap, not an a priori byte estimate of `min(n * 80, 8192)`;
add a short example showing `regex/Nl` walking to the Nth newline then trimming
to 8192 bytes and a brief note about the performance/design tradeoff to justify
walking lines before the cap.

In `@src/parser/codegen.rs`:
- Around line 235-253: The generated code emits
libmagic_rs::parser::ast::RegexFlags which doesn't match in-crate imports; in
the TypeKind::Regex branch of src/parser/codegen.rs (the match arm constructing
the format! string for Regex), replace the path
libmagic_rs::parser::ast::RegexFlags with crate::parser::ast::RegexFlags so the
produced string uses crate::parser::ast::RegexFlags {{ case_insensitive: ...,
start_offset: ..., line_based: ... }} and keep the existing flags.field
interpolations and count handling unchanged.

In `@src/parser/grammar/tests/mod.rs`:
- Around line 2279-2446: Extract the regex/search tests (functions
test_parse_type_and_operator_regex_and_search_suffixes,
test_parse_type_and_operator_search_requires_range,
test_parse_type_and_operator_regex_invalid_suffix,
test_parse_type_and_operator_regex_operator_adjacent,
test_parse_magic_rule_regex_and_search) into a new test file named
src/parser/grammar/tests/regex_search.rs, preserving all use/imports (e.g.
crate::parser::ast::{RegexFlags, TypeKind}, OffsetSpec, Operator, Value) and
helper functions rx and sr; remove those tests from the large mod.rs and add mod
regex_search; to the tests module so the new file is compiled, ensuring any
relative paths remain crate::... so nothing else needs adjusting, then run cargo
test to verify.

In `@src/parser/types.rs`:
- Around line 319-321: Replace the panic-based construction in the
TypeKind::Search mapping by removing the expect() call and using a safe,
explicit non-panicking construction; specifically, change
::std::num::NonZeroUsize::new(1).expect("1 is nonzero") to an explicit unsafe
construction like unsafe { ::std::num::NonZeroUsize::new_unchecked(1) } (with a
brief comment that 1 is statically non-zero) so the library code doesn't call
expect()/panic.

---

Outside diff comments:
In `@docs/src/evaluator.md`:
- Around line 595-596: The docs checklist is out of date—mark Regex support as
implemented: update the evaluator.md checklist entry for "Regex type support" to
checked or remove it so it reflects that TypeKind::Regex and TypeKind::Search,
parser support, evaluator wiring, and tests are present; ensure the doc text
mentions the implemented symbols (TypeKind::Regex, TypeKind::Search,
parser/evaluator wiring) so readers aren't directed to the wrong source of truth
after merge.

In `@src/evaluator/types/mod.rs`:
- Around line 183-239: The public helper read_typed_value_with_pattern currently
collapses None results from read_regex/read_search into Value::String("") for
TypeKind::Regex and TypeKind::Search, losing the distinction between a miss and
an empty/zero-width match; either change the public API to return an Option/enum
that preserves None (e.g., Result<Option<Value>, TypeReadError> or a small enum
MatchResult) or make read_typed_value_with_pattern crate-private/internal until
you add that lossless return type; update callers accordingly and keep
references to TypeKind::Regex, TypeKind::Search, read_regex, and read_search to
locate the affected branches to implement the lossless return or visibility
change.

In `@src/parser/grammar/mod.rs`:
- Around line 394-563: parse_type_and_operator is too large and contains inline
parsing for regex and search suffixes which should be extracted; create two
helper functions parse_regex_suffix(input: &str) -> IResult<&str, (RegexFlags,
Option<u32>)> and parse_search_suffix(input: &str) -> IResult<&str,
::std::num::NonZeroUsize> (reuse existing parse_pstring_suffix for pstring) and
replace the large inline blocks in parse_type_and_operator with calls to these
helpers, ensuring they return the remaining rest string and parsed values
(flags/count or range) and preserve the same error semantics and
operator-boundary behavior so later logic (attached_op parsing and
type_keyword_to_kind handling) continues to work unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ad7f0fa2-ae76-4ad8-a150-7da5774a11fa

📥 Commits

Reviewing files that changed from the base of the PR and between 4c050d6 and 2756f8c.

⛔ Files ignored due to path filters (6)

.serena/project.yml is excluded by none and included by none
Cargo.lock is excluded by !**/*.lock and included by none
docs/solutions/integration-issues/implementing-variable-width-typekind-variant.md is excluded by none and included by none
tests/evaluator_tests.rs is excluded by none and included by none
tests/property_tests.rs is excluded by none and included by none
tests/regex_search_corpus_tests.rs is excluded by none and included by none

📒 Files selected for processing (16)

AGENTS.md
Cargo.toml
GOTCHAS.md
docs/src/evaluator.md
src/evaluator/engine/mod.rs
src/evaluator/engine/tests.rs
src/evaluator/strength.rs
src/evaluator/types/mod.rs
src/evaluator/types/regex.rs
src/evaluator/types/search.rs
src/evaluator/types/tests.rs
src/parser/ast.rs
src/parser/codegen.rs
src/parser/grammar/mod.rs
src/parser/grammar/tests/mod.rs
src/parser/types.rs

coderabbitai · 2026-04-11T06:09:38Z

 ### Adding New Type Support

-> **Note:** Currently implemented types are `Byte`, `Short`, `Long`, `Quad`, `Float`, `Double`, `Date`, `QDate`, `String`, and `PString`. Regex and search types are planned for future releases.
+> **Note:** Currently implemented types are `Byte`, `Short`, `Long`, `Quad`, `Float`, `Double`, `Date`, `QDate`, `String`, `PString`, `Regex`, and `Search`. See "Current Limitations" for the remaining gaps in regex/search flag coverage.


⚠️ Potential issue | 🟡 Minor

AGENTS still presents regex/search work as incomplete.

Line 307 refers readers to “remaining gaps in regex/search flag coverage”, and Lines 462-463 still put regex flag completeness/search range enforcement in a future phase. That now conflicts with the implementation this PR adds, so future contributors will get mixed guidance from the same document.

Also applies to: 462-463

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@AGENTS.md` at line 307, Update the AGENTS.md text that claims regex/search coverage is incomplete: locate the sentence "Note: Currently implemented types are `Byte`, `Short`, `Long`, `Quad`, `Float`, `Double`, `Date`, `QDate`, `String`, `PString`, `Regex`, and `Search`" and the later lines mentioning regex flag completeness/search range enforcement (the "Current Limitations"/future-phase notes) and revise them to reflect that `Regex` and `Search` support has been implemented by this PR—remove or reword the TODO language so the document consistently states that regex/search flags and range enforcement are implemented, and update the "Current Limitations" section to only list remaining true gaps.

coderabbitai · 2026-04-11T06:09:38Z

+- **Regex**: Binary-safe regex matching via `regex::bytes::Regex`; the `/c` flag enables case-insensitive matching and `/l` enables multi-line start-of-line anchoring
+- **Search**: Bounded literal pattern scan via `memchr::memmem::find`; `search/N` caps the scan window to `N` bytes from the offset
 - **Bounds checking**: Prevents buffer overruns

 ```rust
+// Non-pattern types use the 3-arg convenience wrapper:
 pub fn read_typed_value(
    buffer: &[u8],
    offset: usize,
    type_kind: &TypeKind,
 ) -> Result<Value, TypeReadError>
+
+// Pattern-bearing types (Regex, Search) thread the rule's value operand
+// through as the match pattern:
+pub fn read_typed_value_with_pattern(
+    buffer: &[u8],
+    offset: usize,
+    type_kind: &TypeKind,
+    pattern: Option<&Value>,
+) -> Result<Value, TypeReadError>
 ```

+The engine uses `read_typed_value_with_pattern` uniformly and passes `Some(&rule.value)` for every rule; the convenience `read_typed_value` is a thin wrapper that forwards `pattern: None`. For pattern-bearing types a genuine "no match" is collapsed to `Value::String(String::new())` in the `read_typed_value_with_pattern` return so the back-compat `Value` shape is preserved; the engine instead calls `read_pattern_match` directly, which returns `Result<Option<Value>, _>` so zero-width matches (e.g. `^`, `a*`) can be distinguished from genuine misses.


⚠️ Potential issue | 🟡 Minor

These docs still describe the pre-fix regex/search behavior.

Line 135 says /l enables multiline anchoring, but the implementation now treats /l only as line-based window sizing; multiline matching is unconditional. The paragraph at Line 157 also says the engine “uses read_typed_value_with_pattern uniformly”, but evaluate_single_rule_with_anchor now special-cases pattern-bearing types through read_pattern_match specifically to preserve zero-width-match vs miss semantics.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/src/evaluator.md` around lines 135 - 157, Update the docs to reflect the current pattern behavior: clarify that the /l flag no longer enables multiline anchoring but only sets a line-based window size, that multiline matching is now unconditional (not gated by /l), and that the engine does not always call read_typed_value_with_pattern for pattern-bearing rules—evaluate_single_rule_with_anchor now calls read_pattern_match to preserve zero-width-match vs miss semantics (read_typed_value remains the None-pattern wrapper). Adjust the paragraphs describing /l and the engine call path accordingly and reference read_typed_value_with_pattern, read_pattern_match, evaluate_single_rule_with_anchor, and read_typed_value when explaining the corrected behavior.

coderabbitai · 2026-04-11T06:09:38Z

+/// Sanity check the negative: when the parent search finds the pattern
+/// early in the window, a Relative(-N) child should still resolve against
+/// the match-end anchor. This catches a class of bugs where the anchor
+/// update uses the wrong base offset.
+#[test]
+fn test_search_parent_relative_child_at_positive_offset() {
+    // Buffer: "prefix_NEEDLE_after_stuff" -- "NEEDLE" is at index 7, len
+    // 6, match-end = 13. A Relative(1) child should read byte 14 = 'a'.
+    let child = MagicRule {


⚠️ Potential issue | 🟡 Minor

Correct the test comment wording (“negative” vs positive offset).

The comment says “Sanity check the negative” and references Relative(-N), but the test uses OffsetSpec::Relative(1). Update wording to avoid future confusion.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/evaluator/engine/tests.rs` around lines 2597 - 2605, Update the test comment in test_search_parent_relative_child_at_positive_offset to reflect that this is a positive-offset sanity check: change phrasing from “Sanity check the negative” and any mention of `Relative(-N)` to indicate a positive relative offset (e.g., `OffsetSpec::Relative(1)`), and ensure the comment correctly describes that a `Relative(1)` child resolves against the match-end anchor for `MagicRule` in this test.

coderabbitai · 2026-04-11T06:09:38Z

+        // Regex matches a pattern -- treat similarly to an unbounded string.
+        // A rule with an explicit `count` is more constrained (narrower scan
+        // window) and therefore more specific.
+        TypeKind::Regex { count, .. } => {
+            if count.is_some() {
+                25
+            } else {
+                20
+            }
+        }
+        // Search is always a bounded scan (the range is mandatory), so it
+        // gets the "constrained match" bonus unconditionally. This matches
+        // the max_length bonus used for String and PString.
+        TypeKind::Search { .. } => 25,


🧹 Nitpick | 🔵 Trivial

Add explicit strength tests for Regex and Search branches.

This introduces new scoring paths but test_calculate_default_strength_table doesn’t assert them yet. Add explicit cases so future tuning doesn’t regress silently.

As per coding guidelines: src/**/*.rs: Every PR must include tests for new functionality.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/evaluator/strength.rs` around lines 80 - 93, The test_test_calculate_default_strength_table currently doesn't cover the new scoring branches for TypeKind::Regex and TypeKind::Search; update the test_test_calculate_default_strength_table to explicitly construct TypeKind::Regex variants (one with count = Some(...) expecting strength 25 and one with count = None expecting strength 20) and a TypeKind::Search (expecting strength 25), call calculate_default_strength_table (or the library function that maps TypeKind -> strength) and assert those expected strength values so future changes to scoring will fail the test if regressed; reference TypeKind::Regex, TypeKind::Search, calculate_default_strength_table, and test_calculate_default_strength_table when adding the assertions.

coderabbitai · 2026-04-11T06:09:38Z

+/// For `TypeKind::Regex`, the pattern is required to re-run the match and
+/// compute the consumed bytes. When the pattern is unavailable (or not a
+/// string), the function returns `0` -- the anchor will then stay put and
+/// the next relative offset resolves against the previous anchor position,
+/// which is the same graceful-degradation behavior used by the other
+/// defensive paths in this module. For `TypeKind::Search`, the pattern is
+/// not needed because the consumed distance is the entire search window
+/// regardless of where the match was found. Non-pattern types should pass
+/// `pattern: None`.


⚠️ Potential issue | 🟡 Minor

Fix the search anchor rustdoc.

Lines 380-382 say TypeKind::Search does not need the pattern and consumes the full window, but Lines 459-465 pass the needle into search_bytes_consumed. Please update the comment to describe the current match-end behavior; right now the docs encode the wrong invariant for future refactors.

Also applies to: 459-465

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/evaluator/types/mod.rs` around lines 375 - 383, Update the rustdoc for the TypeKind handling: change the incorrect statement that "TypeKind::Search does not need the pattern and consumes the full window" to accurately say that TypeKind::Search uses the provided pattern/needle to find the match end and computes the consumed distance from the anchor to the end of that match (i.e., callers pass the needle into search_bytes_consumed to determine the consumed bytes), and mention the fallback behavior when the pattern is missing or not a string; also update the similar comment near the search_bytes_consumed usage (the lines that pass the needle into search_bytes_consumed) so both places describe the current match-end behavior consistently.

coderabbitai · 2026-04-11T06:09:38Z

+    let remaining = &buffer[offset..];
+    let window_len = range.get().min(remaining.len());
+    let window = &remaining[..window_len];


🛠️ Refactor suggestion | 🟠 Major

Replace direct slices with .get()-based window construction.

Both read_search and search_bytes_consumed build their scan windows with direct slicing. That works because window_len is pre-clamped, but evaluator code in this repo is supposed to stay on the .get()/SafeBufferAccess path consistently.

Proposed direction

- let window = &remaining[..window_len]; + let Some(window) = remaining.get(..window_len) else { + return Ok(None); + };

- let window = &remaining[..window_len]; + let Some(window) = remaining.get(..window_len) else { + return 0; + };

As per coding guidelines, "Use SafeBufferAccess trait pattern from src/io/mod.rs for bounds checking" and "All buffer access must use bounds checking with `.get()` methods. No raw pointer arithmetic or transmute operations."
Also applies to: 90-100

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/evaluator/types/search.rs` around lines 53 - 55, Both read_search and search_bytes_consumed create a window using direct slicing (let remaining = &buffer[offset..]; let window = &remaining[..window_len];) which violates the SafeBufferAccess pattern; replace these raw slices with SafeBufferAccess .get() calls to perform bounds-checked access (e.g. use buffer.get(offset..offset+window_len) or buffer.get(offset..).and_then(|r| r.get(..window_len))) and handle the Option result consistently. Update the code paths that compute window_len from range.get() so they call .get(range_start..range_start+window_len) (or chain .get on remaining) and return/propagate a safe empty/nonexistent slice when .get() yields None; apply the same change to the other occurrence around lines ~90-100 so both read_search and search_bytes_consumed use .get() for all buffer windows.

coderabbitai · 2026-04-11T06:09:38Z

+        /// Optional numeric count from `regex/N[flags]`. Interpretation
+        /// depends on `flags.line_based`:
+        ///
+        /// * `None`: use the 8192-byte default scan window.
+        /// * `Some(n)` with `flags.line_based == false`: scan at most `n`
+        ///   bytes, capped at 8192.
+        /// * `Some(n)` with `flags.line_based == true`: scan at most `n`
+        ///   lines, with an effective byte cap of `min(n * 80, 8192)`.
+        ///
+        /// The 8192-byte hard cap matches GNU `file`'s `FILE_REGEX_MAX` and
+        /// prevents runaway regex scans against large buffers.
+        count: Option<NonZeroU32>,


🧹 Nitpick | 🔵 Trivial

Document /l the way compute_window actually behaves.

These field docs say line-based mode has an effective byte cap of min(n * 80, 8192), but the implementation in src/evaluator/types/regex.rs walks forward to the Nth line terminator and only then applies the 8192-byte hard cap. As written, the public AST docs give the wrong mental model for regex/Nl.

Proposed doc fix

- /// * `Some(n)` with `flags.line_based == true`: scan at most `n` - /// lines, with an effective byte cap of `min(n * 80, 8192)`. + /// * `Some(n)` with `flags.line_based == true`: scan through the + /// end of the Nth line terminator, capped at 8192 bytes.

As per coding guidelines, "Public APIs require rustdoc with examples" and "Explain design decisions and trade-offs in code comments".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/parser/ast.rs` around lines 383 - 394, The doc for the AST field `count: Option<NonZeroU32>` is incorrect about line-based behavior; update the rustdoc on `count` (and nearby comment for `/l`) to reflect the real semantics implemented in `evaluator::types::regex.rs` (see `compute_window`): when `flags.line_based == true` the code advances to the Nth line terminator and only afterward enforces the 8192-byte hard cap, not an a priori byte estimate of `min(n * 80, 8192)`; add a short example showing `regex/Nl` walking to the Nth newline then trimming to 8192 bytes and a brief note about the performance/design tradeoff to justify walking lines before the cap.

coderabbitai · 2026-04-11T06:09:38Z

+        TypeKind::Regex { flags, count } => {
+            let count_lit = match count {
+                Some(n) => format!("::std::num::NonZeroU32::new({}).unwrap()", n.get()),
+                None => String::new(),
+            };
+            let count_expr = if count.is_some() {
+                format!("Some({count_lit})")
+            } else {
+                "None".to_string()
+            };
+            format!(
+                "TypeKind::Regex {{ flags: libmagic_rs::parser::ast::RegexFlags {{ case_insensitive: {}, start_offset: {}, line_based: {} }}, count: {count_expr} }}",
+                flags.case_insensitive, flags.start_offset, flags.line_based
+            )
+        }
+        TypeKind::Search { range } => format!(
+            "TypeKind::Search {{ range: ::std::num::NonZeroUsize::new({}).unwrap() }}",
+            range.get()
+        ),


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "1) Check whether crate self-alias exists:" rg -n --type=rust 'extern crate self as libmagic_rs' src build.rs || true echo echo "2) Find generated-source import style and emitted RegexFlags path:" rg -n --type=rust 'use crate::parser::ast::|libmagic_rs::parser::ast::RegexFlags' src/parser/codegen.rs echo echo "3) Find where generated builtin rules are included/consumed:" rg -n --type=rust 'generate_builtin_rules|BUILTIN_RULES|include!\(' src build.rs

Repository: EvilBit-Labs/libmagic-rs

Length of output: 2533

Fix generated code path for RegexFlags to use crate:: prefix.

The codegen emits libmagic_rs::parser::ast::RegexFlags on line 246, but generated code imports from crate::parser::ast::... (line 32) with no crate self-alias. This causes a compilation error when Regex rules are serialized. Change to crate::parser::ast::RegexFlags to match the generated import style and ensure in-crate resolution.

Suggested patch

- format!( - "TypeKind::Regex {{ flags: libmagic_rs::parser::ast::RegexFlags {{ case_insensitive: {}, start_offset: {}, line_based: {} }}, count: {count_expr} }}", - flags.case_insensitive, flags.start_offset, flags.line_based - ) + format!( + "TypeKind::Regex {{ flags: crate::parser::ast::RegexFlags {{ case_insensitive: {}, start_offset: {}, line_based: {} }}, count: {count_expr} }}", + flags.case_insensitive, flags.start_offset, flags.line_based + )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

TypeKind::Regex { flags, count } => {

let count_lit = match count {

Some(n) => format!("::std::num::NonZeroU32::new({}).unwrap()", n.get()),

None => String::new(),

};

let count_expr = if count.is_some() {

format!("Some({count_lit})")

} else {

"None".to_string()

};

format!(

"TypeKind::Regex {{ flags: libmagic_rs::parser::ast::RegexFlags {{ case_insensitive: {}, start_offset: {}, line_based: {} }}, count: {count_expr} }}",

flags.case_insensitive, flags.start_offset, flags.line_based

)

}

TypeKind::Search { range } => format!(

"TypeKind::Search {{ range: ::std::num::NonZeroUsize::new({}).unwrap() }}",

range.get()

),

TypeKind::Regex { flags, count } => {

let count_lit = match count {

Some(n) => format!("::std::num::NonZeroU32::new({}).unwrap()", n.get()),

None => String::new(),

};

let count_expr = if count.is_some() {

format!("Some({count_lit})")

} else {

"None".to_string()

};

format!(

"TypeKind::Regex {{ flags: crate::parser::ast::RegexFlags {{ case_insensitive: {}, start_offset: {}, line_based: {} }}, count: {count_expr} }}",

flags.case_insensitive, flags.start_offset, flags.line_based

)

}

TypeKind::Search { range } => format!(

"TypeKind::Search {{ range: ::std::num::NonZeroUsize::new({}).unwrap() }}",

range.get()

),

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/parser/codegen.rs` around lines 235 - 253, The generated code emits libmagic_rs::parser::ast::RegexFlags which doesn't match in-crate imports; in the TypeKind::Regex branch of src/parser/codegen.rs (the match arm constructing the format! string for Regex), replace the path libmagic_rs::parser::ast::RegexFlags with crate::parser::ast::RegexFlags so the produced string uses crate::parser::ast::RegexFlags {{ case_insensitive: ..., start_offset: ..., line_based: ... }} and keep the existing flags.field interpolations and count handling unchanged.

coderabbitai · 2026-04-11T06:09:39Z

+#[test]
+fn test_parse_type_and_operator_regex_and_search_suffixes() {
+    use crate::parser::ast::{RegexFlags, TypeKind};
+    use std::num::{NonZeroU32, NonZeroUsize};
+
+    fn rx(case: bool, start: bool, line: bool, count: Option<u32>) -> TypeKind {
+        TypeKind::Regex {
+            flags: RegexFlags {
+                case_insensitive: case,
+                start_offset: start,
+                line_based: line,
+            },
+            count: count.and_then(NonZeroU32::new),
+        }
+    }
+    fn sr(n: usize) -> TypeKind {
+        TypeKind::Search {
+            range: NonZeroUsize::new(n).unwrap(),
+        }
+    }
+
+    let cases: &[(&str, TypeKind, &str)] = &[
+        ("regex", rx(false, false, false, None), ""),
+        ("regex/c", rx(true, false, false, None), ""),
+        ("regex/l", rx(false, false, true, None), ""),
+        ("regex/s", rx(false, true, false, None), ""),
+        ("regex/cl", rx(true, false, true, None), ""),
+        ("regex/lc", rx(true, false, true, None), ""),
+        ("regex/cs", rx(true, true, false, None), ""),
+        ("regex/csl", rx(true, true, true, None), ""),
+        ("regex/1l", rx(false, false, true, Some(1)), ""),
+        ("regex/l1", rx(false, false, true, Some(1)), ""),
+        ("regex/1c", rx(true, false, false, Some(1)), ""),
+        ("regex/256", rx(false, false, false, Some(256)), ""),
+        ("regex/c =", rx(true, false, false, None), "="),
+        ("search/256", sr(256), ""),
+        ("search/1", sr(1), ""),
+        ("search/256 =", sr(256), "="),
+    ];
+    for &(input, ref expected_kind, expected_rest) in cases {
+        let (rest, (kind, op)) = parse_type_and_operator(input).expect(input);
+        assert_eq!(rest, expected_rest, "rest for input: {input}");
+        assert!(op.is_none(), "operator for input: {input}");
+        assert_eq!(&kind, expected_kind, "kind for input: {input}");
+    }
+}
+
+#[test]
+fn test_parse_type_and_operator_search_requires_range() {
+    // Bare `search` (no /N suffix) is a hard parse error per GNU `file`.
+    assert!(parse_type_and_operator("search").is_err());
+    // `search/0` is also rejected -- `NonZeroUsize` makes a zero-width
+    // scan unrepresentable.
+    assert!(parse_type_and_operator("search/0").is_err());
+}
+
+#[test]
+fn test_parse_type_and_operator_regex_invalid_suffix() {
+    // Bare slash with no flags or count
+    assert!(parse_type_and_operator("regex/").is_err());
+    // Unrecognized flag letter
+    assert!(parse_type_and_operator("regex/z").is_err());
+    // Non-operator trailing character is still rejected
+    assert!(parse_type_and_operator("regex/cz").is_err());
+    // regex/0 is rejected because a zero count has no valid semantics
+    // (our parser uses NonZeroU32 to express "user specified a count").
+    assert!(parse_type_and_operator("regex/0").is_err());
+}
+
+#[test]
+fn test_parse_type_and_operator_regex_operator_adjacent() {
+    use crate::parser::ast::{Operator, RegexFlags, TypeKind};
+
+    // `regex/c=` should leave `=` for parse_operator, matching the `regex/c =`
+    // (space-separated) behavior and mirroring `search/256=`.
+    let (rest, (kind, op)) = parse_type_and_operator("regex/c=").expect("regex/c=");
+    assert_eq!(rest, "=");
+    assert!(op.is_none());
+    assert_eq!(
+        kind,
+        TypeKind::Regex {
+            flags: RegexFlags {
+                case_insensitive: true,
+                ..RegexFlags::default()
+            },
+            count: None,
+        }
+    );
+
+    // `regex/l!=` should leave `!=` for parse_operator.
+    let (rest, (kind, op)) = parse_type_and_operator("regex/l!=").expect("regex/l!=");
+    assert_eq!(rest, "!=");
+    assert!(op.is_none());
+    assert_eq!(
+        kind,
+        TypeKind::Regex {
+            flags: RegexFlags {
+                line_based: true,
+                ..RegexFlags::default()
+            },
+            count: None,
+        }
+    );
+
+    // Confirm the full pipeline parses the operator correctly through
+    // parse_type_and_operator + parse_operator chaining.
+    let (rest, (_, _)) = parse_type_and_operator("regex/c=foo").expect("regex/c=foo");
+    let (rest_after_op, op) = crate::parser::grammar::parse_operator(rest).expect("operator");
+    assert_eq!(op, Operator::Equal);
+    assert_eq!(rest_after_op, "foo");
+}
+
+#[test]
+fn test_parse_magic_rule_regex_and_search() {
+    use crate::parser::ast::RegexFlags;
+    use std::num::{NonZeroU32, NonZeroUsize};
+
+    // regex/c: case-insensitive flag
+    let input = r#"0 regex/c "hello" case-insensitive match"#;
+    let (remaining, rule) = parse_magic_rule(input).unwrap();
+    assert_eq!(remaining, "");
+    assert_eq!(rule.offset, OffsetSpec::Absolute(0));
+    assert_eq!(
+        rule.typ,
+        TypeKind::Regex {
+            flags: RegexFlags {
+                case_insensitive: true,
+                ..RegexFlags::default()
+            },
+            count: None,
+        }
+    );
+    assert_eq!(rule.op, Operator::Equal);
+    assert_eq!(rule.value, Value::String("hello".to_string()));
+    assert_eq!(rule.message, "case-insensitive match");
+
+    // search/256
+    let input = r#"0 search/256 "MZ" DOS executable"#;
+    let (remaining, rule) = parse_magic_rule(input).unwrap();
+    assert_eq!(remaining, "");
+    assert_eq!(
+        rule.typ,
+        TypeKind::Search {
+            range: NonZeroUsize::new(256).unwrap(),
+        }
+    );
+    assert_eq!(rule.op, Operator::Equal);
+    assert_eq!(rule.value, Value::String("MZ".to_string()));
+    assert_eq!(rule.message, "DOS executable");
+
+    // regex/1l: line-based with a count of 1 (mirrors regex-eol.magic
+    // syntax). The count is now preserved, not discarded.
+    let input = r#">1 regex/1l "[0-9]+" version line"#;
+    let (remaining, rule) = parse_magic_rule(input).unwrap();
+    assert_eq!(remaining, "");
+    assert_eq!(rule.level, 1);
+    assert_eq!(
+        rule.typ,
+        TypeKind::Regex {
+            flags: RegexFlags {
+                line_based: true,
+                ..RegexFlags::default()
+            },
+            count: NonZeroU32::new(1),
+        }
+    );
+    assert_eq!(rule.message, "version line");
+}


🧹 Nitpick | 🔵 Trivial

Split new regex/search tests into a dedicated test module.

Coverage is solid, but this file is already far beyond the repository size guideline. Moving this block to src/parser/grammar/tests/regex_search.rs will keep future maintenance manageable.

As per coding guidelines: src/**/*.rs: Keep source files under 500-600 lines; split larger files into focused modules.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/parser/grammar/tests/mod.rs` around lines 2279 - 2446, Extract the regex/search tests (functions test_parse_type_and_operator_regex_and_search_suffixes, test_parse_type_and_operator_search_requires_range, test_parse_type_and_operator_regex_invalid_suffix, test_parse_type_and_operator_regex_operator_adjacent, test_parse_magic_rule_regex_and_search) into a new test file named src/parser/grammar/tests/regex_search.rs, preserving all use/imports (e.g. crate::parser::ast::{RegexFlags, TypeKind}, OffsetSpec, Operator, Value) and helper functions rx and sr; remove those tests from the large mod.rs and add mod regex_search; to the tests module so the new file is compiled, ensuring any relative paths remain crate::... so nothing else needs adjusting, then run cargo test to verify.

coderabbitai · 2026-04-11T06:09:39Z

+        "search" => TypeKind::Search {
+            range: ::std::num::NonZeroUsize::new(1).expect("1 is nonzero"),
+        },


🛠️ Refactor suggestion | 🟠 Major

Remove expect() from library mapping path.

This is a constant path; avoid panic-based construction here.

Suggested patch

"search" => TypeKind::Search { - range: ::std::num::NonZeroUsize::new(1).expect("1 is nonzero"), + range: ::std::num::NonZeroUsize::MIN, },

As per coding guidelines: src/**/*.rs: No unwrap() or expect() in library code.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

"search" => TypeKind::Search {

range: ::std::num::NonZeroUsize::new(1).expect("1 is nonzero"),

},

"search" => TypeKind::Search {

range: ::std::num::NonZeroUsize::MIN,

},

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/parser/types.rs` around lines 319 - 321, Replace the panic-based construction in the TypeKind::Search mapping by removing the expect() call and using a safe, explicit non-panicking construction; specifically, change ::std::num::NonZeroUsize::new(1).expect("1 is nonzero") to an explicit unsafe construction like unsafe { ::std::num::NonZeroUsize::new_unchecked(1) } (with a brief comment that 1 is statically non-zero) so the library code doesn't call expect()/panic.

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/src/ast-structures.md`:
- Around line 454-456: Update the documentation entry describing search anchor
behavior so it accurately states that the anchor advances to the end of the
matched pattern (match-end) rather than by the full search window; specifically
modify the bullet that currently reads "Anchor advances by the entire search
window" to "Anchor advances to the end of the matched pattern (match-end)".
Ensure the description still notes the return type `Value::String` for matched
bytes and that only `Equal` and `NotEqual` operators are supported so consumers
of the Search/anchor semantics have correct expectations.

In `@docs/src/parser.md`:
- Line 353: The docs sentence incorrectly lists "lookaheads" as an example of
zero-width matches; update the wording near the `Value::String("")` example so
it only mentions supported constructs (e.g., `^`, `$`, `a*`) and remove any
reference to look-around syntax, since `regex::bytes::Regex` does not support
lookaheads; ensure the sentence still explains that zero-width matches are
preserved as `Value::String("")` and distinguish them from genuine misses.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 947f1be6-454e-4b11-9a53-23e230972195

📥 Commits

Reviewing files that changed from the base of the PR and between 2756f8c and a13dd13.

⛔ Files ignored due to path filters (1)

docs/ARCHITECTURE.md is excluded by none and included by none

📒 Files selected for processing (3)

docs/src/ast-structures.md
docs/src/compatibility.md
docs/src/parser.md

coderabbitai · 2026-04-11T06:15:22Z

+- Returns `Value::String` containing the matched bytes if found within range
+- Anchor advances by the entire search window regardless of where the match was found
+- Only supports `Equal` and `NotEqual` operators


⚠️ Potential issue | 🟠 Major

Fix Search anchor advancement semantics.

The docs currently say the anchor advances by the full search window. This should state it advances to the end of the matched pattern (match-end), which is the implemented behavior.

As per coding guidelines: "docs/**: Review documentation quality, completeness, and accuracy. Ensure comprehensive API documentation, usage examples, and migration guides. Focus on user experience and developer onboarding."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/src/ast-structures.md` around lines 454 - 456, Update the documentation entry describing search anchor behavior so it accurately states that the anchor advances to the end of the matched pattern (match-end) rather than by the full search window; specifically modify the bullet that currently reads "Anchor advances by the entire search window" to "Anchor advances to the end of the matched pattern (match-end)". Ensure the description still notes the return type `Value::String` for matched bytes and that only `Equal` and `NotEqual` operators are supported so consumers of the Search/anchor semantics have correct expectations.

coderabbitai · 2026-04-11T06:15:22Z

+
+- Patterns are compiled with multi-line mode always enabled (matching libmagic's unconditional `REG_NEWLINE`), so `^` and `$` match at line boundaries and `.` does not match `\n`.
+- The scan window is always capped at 8192 bytes regardless of the `count` value.
+- Zero-width matches (`^`, `a*`, lookaheads) are preserved as `Value::String("")` and distinguished from genuine misses.


⚠️ Potential issue | 🟠 Major

Remove unsupported “lookaheads” from regex semantics examples.

regex::bytes::Regex does not support look-around syntax, so documenting lookaheads here is inaccurate and can mislead rule authors. Keep zero-width examples to supported constructs (e.g., ^, $, a*).

As per coding guidelines: "docs/**: Review documentation quality, completeness, and accuracy. Ensure comprehensive API documentation, usage examples, and migration guides. Focus on user experience and developer onboarding."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/src/parser.md` at line 353, The docs sentence incorrectly lists "lookaheads" as an example of zero-width matches; update the wording near the `Value::String("")` example so it only mentions supported constructs (e.g., `^`, `$`, `a*`) and remove any reference to look-around syntax, since `regex::bytes::Regex` does not support lookaheads; ensure the sentence still explains that zero-width matches are preserved as `Value::String("")` and distinguish them from genuine misses.

Post-review fixes from PR #214 internal review round 3. Grammar: * `src/parser/grammar/mod.rs`: reject a second numeric count in a regex suffix as a hard parse error (e.g., `regex/1l2l`, `regex/1c2l`, `regex/l1l2`). Libmagic accepts these with a "multiple ranges" stderr warning; we prefer a hard error so magic-file bugs surface at parse time rather than silently using the last-seen count. Tests: * `src/parser/grammar/tests/mod.rs`: add `test_parse_type_and_operator_regex_rejects_duplicate_count` covering the three duplicate-count forms above plus a `regex/l0` rejection. * `src/evaluator/types/regex.rs`: add 8192-byte boundary regression tests: - `test_read_regex_pattern_ending_exactly_at_byte_8192_matches` -- pattern at indices 8186..8192 must match (last byte is 8191, inside the cap). - `test_read_regex_pattern_starting_at_byte_8192_misses` -- pattern at index 8192 (one byte past cap) must miss. - `test_read_regex_pattern_straddling_8192_boundary_misses` -- pattern at indices 8190..8196 (first 2 bytes inside cap, last 4 past) must miss. - `test_read_regex_line_based_respects_8192_cap` -- line mode with count=None on a 9000-byte buffer must cap at 8192. * `src/evaluator/types/regex.rs`: add classic-Mac CR terminator regression tests: - `test_read_regex_line_based_bare_cr_terminator` -- bare `\r` alone counts as a single line terminator (matches GNU `file`'s memchr2('\n', '\r') loop). - `test_read_regex_line_based_mixed_terminators` -- buffer with LF, CRLF, and CR each counting as one line; count=2 stops after the second terminator (a CRLF) so line 3 is not scanned. These close the "8192 boundary untested" and "bare-CR untested" coverage gaps flagged in round 3 review. The duplicate-count rejection closes the "silent last-wins" divergence from libmagic that was also flagged. Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Post-merge cleanup of doc drift found in PR #214 review round 3. src/parser/ast.rs: * Remove the "effective byte cap of `min(n * 80, 8192)`" clause from the `TypeKind::Regex::count` rustdoc. The `n * 80` heuristic does not exist in `compute_window`; the code applies only the flat 8192 cap and walks line terminators. Reword to describe the actual behavior (8192 cap always applies regardless of count). src/evaluator/types/mod.rs: * Rewrite the `bytes_consumed_with_pattern` docstring section on pattern-bearing types. The old doc claimed Search's consumed distance "is the entire search window regardless of where the match was found" -- this is the pre-fix window-end bug, not the current match-end behavior. Rewrite to describe GNU `file` semantics: regex advances by `m.end()` (or `m.start()` with `/s`), search advances by `match_idx + pattern.len()`. Both arms fire `debug_assert!` on engine-invariant violations. * Rewrite the `TypeReadError::UnsupportedType` variant doc. The old text listed regex/date/timestamp as "reserved for future types not yet evaluatable" -- all three are now implemented, and the variant is actively used for regex pattern compile errors, missing pattern operands, and non-equality operators on pattern-bearing types. Rewrite to describe the current reporting contract. No code changes; this commit only corrects documentation that contradicts the shipped implementation. Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Post-merge coverage fix from PR #214 review round 3. All 8 corpus tests in tests/regex_search_corpus_tests.rs previously used programmatic rule construction (`regex_rule`/`search_rule`), so the grammar-to-evaluator pipeline for the new regex/search flag syntax was never tested end-to-end via `parse_text_magic_file`. New tests: * `test_json1_corpus_parser_roundtrip` — parses a plain `0 regex "^\\s*[\\{\\[]" JSON text data` magic rule via `parse_text_magic_file` and evaluates it against `json1.testfile`. Exercises the default-flags, no-count regex path from text through to the evaluator. * `test_regex_flag_parser_roundtrip_case_insensitive` — same pattern with `regex/c` instead of `regex`. Confirms the `/c` flag threads through `parse_type_and_operator` into `RegexFlags::case_insensitive`. * `test_search_parser_roundtrip_with_range` — `search/32 "ABC"` parsed from text and evaluated against `searchbug.testfile`. Exercises the `NonZeroUsize` range path through the grammar layer. * `test_search_parser_rejects_bare_search` — magic-file text `search "ABC"` (no /N) must be a parse error, confirming the `NonZeroUsize` constraint is enforced at parse time, not just at programmatic construction time. These close the "parser round-trip untested for new flag syntax" coverage gap flagged in review. Existing `regex_rule`/`search_rule` programmatic tests are kept for the searchbug hierarchical scan which still depends on features the parser doesn't yet support (`use`/`name`/`offset` directives). Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

…sign Post-merge doc cleanup from PR #214 review round 3. docs/solutions/.../implementing-variable-width-typekind-variant.md: * Rewrite the `## Solution` section so the code blocks describe the final (V5) design instead of the obsolete V0 signatures. The old snippets showed `read_regex` with `start_of_line: bool`, `build_regex` with `^(?:{pattern})` wrapping, and `read_search` with `Option<usize>` — all three were removed in the completion pass but the solution doc was never updated. The `# Prevention` section was already correct; only the code blocks in `## Solution` drifted. * New code blocks describe `build_regex` (unconditional multi-line + no dot-newline), `compute_window` (8192 cap + line-terminator walk), `read_regex`/`read_search` with their current signatures (`RegexFlags`, `Option<NonZeroU32>`, `NonZeroUsize`), and the `/s`-flag anchor-advance in `regex_bytes_consumed`. * Rewrite the `## Testing` section to list the actual test coverage: line-based window tests, 8192 boundary guards, zero-width match distinction, `/s` flag tests, non-equality operator rejection, parser last-wins rejection, corpus integration tests. AGENTS.md: * Development Phases v0.3 entry used to list `/s`, `/l` line-count, `regex/Nl`, search range enforcement, and 8192 default as planned work. All are now implemented. Replace with the actual v0.3 scope: Aho-Corasick multi-pattern optimization, compiled-regex caching, and `!:mime`/`!:ext` directive evaluation. docs/src/evaluator.md: * The Type Reading section described `/l` as "multi-line start-of-line anchoring" — this is the pre-fix semantic, not the current one. Rewrite to list all three regex flags (`/c`, `/s`, `/l`) with correct meanings, note that multi-line mode is unconditional, and describe the 8192-byte hard cap. Also update the Search bullet to mention the mandatory `NonZeroUsize` range. .github/copilot-instructions.md: * Remove "regex type is planned for future releases and is not yet implemented (#39)" note — regex and search are both implemented. * Remove the `BinaryRegex` trait sketch (binary-safe matching is done via `regex::bytes::Regex` directly, not through a wrapper trait). * Update the Currently Implemented list to mention Regex and Search with their flag sets. * Update the v1.0+ Planned Features list to drop regex/search items and add Aho-Corasick + named tests. No code changes; this commit only brings documentation in line with the shipped implementation. Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Closes deferred review suggestion #22 from PR #214 round-3. Before: `type_keyword_to_kind` returned a placeholder `TypeKind::Search { range: NonZeroUsize::new(1).unwrap() }` for the `search` keyword and a bare `TypeKind::Regex { .. }` for `regex`, both of which were "safe" only because grammar/mod.rs always overwrites them via direct construction before the value escapes. That's an invariant-by-convention, which is exactly the anti-pattern this PR has been trying to remove via `NonZeroUsize` and `NonZeroU32` in the AST. After: the function returns `Option<TypeKind>` and yields `None` for both `regex` and `search`. Grammar's `_ => type_keyword_to_kind(type_name)` branch now uses `.expect(...)` with an explanatory message that documents the "regex/search handled directly in grammar arms above" invariant — so a future caller who adds a new suffix-required type without wiring the grammar arm first will fail loudly at the `expect`, rather than silently producing a placeholder that gets eaten by downstream code. Tests updated: - `test_type_keyword_to_kind_*` tests wrap each expected `TypeKind` in `Some(...)`. - `test_type_keyword_to_kind_regex_and_search_return_none` is new — it pins both halves of the split invariant (parse_type_keyword recognizes the keyword, type_keyword_to_kind returns None). - `test_pstring_keyword_defaults_to_one_byte_width` uses `.expect()` on the Option return. - `test_roundtrip_all_keywords` is split into two loops: one for convertible keywords (must produce `Some`) and one for regex/search (must produce `None`), so both sides of the contract are asserted explicitly. Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Closes deferred review suggestions #19 and #20 from PR #214 round-3. ## #19: RegexCount enum replaces line_based bool + count Option The previous design encoded the three regex scan-window modes across two separate fields: RegexFlags { line_based: bool, .. } TypeKind::Regex { count: Option<NonZeroU32>, .. } This admitted four `(line_based, count)` combinations while libmagic only has three semantic modes, leaving a degenerate `line_based: true, count: None` state that in practice collapsed to the same behavior as `line_based: false, count: None`. It also forced readers to remember that `count` means bytes in one branch and lines in another. Replace both fields with a single `RegexCount` enum: ```rust pub enum RegexCount { Default, // plain `regex` (8192-byte default) Bytes(NonZeroU32), // `regex/N` (byte-bounded) Lines(Option<NonZeroU32>), // `regex/l` or `regex/Nl` } ``` - The three libmagic scan modes map 1:1 to the three variants. - The degenerate `line_based: true, count: None` state is now unrepresentable: line-mode is exclusively the `Lines(_)` variant. - The byte-vs-line distinction is encoded at the type level, so match arms at every call site (`compute_window`, strength calculation, codegen, property tests) read exhaustively rather than relying on a sibling `bool`. - `Lines(None)` preserves the `regex/l` shorthand (no explicit line count), which walks terminators to the end of the 8192-byte capped window. `RegexFlags` is now just `{ case_insensitive, start_offset }`. The `/l` modifier moves from being a flag into being the `Lines` variant of `RegexCount`. Ripple: - `compute_window` (`src/evaluator/types/regex.rs`): rewrite the branch dispatch to match on `RegexCount` instead of `flags.line_based`. - `read_regex` and `regex_bytes_consumed`: take `RegexCount` instead of `(flags, Option<NonZeroU32>)`. - `bytes_consumed_with_pattern`: pass `*count` through via `Copy`. - `strength.rs`: match on `RegexCount` for the "explicit count = more specific = +5 bonus" logic. - `parser/codegen.rs::serialize_type_kind`: emit the three variants explicitly with `NonZeroU32::new(n).expect("nonzero")`. - `parser/grammar/mod.rs`: accumulate `line_based: bool` and `count_value: Option<NonZeroU32>` in the suffix loop, then collapse to the appropriate `RegexCount` variant at the end. - `tests/property_tests.rs::arb_type_kind`: generator now produces all three variants with weighted odds. - All test constructions updated across engine tests, types tests, grammar tests, corpus tests, and the evaluator_tests hierarchical rule fixture. ## #20: REGEX_MAX_BYTES relocated to evaluator::types::regex The constant is runtime evaluation policy, not AST shape — it lives where `compute_window` reads it. Moving it out of `parser::ast` also removes a latent coupling to the build-script compilation unit (`ast.rs` is shared with `build.rs` per GOTCHAS S1.1). The constant is now `pub(crate)` in `src/evaluator/types/regex.rs` with an in-comment explanation of the rationale. The rustdoc on `TypeKind::Regex::count` used to link to `[REGEX_MAX_BYTES]` via intra-doc link; rewritten to cite the "8192-byte cap" inline so the AST module has no dependency on the evaluator constant. Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Closes deferred review suggestion #23 from PR #214 round-3. Before: grammar/mod.rs was 951 lines, past the 800-line hard cap documented in AGENTS.md coding-style.md. The bulk of the growth was the `parse_type_and_operator` function (over 250 lines) which handled pstring, regex, search, and the attached bitwise-AND operator inline. After: grammar/mod.rs is 793 lines. The extracted suffix parsers live in a new `grammar/type_suffix.rs` submodule (232 lines) with four functions: - `parse_pstring_suffix(input)` — the `/[BHhLl][J]?` or bare `/J` pstring length-width parser. Previously inline as a standalone fn; now lives next to its peers. - `parse_regex_suffix(input, suffix_rest)` — the "any order, last count wins but duplicates rejected" regex modifier parser. Returns `(RegexFlags, RegexCount)` with the tri-state collapse inlined. - `parse_search_suffix(input, suffix_rest)` — the mandatory `NonZeroUsize` range parser for `search/N`. - `parse_attached_operator(input)` — the post-type `&<mask>` / `&` / `&&` handler. This is the new extraction; `parse_type_and_operator` previously had ~20 lines of inline match-on-`strip_prefix('&')` logic. `parse_type_and_operator` in grammar/mod.rs is now a thin orchestrator: call `parse_type_keyword`, dispatch to the appropriate suffix parser based on `type_name`, then `parse_attached_operator`, then build the final `TypeKind` in a short match. No behavior change. The old `parse_type_and_operator` had `#[allow(clippy::too_many_lines)]`; that attribute is now removed since the function fits under the default threshold. Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

## Summary Follow-up cleanup for [PR #214](#214) (which closed #39). The review-round-3 findings on #214 landed after it merged, so this PR carries the fixes as a separate unit of work. All fixes are internal review findings — no user-visible behavior change beyond the grammar now hard-rejecting duplicate regex counts. ## Changes ### Parser: reject duplicate regex counts (`5ca9a6f`) - `src/parser/grammar/mod.rs`: `regex/1l2l`, `regex/1c2l`, `regex/l1l2` are now hard parse errors. Libmagic accepts them with a `"multiple ranges"` stderr warning; we prefer a hard error so magic-file bugs surface at parse time rather than silently using the last-seen count. - Test: `test_parse_type_and_operator_regex_rejects_duplicate_count`. ### Tests: 8192-byte boundary + bare-CR line terminator (`5ca9a6f`) - Exact-boundary regression tests for the `FILE_REGEX_MAX` cap: pattern ending at byte 8191 (must match), starting at 8192 (must miss), straddling the boundary (must miss), line-based scan respecting the cap. - Bare-CR (classic Mac) line terminator tests: `\r` alone counts as a single terminator, mixed LF/CRLF/CR buffer terminator walking. ### Rustdoc: fix `RegexFlags::count` and `bytes_consumed_with_pattern` drift (`5b9df07`) - `src/parser/ast.rs`: remove the "effective byte cap of `min(n * 80, 8192)`" clause from `TypeKind::Regex::count` rustdoc. No such heuristic exists in `compute_window`. - `src/evaluator/types/mod.rs`: rewrite the `bytes_consumed_with_pattern` docstring section on pattern-bearing types. The old wording claimed Search advances by the window size — that was the pre-fix bug, not current behavior. - `src/evaluator/types/mod.rs`: rewrite the `TypeReadError::UnsupportedType` variant doc. Old text listed regex/date/timestamp as "reserved for future types not yet evaluatable"; all three are now implemented. ### Tests: parser round-trip corpus tests (`648f1a8`) - `tests/regex_search_corpus_tests.rs`: 4 new tests that exercise the grammar → evaluator pipeline for regex/search flag syntax via `parse_text_magic_file`. Previously all corpus tests used programmatic rule construction, so the parser-side of the new flag syntax was never tested end-to-end. - Covers: plain `regex` rule, `regex/c` case-insensitive flag, `search/N` range, and bare `search` parse rejection. ### Docs: rewrite solution doc and clean up stale references (`aeb61d6`) - `docs/solutions/integration-issues/implementing-variable-width-typekind-variant.md`: rewrite the `## Solution` section so the code blocks describe the final (V5) design. Old snippets showed `read_regex` with `start_of_line: bool`, `build_regex` with `^(?:{pattern})` wrapping, and `read_search` with `Option<usize>` — all three were removed in the completion pass but the doc was never updated. - `AGENTS.md`: Development Phases v0.3 entry listed items that are already implemented. Replaced with actual v0.3 scope (Aho-Corasick, regex caching, `!:mime`/`!:ext` directives). - `docs/src/evaluator.md`: Type Reading section described `/l` as "multi-line start-of-line anchoring". Rewritten to list all three regex flags correctly and note the 8192-byte cap. - `.github/copilot-instructions.md`: remove "regex type is planned for future releases" note; remove `BinaryRegex` trait sketch; update Currently Implemented and Planned Features lists. ## Test Plan - [x] `mise exec -- cargo nextest run` — 1218/1218 pass (+11 new regression tests) - [x] `mise exec -- cargo clippy --all-targets -- -D warnings` — clean - [x] `mise exec -- cargo test --doc` — 167/167 doctests pass - [x] `mise exec -- cargo fmt --check` — clean - [x] `mise exec -- just ci-check` — full CI suite green locally - [ ] CodeRabbit review clean - [ ] Manual smoke: verify `regex/1l2l` returns a parse error (hit by `test_parse_type_and_operator_regex_rejects_duplicate_count`) - [ ] Manual smoke: verify a regex pattern past byte 8192 is NOT matched by an un-counted `regex` rule (hit by the three 8192 boundary tests) ## Review Findings Addressed From the round-3 review on PR #214 (after merge): | # | Finding | Status | |---|---|---| | C1 | bare-CR line terminator doc + test | ✅ docstring updated, tests added (`5ca9a6f`) | | C2 | `n * 80` claim in `TypeKind::Regex::count` rustdoc | ✅ removed (`5b9df07`) | | C3 | `bytes_consumed_with_pattern` docstring drift | ✅ rewritten (`5b9df07`) | | C4 | Solution doc `## Solution` section drift | ✅ rewritten (`aeb61d6`) | | C5 | Silent "last count wins" on `regex/1l2l` | ✅ hard parse error (`5ca9a6f`) | | I7 | Missing 8192-byte boundary tests | ✅ 4 new tests (`5ca9a6f`) | | I12 | Parser→evaluator round-trip untested for new flag syntax | ✅ 4 round-trip tests (`648f1a8`) | | I13 | `/l` = "multi-line start-of-line anchoring" in docs/src/evaluator.md | ✅ fixed (`aeb61d6`) | | I14 | AGENTS.md v0.3 roadmap stale | ✅ fixed (`aeb61d6`) | | I15 | `TypeReadError::UnsupportedType` variant doc | ✅ rewritten (`5b9df07`) | | I16 | `.github/copilot-instructions.md` stale regex reference | ✅ fixed (`aeb61d6`) | Deferred (design-level, warrant their own discussion): - Suggestion #19: `RegexCount { Default, Bytes, Lines }` enum refactor - Suggestion #20: move `REGEX_MAX_BYTES` from `ast.rs` to `evaluator::types::regex` - Suggestion #21: rename `line_based` → `line_counted` - Suggestion #22: `type_keyword_to_kind` signature change - Suggestion #23: split `src/parser/grammar/mod.rs` (now 914 lines) Refs #39 --------- Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> Co-authored-by: dosubot[bot] <131922026+dosubot[bot]@users.noreply.github.com>

## 🤖 New release * `libmagic-rs`: 0.5.0 -> 0.6.0 (⚠ API breaking changes) ### ⚠ `libmagic-rs` breaking changes ```text --- failure constructible_struct_adds_field: externally-constructible struct adds field --- Description: A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field. ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/constructible_struct_adds_field.ron Failed in: field MagicRule.value_transform in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:1189 field MagicRule.value_transform in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:1189 field MagicRule.value_transform in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:1189 --- failure copy_impl_added: type now implements Copy --- Description: A public type now implements Copy, causing non-move closures to capture it by reference instead of moving it. ref: rust-lang/rust#100905 impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/copy_impl_added.ron Failed in: libmagic_rs::mime::MimeMapper in /tmp/.tmpwFvgw1/libmagic-rs/src/mime.rs:98 --- failure enum_marked_non_exhaustive: enum marked #[non_exhaustive] --- Description: A public enum has been marked #[non_exhaustive]. Pattern-matching on it outside of its crate must now include a wildcard pattern like `_`, or it will fail to compile. ref: https://doc.rust-lang.org/cargo/reference/semver.html#attr-adding-non-exhaustive impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/enum_marked_non_exhaustive.ron Failed in: enum OffsetSpec in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:198 enum OffsetSpec in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:198 enum OffsetSpec in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:198 enum LibmagicError in /tmp/.tmpwFvgw1/libmagic-rs/src/error.rs:15 enum LibmagicError in /tmp/.tmpwFvgw1/libmagic-rs/src/error.rs:15 enum IoError in /tmp/.tmpwFvgw1/libmagic-rs/src/io/mod.rs:26 enum Operator in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:838 enum Operator in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:838 enum Operator in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:838 enum TypeReadError in /tmp/.tmpwFvgw1/libmagic-rs/src/evaluator/types/mod.rs:56 enum ParseError in /tmp/.tmpwFvgw1/libmagic-rs/src/error.rs:74 enum ParseError in /tmp/.tmpwFvgw1/libmagic-rs/src/error.rs:74 enum Value in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:965 enum Value in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:965 enum Value in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:965 enum TypeKind in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:398 enum TypeKind in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:398 enum TypeKind in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:398 enum EvaluationError in /tmp/.tmpwFvgw1/libmagic-rs/src/error.rs:148 enum EvaluationError in /tmp/.tmpwFvgw1/libmagic-rs/src/error.rs:148 --- failure enum_struct_variant_field_added: pub enum struct variant field added --- Description: An enum's exhaustive struct variant has a new field, which has to be included when constructing or matching on this variant. ref: https://doc.rust-lang.org/reference/attributes/type_system.html#the-non_exhaustive-attribute impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/enum_struct_variant_field_added.ron Failed in: field base_relative of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:251 field adjustment_op of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:266 field result_relative of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:272 field base_relative of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:251 field adjustment_op of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:266 field result_relative of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:272 field base_relative of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:251 field adjustment_op of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:266 field result_relative of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:272 --- failure function_missing: pub fn removed or renamed --- Description: A publicly-visible function cannot be imported by its prior path. A `pub use` may have been removed, or the function itself may have been renamed or removed entirely. ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/function_missing.ron Failed in: function libmagic_rs::parser::grammar::is_empty_line, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:1025 function libmagic_rs::parser::grammar::parse_strength_directive, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:846 function libmagic_rs::parser::grammar::parse_type_and_operator, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:683 function libmagic_rs::parser::grammar::parse_offset, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:179 function libmagic_rs::parser::parse_offset, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:179 function libmagic_rs::parser::grammar::parse_comment, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:1004 function libmagic_rs::parser::grammar::parse_message, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:810 function libmagic_rs::parser::grammar::parse_value, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:633 function libmagic_rs::parser::grammar::parse_number, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:133 function libmagic_rs::parser::parse_number, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:133 function libmagic_rs::parser::grammar::has_continuation, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:1060 function libmagic_rs::parser::grammar::parse_magic_rule, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:946 function libmagic_rs::parser::grammar::parse_rule_offset, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:779 function libmagic_rs::parser::grammar::is_comment_line, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:1042 function libmagic_rs::parser::grammar::is_strength_directive, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:902 function libmagic_rs::parser::grammar::parse_type, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:749 function libmagic_rs::parser::grammar::parse_operator, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:227 --- failure function_parameter_count_changed: pub fn parameter count changed --- Description: A publicly-visible function now takes a different number of parameters. ref: https://doc.rust-lang.org/cargo/reference/semver.html#fn-change-arity impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/function_parameter_count_changed.ron Failed in: libmagic_rs::evaluator::evaluate_single_rule now takes 3 parameters instead of 2, in /tmp/.tmpwFvgw1/libmagic-rs/src/evaluator/engine/mod.rs:196 --- failure inherent_method_missing: pub method removed or renamed --- Description: A publicly-visible method or associated fn is no longer available under its prior name. It may have been renamed or removed entirely. ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/inherent_method_missing.ron Failed in: FileBuffer::create_symlink, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/io/mod.rs:326 EvaluationContext::increment_recursion_depth, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/evaluator/mod.rs:114 EvaluationContext::decrement_recursion_depth, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/evaluator/mod.rs:130 EvaluationContext::increment_recursion_depth, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/evaluator/mod.rs:114 EvaluationContext::decrement_recursion_depth, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/evaluator/mod.rs:130 --- failure module_missing: pub module removed or renamed --- Description: A publicly-visible module cannot be imported by its prior path. A `pub use` may have been removed, or the module may have been renamed, removed, or made non-public. ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/module_missing.ron Failed in: mod libmagic_rs::parser::grammar, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:4 --- failure struct_marked_non_exhaustive: struct marked #[non_exhaustive] --- Description: A public struct has been marked #[non_exhaustive], which will prevent it from being constructed using a struct literal outside of its crate. It previously had no private fields, so a struct literal could be used to construct it outside its crate. ref: https://doc.rust-lang.org/cargo/reference/semver.html#attr-adding-non-exhaustive impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/struct_marked_non_exhaustive.ron Failed in: struct EvaluationConfig in /tmp/.tmpwFvgw1/libmagic-rs/src/config.rs:42 ``` <details><summary>Changelog</summary> <blockquote> ## [0.6.0] - 2026-04-25 ### Features - **parser**: Add Date and QDate types with serialization support ([#165](#165)) - **parser**: Implement pstring (Pascal string) type ([#170](#170)) - **parser**: Implement pstring multi-byte length prefix variants (/B, /H, /h, /L, /l, /J) ([#183](#183)) - **evaluator**: Add debug-level tracing for skipped rules ([#184](#184)) - **evaluator**: Implement indirect offset resolution ([#37](#37)) ([#199](#199)) - **evaluator**: Implement relative offset resolution ([#38](#38)) ([#211](#211)) - **deps**: Add new skills to actionbook/rust-skills and trailofbits/skills - **evaluator**: Regex and search types (closes #39) ([#214](#214)) - Implement libmagic meta-type directives and format substitution ([#42](#42)) ([#230](#230)) ### Bug Fixes - **regex**: PR #214 follow-up review findings ([#215](#215)) - Load and correctly evaluate /usr/share/file/magic/filesystems and adjacent magic files ([#233](#233)) ### Documentation - **gotchas**: Clarify requirements for adding TypeKind variants ### Miscellaneous Tasks - Rename .coderabbitai.yaml to .coderabbit.yaml - **Mergify**: Configuration update ([#173](#173)) - Update .gitignore to exclude local AI assistant files - **mergify**: Upgrade configuration to current format ([#205](#205)) - Resolve all pending TODO items ([#212](#212)) - **mergify**: Upgrade configuration to current format ([#231](#231))  ### Security - **io**: Close TOCTOU race in `FileBuffer::new` metadata validation (CWE-367). `validate_file_metadata` now uses `File::metadata()` on the open descriptor instead of re-canonicalizing the path, so an attacker cannot swap the path between `open_file` and validation. Error paths now report the caller-supplied path rather than the canonicalized variant. - **cli**: Remove relative-path fallbacks from `default_magic_file_path` (CWE-426). `./missing.magic`, `./third_party/magic.mgc`, and the `CI`/`GITHUB_ACTIONS` env-var branch no longer resolve against the process cwd. CI pipelines must pass `--magic <path>` explicitly. - **evaluator**: `build_regex` now bounds `size_limit` and `dfa_size_limit` to 1 MiB (`REGEX_COMPILE_SIZE_LIMIT`) to reject compile-time DoS patterns (CWE-1333) from adversarial magic files. ### Features - **parser**: Implement meta-type directives: `name`/`use` subroutines, `default`/`clear` per-level fallback, and `indirect` re-evaluation. `parse_text_magic_file` now returns `ParsedMagic { rules, name_table }` (breaking change from `Vec<MagicRule>`). Named subroutines are hoisted into `NameTable` at load time and dispatched via `RuleEnvironment` in the evaluator. Recursion is bounded by `EvaluationConfig::max_recursion_depth`. Resolves [#42](#42). - **evaluator**: Thread-local regex compile cache eliminates the double-compile paid by every successful regex match. `regex_bytes_consumed` now reuses the compiled `Regex` from `read_regex` instead of recompiling the pattern to derive the anchor advance. The cache is reset at the start of every `evaluate_rules_with_config` call, bounding memory to one evaluation. - **config**: `EvaluationConfig` is now `#[non_exhaustive]`; new builder-style setters (`with_max_recursion_depth`, `with_max_string_length`, `with_stop_at_first_match`, `with_mime_types`, `with_timeout_ms`) let external crates construct configurations without struct literals. - **parser**: `MagicRule::new()` smart constructor with `::with_children()`, `::with_strength_modifier()`, `::with_level()` builder methods and a `::validate()` method enforcing structural invariants (non-empty message, `level <= MAX_LEVEL`, children nested strictly deeper than parent). New `MagicRuleValidationError` error type. - **parser**: `RegexFlags::with_case_insensitive()` and `::with_start_offset()` builder methods. ### Refactor - **engine**: Extract `evaluate_pattern_rule()` and `evaluate_value_rule()` helpers from `evaluate_single_rule_with_anchor`'s 90-line body. Dispatch is now a two-arm type-category split; each helper has focused rustdoc on semantics and invariants. - **types**: Replace the `_ =>` catch-all in `bytes_consumed_with_pattern` with an explicit listing of the fixed-width `TypeKind` variants. Adding a new variable-width variant without updating this match is now a compile error instead of a silent relative-offset anchor corruption in release builds. - **parser**: Split the 185-line `type_keyword_to_kind` match into per-family helpers (`byte_family`, `short_family`, `long_family`, `quad_family`, `float_family`, `double_family`, `date_family`, `qdate_family`, `string_family`). Drops the `#[allow(clippy::too_many_lines)]` attribute. - **main**: `main()` returns `std::process::ExitCode` instead of calling `process::exit`, so destructors run on the happy path. Ctrl-C `AtomicBool` flag uses `Ordering::Relaxed` instead of `SeqCst`. - **grammar**: `parse_strength_directive` uses nom 8's `preceded` + `Parser::map` instead of the legacy `map(pair(char(...), parse_number), |(_, n)| ...)` pattern. - **output**: Add `#[serde(skip_serializing_if = "Option::is_none", default)]` to public `Option<T>` fields so JSON output no longer emits `"field": null` for unset optional values. ### Documentation - **lib**: Add `# Security` sections to `MagicDatabase::with_builtin_rules`, `::with_builtin_rules_and_config`, `::load_from_file`, and `::load_from_file_with_config` warning about the unbounded default timeout and recommending `EvaluationConfig::performance()` for untrusted input. - **lib**: Document `MagicDatabase: Send + Sync` for parallel scanning. - **README**: Update `TypeKind` enum example to match the current AST, add `regex` and `search/N` to the supported types table, add pre-1.0 API stability warning, correct the roadmap to mark v0.2-v0.4 as shipped. - **AGENTS.md**: Relabel "Currently Implemented (v0.1.0)" and "Current Limitations (v0.1.0)" to v0.5.0 and rewrite the Development Phases section to reflect actual shipped scope. ### Testing - Security regression tests for S-H1 (planted-magic-file in cwd), S-H2 (TOCTOU path-swap contract), S-M2 (pathological regex bounded runtime), S-L2 (codegen message escape round-trip), and GOTCHAS S13.1 (`EvaluationConfig::default()` unbounded timeout invariant). - Backspace message concatenation regression tests for first-match, consecutive, and empty-rest edge cases. - `MagicRule::validate()` tests covering empty message, child level invariant, and max-depth rejection. - `RegexCache` population/clear/reuse tests. ### Breaking Changes - **parser**: `parse_text_magic_file` return type changed from `Result<Vec<MagicRule>, ParseError>` to `Result<ParsedMagic, ParseError>`. Callers must destructure `ParsedMagic { rules, name_table }`. Low-level callers that only need the rule list can use `parsed.rules`. `load_magic_file` and `load_magic_directory` return the same new type. </blockquote> </details> --- This PR was generated with [release-plz](https://github.com/release-plz/release-plz/). Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

unclesp1d3r added 7 commits April 10, 2026 22:49

docs(project): update language server list and configuration options

ae71cac

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

unclesp1d3r linked an issue Apr 11, 2026 that may be closed by this pull request

Parser: implement regex and search types #39

Closed

7 tasks

Copilot AI review requested due to automatic review settings April 11, 2026 05:54

dosubot Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Apr 11, 2026

Copilot started reviewing on behalf of unclesp1d3r April 11, 2026 05:55 View session

dosubot Bot added evaluator Rule evaluation engine and logic parser Magic file parsing components and grammar testing Test infrastructure and coverage labels Apr 11, 2026

Copilot AI reviewed Apr 11, 2026

View reviewed changes

docs: Dosu updates for PR #214

a13dd13

coderabbitai Bot reviewed Apr 11, 2026

View reviewed changes

coderabbitai Bot removed parser Magic file parsing components and grammar evaluator Rule evaluation engine and logic testing Test infrastructure and coverage labels Apr 11, 2026

coderabbitai Bot reviewed Apr 11, 2026

View reviewed changes

unclesp1d3r merged commit c05dfdc into main Apr 11, 2026
30 checks passed

unclesp1d3r deleted the 39-parser-implement-regex-and-search-types branch April 11, 2026 06:30

unclesp1d3r restored the 39-parser-implement-regex-and-search-types branch April 11, 2026 06:31

unclesp1d3r deleted the 39-parser-implement-regex-and-search-types branch April 11, 2026 06:31

unclesp1d3r mentioned this pull request Apr 11, 2026

fix(regex): PR #214 follow-up review findings #215

Merged

8 tasks

This was referenced Apr 25, 2026

Implement string-type flag semantics (/c /C /w /W /B /b /t /T) #234

Open

Implement search-type flag semantics (/s start-anchor, /b binary, etc.) #235

Open

github-actions Bot mentioned this pull request Apr 25, 2026

chore: release v0.6.0 #240

Merged

	- Regex: Binary-safe regex matching via `regex::bytes::Regex`; the `/c` flag enables case-insensitive matching and `/l` enables multi-line start-of-line anchoring
	- Regex: Binary-safe regex matching via `regex::bytes::Regex`; the `/c` flag enables case-insensitive matching, `^`/`$` line-boundary behavior is always enabled, and `/l` switches scan-window bounding to line-count semantics rather than byte-count semantics

	//! search window is `buffer[offset..]` capped by the optional `range`.
	//! search window is `buffer[offset..]` capped by the mandatory `range`
	//! (or the remaining buffer length, if smaller).

Uh oh!

Conversation

unclesp1d3r commented Apr 11, 2026

Summary

Issue #39 Acceptance Criteria

Test Plan

Notes for Reviewers

Uh oh!

mergify Bot commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merge Protections

🟢 Enforce conventional commit

🟢 Full CI must pass

🟢 Do not merge outdated PRs

Uh oh!

coderabbitai Bot commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

dosubot Bot commented Apr 11, 2026

ARCHITECTURE /libmagic-rs/blob/main/docs/ARCHITECTURE.md — ⏳ Awaiting Merge

ast-structures /libmagic-rs/blob/main/docs/src/ast-structures.md — ⏳ Awaiting Merge

compatibility /libmagic-rs/blob/main/docs/src/compatibility.md — ⏳ Awaiting Merge

parser /libmagic-rs/blob/main/docs/src/parser.md — ⏳ Awaiting Merge

Uh oh!

codecov Bot commented Apr 11, 2026

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

mergify Bot commented Apr 11, 2026 •

edited

Loading

coderabbitai Bot commented Apr 11, 2026 •

edited

Loading

ARCHITECTURE `/libmagic-rs/blob/main/docs/ARCHITECTURE.md` — ⏳ Awaiting Merge

ast-structures `/libmagic-rs/blob/main/docs/src/ast-structures.md` — ⏳ Awaiting Merge

compatibility `/libmagic-rs/blob/main/docs/src/compatibility.md` — ⏳ Awaiting Merge

parser `/libmagic-rs/blob/main/docs/src/parser.md` — ⏳ Awaiting Merge