feat(evaluator): regex and search types (closes #39)#214
Conversation
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Implements pattern-based file matching via `TypeKind::Regex` and
`TypeKind::Search`, a foundation for text file detection (JSON, XML,
shell scripts, etc.).
Parser:
* Add `Regex { case_insensitive, start_of_line }` and `Search { range }`
variants to `TypeKind` in `parser/ast.rs`. Extend `parse_type_keyword`,
`type_keyword_to_kind`, grammar suffix parsing (`regex/c`, `regex/l`,
`regex/cl`, `search/N`), `serialize_type_kind`, and strength modifiers.
* Update `arb_type_kind` in `tests/property_tests.rs` for the fuzz
round-trip of the new variants.
* Grammar flag scanner uses `strip_prefix` instead of direct `&rest[1..]`
slicing per AGENTS.md memory-safety rule.
Evaluator:
* `src/evaluator/types/regex.rs` wraps `regex::bytes::RegexBuilder` for
binary-safe matching. `/c` enables case-insensitive, `/l` wraps the
pattern in `^(?:...)` with multi-line mode (placeholder for the
libmagic `/l` line-count semantics tracked for v0.3).
* `src/evaluator/types/search.rs` uses `memchr::memmem::find` for
bounded literal scanning.
* Both readers return `Result<Option<Value>, TypeReadError>`: `None`
is a genuine miss, `Some(v)` is a match (including legitimate
zero-width regex matches like `^`, `a*`, lookaheads).
* Add `read_typed_value_with_pattern` and `read_pattern_match` dispatch
functions in `types/mod.rs`. The engine calls `read_pattern_match`
directly to preserve the `Option` discriminant; `read_typed_value`
remains a 3-arg convenience wrapper forwarding `pattern: None`.
* `bytes_consumed_with_pattern` threads the pattern through to
`regex_bytes_consumed` and `search_bytes_consumed`. The search arm
re-runs `memchr::memmem::find` and returns `match_idx + pattern.len()`
(match-end), matching GNU `file` softmagic.c FILE_SEARCH semantics.
An earlier revision returned the full window size, which silently
corrupted relative-offset children of every successful search match.
* Add `debug_assert!` guards on the silent-0 fallback paths in
`regex_bytes_consumed`, `search_bytes_consumed`, and the Regex/Search
arms of `bytes_consumed_with_pattern` so engine-invariant violations
are caught loudly in dev/test builds.
Engine:
* `evaluate_single_rule_with_anchor` splits into two arms. Pattern-
bearing types (`Regex`, `Search`) go through `read_pattern_match` and
drive `Equal`/`NotEqual` from the `Option` discriminant directly.
Non-equality operators on pattern types return
`TypeReadError::UnsupportedType` instead of silently falling through
to `apply_operator`, which would have produced nonsense lexicographic
comparisons against the pattern source text.
* Non-pattern types continue through the existing
`read_typed_value_with_pattern` + `coerce_value_to_type` +
`apply_operator` path.
Dependencies:
* Move `regex = "1.12.3"` from dev-dependencies to dependencies. The
`bytes` feature was dropped in v1.12; `regex::bytes` is unconditional
so no feature flag is needed.
Note: This commit covers the core feature. It does NOT yet implement
the `/s` (start-offset) flag, the libmagic `/l` line-count semantics,
the `regex/Nl` numeric prefix, `NonZeroUsize` search ranges, or the
8192-byte default regex scan window. Those are tracked for a follow-up
against issue #39 before closing it.
Refs #39
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Regression guards for the three silent-failure classes flagged in code
review:
bytes_consumed dispatch (GOTCHAS 2.1):
* `test_bytes_consumed_regex_with_string_pattern` and
`test_bytes_consumed_regex_no_match_returns_zero` cover the Regex arm
of `bytes_consumed_with_pattern`. Before this test the only coverage
of the Regex arm was transitive through the corpus test and did not
assert numeric anchor positions.
* `test_bytes_consumed_regex_zero_width_match_returns_zero` guards the
interaction between the C3 fix (zero-width matches distinguished from
misses) and the anchor advance: a zero-width match returns `m.end() == 0`
so the anchor does not move.
* `test_bytes_consumed_search_with_pattern_is_match_end` is the
regression guard for the pre-fix behavior that returned the full
search window size instead of `match_idx + pattern.len()`. Buffer
`"abcWorld_xyz"` with `search/10 "World"` must advance the anchor by
8, not by 10.
* `test_bytes_consumed_search_bytes_pattern_works` exercises the
`Value::Bytes` pattern dispatch path that was previously accepted by
the reader but untested.
Engine-level Relative(N) after pattern types:
* `test_regex_parent_advances_anchor_for_relative_child` and
`test_search_parent_advances_anchor_to_match_end_not_window_end`
construct parent/child rule trees and assert that a `Relative(0)`
child reads the byte immediately after the parent's match. This is
the end-to-end test that would have caught the search window-vs-
match-end bug -- the existing `test_regex_eol_corpus` passes on
substring assertions and would not have detected a wrong anchor
position.
* `test_search_parent_relative_child_at_positive_offset` exercises
`Relative(+1)` to confirm the match-end base is correct (not
match-start, not window-end).
Non-equality operator rejection:
* `test_regex_rule_with_ordering_operator_is_rejected` and
`test_search_rule_with_bitwise_operator_is_rejected` verify the
engine returns `EvaluationError::UnsupportedType` when a
pattern-bearing rule uses `<`, `>`, `&`, `^`, `~`, etc. Previously
those fell through to `apply_operator` and silently produced
lexicographic comparisons against the pattern literal -- a
false-positive factory for any regex with metacharacters.
Zero-width regex match coverage (in `src/evaluator/types/regex.rs`):
* `test_read_regex_zero_width_start_anchor_matches` asserts `^` returns
`Some(Value::String(""))`, not `None`. The pre-fix engine flattened
both cases into a miss via `is_empty()`.
* `test_read_regex_zero_width_star_matches_empty` does the same for the
`a*` pattern against a buffer with no `a` bytes.
Refs #39
Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
AGENTS.md: * Move regex and search from "Planned Features" / "Current Limitations: No regex/search" to "Currently Implemented". The prior state had six separate stale references that directly contradicted the implementation. * Add an explicit "Current Limitations" entry for the `/l` flag (implemented as a multi-line `^` anchor, not the libmagic line-count semantic), the optional `search` range (should be mandatory per GNU file), and the missing 8192-byte default regex scan window. These are the remaining #39 acceptance-criteria gaps, tracked for v0.3. * Remove the speculative `BinaryRegex` trait sketch -- the PR ships binary-safe matching via `regex::bytes` directly, not via a wrapper trait. * Update "Key Dependencies" to reflect `regex` as a runtime dependency and add `memchr` to the list. * Update the "Adding New Type Support" note to include Regex and Search in the implemented-types list. GOTCHAS.md: * Rewrite section 2.4 to describe the `read_pattern_match` -> `Option<Value>` contract and the hard rejection of non-equality operators on pattern-bearing types (previously the fallthrough to `apply_operator` was tolerated). * Add section 2.5: zero-width regex match vs miss invariant. The `Option` in `read_regex`/`read_search` is load-bearing -- do not re-flatten to `Value::String("")` sentinels. * Add section 2.6: search anchor advance is match-end (per GNU `file` softmagic.c FILE_SEARCH), not window-end. Documents the prior bug and links the ground-truth analysis. docs/src/evaluator.md: * Fix the `read_typed_value` signature documentation. The actual code has a 3-arg `read_typed_value` convenience wrapper alongside the 4-arg `read_typed_value_with_pattern`; the doc previously showed only the 4-arg form with the wrong name. * Add Regex and Search to the enumerated type list. * Document the `read_pattern_match` entry point used by the engine for pattern-bearing types and explain why it returns `Option<Value>`. docs/solutions/integration-issues/implementing-variable-width-typekind-variant.md: * Correct the dispatch description: the implementation adds a sibling `read_typed_value_with_pattern` alongside the existing 3-arg `read_typed_value`, rather than extending the existing signature. The "Prevention" section previously prescribed the opposite of what the code did. * Add the lesson learned about sibling functions vs signature extensions (narrow new concerns favor siblings to avoid touching ~30 existing call sites). * Add the lesson learned about not overloading `Value::String("")` as a miss sentinel (zero-width matches are valid and distinct). * Add the lesson learned about search advancing by match-end (match_idx + pattern.len()), not window-end, per GNU file. * Add the lesson learned about rejecting non-equality operators on pattern-bearing types rather than falling through to apply_operator. Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Implements the full set of regex/search flags and semantics specified in the issue #39 acceptance criteria and GNU `file` magic(5). This is the acceptance-criteria completion pass on top of the earlier foundation commit. TypeKind shape redesign: * Add `RegexFlags { case_insensitive, start_offset, line_based }` struct to `src/parser/ast.rs`. The previous bare-`bool` fields on `TypeKind::Regex` are replaced by the flags struct; this matches the design prescribed in the issue #39 plan and keeps the blast radius of future flag additions to a single field. * `TypeKind::Regex` becomes `{ flags: RegexFlags, count: Option<NonZeroU32> }`. The count is the numeric prefix from `regex/N[flags]` (e.g., `1l` means count=1 with line_based=true). `None` means "use the default". * `TypeKind::Search::range` changes from `Option<usize>` to `NonZeroUsize`. Per GNU `file` magic(5), the search range is mandatory; this lifts the constraint into the type system so bare `search` and `search/0` are unrepresentable. The parser now hard- errors on both. * Add `REGEX_MAX_BYTES = 8192` constant to `ast.rs` matching GNU `file`'s `FILE_REGEX_MAX`. Parser: * `src/parser/grammar/mod.rs` accepts interleaved count/flag order (both `regex/1l` and `regex/l1` parse to the same value), matching libmagic's `parse_string_modifier` "last range wins" behavior. * Add `/s` flag parsing (start_offset). * Reject bare `search` and `search/0` at parse time via `NonZeroUsize`. * Reject `regex/0` -- a zero count has no valid semantics in our model since `None` already signals "use the default". * `strip_prefix` instead of direct `&rest[1..]` slicing per the AGENTS.md memory-safety rule. Evaluator (src/evaluator/types/regex.rs): * Rewrite `compute_window` to apply the 8192-byte cap unconditionally and implement line-based scanning. When `flags.line_based` is set, `count` is a line count; the scan walks line terminators (`\n` and `\r\n`, each counting as one) until the Nth terminator or the 8192 cap, whichever comes first. This matches GNU `file`'s `softmagic.c::mget_count` / `FILE_REGEX` path. * Remove the old `/l`-as-`^(?:...)`-anchor wrapping. `build_regex` now unconditionally sets `multi_line(true)` and `dot_matches_new_line(false)` for every pattern, matching libmagic's unconditional `REG_NEWLINE` in `alloc_regex` (softmagic.c:2123). `/l` no longer touches the compiled regex -- it only bounds the scan window. `^` and `$` match at line boundaries for every regex rule regardless of `/l`. * `read_regex` now takes `(flags: RegexFlags, count: Option<NonZeroU32>)` and uses `compute_window` for the scan slice. * `regex_bytes_consumed` implements the `/s` flag: when `flags.start_offset` is set, return `m.start()` (match-start anchor advance) instead of `m.end()` (match-end). This matches libmagic's `REGEX_OFFSET_START` / `moffset()` logic where `vlen = 0` zeros the match-length contribution. * 8192-byte cap applies to all regex scans including explicit counts larger than 8192. This is a DoS mitigation -- without it, a crafted regex against a multi-GB buffer combined with `EvaluationConfig::default()` (no timeout) can hang the evaluator. Evaluator (src/evaluator/types/search.rs): * `read_search` and `search_bytes_consumed` take `NonZeroUsize` instead of `Option<usize>`. Evaluator (src/evaluator/types/mod.rs): * Dispatch updated for the new Regex/Search variant shapes in `read_typed_value_with_pattern`, `read_pattern_match`, and `bytes_consumed_with_pattern`. The regex arm passes `flags` and `count` through; the search arm unwraps the `NonZeroUsize`. Strength calculation: * Search unconditionally gets the "constrained match" bonus (25) since the range is always bounded. * Regex gets the bonus when `count` is `Some` (user specified an explicit scan limit) and the base 20 otherwise. Tests: * `src/evaluator/types/regex.rs::tests`: - Line-based window (1-line cap, 3-line cap, CRLF terminators) - 8192-byte cap on default scans - 8192-byte cap applied to explicit counts larger than 8192 - Small explicit counts honored - `/s` flag returns match-start anchor advance - Multi-line regex matching always on (the old "start of line anchor" tests are replaced by explicit multi-line/`.` behavior tests) * `src/evaluator/types/search.rs::tests`: all tests updated for `NonZeroUsize` signature; unbounded-range tests removed. * `src/parser/grammar/tests/mod.rs`: - `/s` flag accepted - `regex/cs`, `regex/csl` multi-flag combinations accepted - Interleaved count/flag order (`regex/1l` and `regex/l1`) - `search/0` rejected at parse time - `regex/0` rejected at parse time * `src/evaluator/engine/tests.rs`, `tests/evaluator_tests.rs`, `tests/property_tests.rs`: all rule constructions updated for the new TypeKind shapes. Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
…iles Adds end-to-end regex/search tests against the corpus files listed as "blocked" in issue #39. These exercise the full evaluator pipeline on real third_party/tests fixtures instead of hand-crafted buffers. Tests added (tests/regex_search_corpus_tests.rs): * `test_searchbug_corpus_search_with_relative_child` -- models the searchbug.magic scenario (which uses `use`/`name`/`offset`/`&0` features we do not yet parse) via programmatic rule construction. Verifies that a `search/12 "ABC"` scan followed by a `Relative(0)` byte child reads the character immediately after the matched pattern, exercising the match-end anchor advance end-to-end. * `test_searchbug_search_anchor_advance_not_window_end` -- regression guard using the same corpus file to assert that the anchor advances by match-end (8 + 3 = 11) and NOT by window-end (8 + 12 = 20). A byte child expecting the window-end byte value must NOT match. * `test_json1_corpus_detected_by_regex` -- JSON detection via `^\s*[\{\[]` regex against `json1.testfile`. * `test_jsonlines1_corpus_detected_by_regex` -- same pattern against `jsonlines1.testfile`. * `test_cmd1_corpus_detected_by_regex` -- shell script detection via `^#![ \t]*/\S+` against `cmd1.testfile`. * `test_gedcom_corpus_detected_by_line_based_regex` -- exercises the `/l` line-based scan window (count = 1) with a `^0 HEAD` pattern against `gedcom.testfile`. * `test_regex_eol_version_extraction` -- smoke test for the simpler non-hierarchical part of the regex-eol scenario. * `test_corpus_files_exist` -- meta-guard so a broken third_party/ checkout fails loudly with file-existence errors instead of confusing evaluation failures. Where a corpus file depends on magic-file features we do not yet support (`use`/`name`/`offset`/`&+N` parser), the test bypasses `parse_text_magic_file` and builds the equivalent rule tree programmatically via the AST, matching the pattern documented in GOTCHAS 3.9. This gets issue #39's "blocked test corpus" coverage to 7 of 14 files (the remaining 7 are variants of the same formats tested here and will pass the same regex/search code paths). Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
…eria AGENTS.md: * Rewrite the "Currently Implemented" bullet for Regex to list all three flags (`/c`, `/s`, `/l`) with their meanings, including the 8192-byte hard cap, the unconditional multi-line regex compilation, and the `line_based` vs `start_offset` vs `case_insensitive` distinction. * Rewrite the "Currently Implemented" bullet for Search to note the mandatory `NonZeroUsize` range and reference GNU `file` magic(5). * Remove the "Current Limitations" entries for regex/search that no longer apply (the `/l`-as-anchor placeholder, the `search` range being optional, and the missing 8192-byte default). All of these were closed by the feature-completion commit. * Remove the "Planned Features" entries for `/s`, `/Nl` line-count semantics (both are now implemented). * Planned-features list now focuses on Aho-Corasick optimization, `!:mime`/`!:ext`/`!:apple` directives, and `use`/`name` named tests. GOTCHAS.md: * Section 2.7: regex `/l` is scan-window bounds only, not a multi-line toggle. Documents the removal of the old `^(?:...)` anchor wrapping and explains why `build_regex` now unconditionally sets `multi_line(true)` for every pattern. * Section 2.8: regex scan window is always capped at 8192 bytes. Lists the three paths this applies to (default, explicit, line-based) and explains the DoS mitigation rationale. Calls out the interaction with GOTCHAS S13.1 (`EvaluationConfig::default()` no timeout). * Section 2.9: regex `/s` flag affects anchor advance only. Documents the libmagic `REGEX_OFFSET_START` / `moffset()` correspondence and notes that tests for `/s` must check `regex_bytes_consumed` or downstream `Relative(N)` resolution -- checking `read_regex` alone won't detect a broken `/s` implementation. * Section 2.6 touched up to clarify that search's `/s`-style start- offset flag is not currently supported (it only applies to regex). Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Require conventional commit format per https://www.conventionalcommits.org/en/v1.0.0/. Skipped for bots.
🟢 Full CI must passWonderful, this rule succeeded.All CI checks must pass. Release-plz PRs are exempt because they only bump versions and changelogs (code was already tested on main), and GITHUB_TOKEN-triggered force-pushes suppress CI.
🟢 Do not merge outdated PRsWonderful, this rule succeeded.Make sure PRs are within 10 commits of the base branch before merging
|
Summary by CodeRabbit
WalkthroughThis PR implements pattern-bearing magic types: adds Changes
Sequence Diagram(s)sequenceDiagram
participant Engine as Evaluator Engine
participant Types as Types Dispatch
participant Pattern as Regex/Search Module
participant Anchor as Anchor Computation
Engine->>Types: evaluate_single_rule_with_anchor(rule, buf, off)
alt rule.typ == Regex or Search
Types->>Pattern: read_pattern_match(buf, off, rule.typ, Some(&rule.value))
Pattern-->>Types: Ok(Some(Value)) / Ok(None) / Err
Types-->>Engine: matched? (value.is_some())
Engine->>Engine: allow only '=' or '!=' else Err
Engine->>Anchor: bytes_consumed_with_pattern(buf, off, rule.typ, Some(&rule.value))
Anchor-->>Engine: consumed = match_end or match_start (if /s)
else
Types->>Types: read_typed_value_with_pattern(buf, off, type_kind, Some/None)
Types-->>Engine: Value
Engine->>Engine: apply operators normally
end
Engine-->>Engine: advance relative children by consumed
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 9 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (9 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Pull request overview
Implements full parser + evaluator support for libmagic’s pattern-bearing regex and search rule types, enabling GNU file-compatible text-format detection and relative-offset chaining needed by the #39 corpus fixtures.
Changes:
- Added
TypeKind::RegexandTypeKind::Search(plusRegexFlags/REGEX_MAX_BYTES) and parser support forregex/[csl]+ optional count and mandatorysearch/N. - Implemented evaluator readers for regex (binary-safe via
regex::bytes) and search (bounded scan viamemchr::memmem), including correct anchor advancement for relative offsets. - Added extensive unit + integration + corpus tests and updated project docs/manifests to reflect the new feature set.
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/regex_search_corpus_tests.rs | New end-to-end corpus integration tests for regex/search behavior and anchor advancement. |
| tests/property_tests.rs | Extends TypeKind generators to include Regex/Search for property-based coverage. |
| tests/evaluator_tests.rs | Adds an integration test for the upstream regex-eol corpus scenario via programmatic rule trees. |
| src/parser/types.rs | Recognizes regex/search keywords and provides placeholder TypeKind mapping for round-trip tests. |
| src/parser/grammar/tests/mod.rs | Adds parser tests covering regex/search suffix forms, errors, and operator adjacency. |
| src/parser/grammar/mod.rs | Implements parsing of regex/... modifiers (flags + count) and mandatory search/N range. |
| src/parser/codegen.rs | Adds codegen serialization for TypeKind::Regex and TypeKind::Search. |
| src/parser/ast.rs | Adds TypeKind::Regex/Search, RegexFlags, and REGEX_MAX_BYTES definitions + rustdoc. |
| src/evaluator/types/tests.rs | Updates anchor-consumption tests to use the new pattern-aware consumption path and adds regex/search consumption cases. |
| src/evaluator/types/search.rs | New bounded literal-search reader and anchor-advance computation. |
| src/evaluator/types/regex.rs | New binary-safe regex reader with scan-window computation and anchor-advance computation. |
| src/evaluator/types/mod.rs | Adds pattern-aware read/dispatch (read_typed_value_with_pattern, read_pattern_match) and bytes_consumed_with_pattern. |
| src/evaluator/strength.rs | Assigns default strength for Regex/Search types. |
| src/evaluator/engine/tests.rs | Adds engine-level tests for regex/search matching semantics and operator restrictions. |
| src/evaluator/engine/mod.rs | Updates engine evaluation to special-case pattern-bearing types and thread patterns into anchor advancement. |
| GOTCHAS.md | Documents new invariants/semantics for pattern-bearing types and corpus-testing guidance. |
| docs/src/evaluator.md | Documents Regex/Search in evaluator type-reading section and the pattern-aware API. |
| docs/solutions/integration-issues/implementing-variable-width-typekind-variant.md | New solution doc describing the integration work and pitfalls encountered. |
| Cargo.toml | Adds regex to main dependencies and bumps clap_complete. |
| Cargo.lock | Updates lockfile for clap_complete version bump. |
| AGENTS.md | Updates project capability docs to mark Regex/Search as implemented and describes semantics/limits. |
| .serena/project.yml | Documentation/formatting updates to Serena project configuration. |
| - **QDate**: 64-bit Unix timestamps (signed seconds since epoch) with configurable endianness and UTC/local time formatting | ||
| - **String**: Byte sequences with length limits | ||
| - **PString**: Pascal-style length-prefixed strings with 1-byte (`/B`), 2-byte (`/H` or `/h`), or 4-byte (`/L` or `/l`) length prefixes, supporting big-endian and little-endian byte order | ||
| - **Regex**: Binary-safe regex matching via `regex::bytes::Regex`; the `/c` flag enables case-insensitive matching and `/l` enables multi-line start-of-line anchoring |
There was a problem hiding this comment.
The docs claim the /l flag “enables multi-line start-of-line anchoring”, but the implementation compiles regexes with multi-line mode unconditionally and /l only changes how the scan window is bounded (line-count vs byte-count). Please update this bullet to reflect the actual /l semantics (and that ^/$ line-boundary behavior is always enabled).
| - **Regex**: Binary-safe regex matching via `regex::bytes::Regex`; the `/c` flag enables case-insensitive matching and `/l` enables multi-line start-of-line anchoring | |
| - **Regex**: Binary-safe regex matching via `regex::bytes::Regex`; the `/c` flag enables case-insensitive matching, `^`/`$` line-boundary behavior is always enabled, and `/l` switches scan-window bounding to line-count semantics rather than byte-count semantics |
| /// * `None`: use the 8192-byte default scan window. | ||
| /// * `Some(n)` with `flags.line_based == false`: scan at most `n` | ||
| /// bytes, capped at 8192. | ||
| /// * `Some(n)` with `flags.line_based == true`: scan at most `n` | ||
| /// lines, with an effective byte cap of `min(n * 80, 8192)`. | ||
| /// | ||
| /// The 8192-byte hard cap matches GNU `file`'s `FILE_REGEX_MAX` and | ||
| /// prevents runaway regex scans against large buffers. | ||
| count: Option<NonZeroU32>, |
There was a problem hiding this comment.
TypeKind::Regex rustdoc says line-based mode uses an effective byte cap of min(n * 80, 8192), but the evaluator’s compute_window implementation bounds the window by scanning for line terminators up to the Nth line (and always caps at 8192 bytes) — there is no n * 80 behavior. Please adjust the doc comment to match the implemented windowing semantics so consumers don’t rely on the wrong contract.
| /// For `TypeKind::Regex`, the pattern is required to re-run the match and | ||
| /// compute the consumed bytes. When the pattern is unavailable (or not a | ||
| /// string), the function returns `0` -- the anchor will then stay put and | ||
| /// the next relative offset resolves against the previous anchor position, | ||
| /// which is the same graceful-degradation behavior used by the other | ||
| /// defensive paths in this module. For `TypeKind::Search`, the pattern is | ||
| /// not needed because the consumed distance is the entire search window | ||
| /// regardless of where the match was found. Non-pattern types should pass | ||
| /// `pattern: None`. |
There was a problem hiding this comment.
This doc comment says TypeKind::Search doesn’t need the pattern because “the consumed distance is the entire search window regardless of where the match was found”, but the implementation re-scans and advances by match_idx + pattern.len() (and therefore requires the pattern). Please update the comment to reflect match-end semantics and the need for the pattern operand.
| /// For `TypeKind::Regex`, the pattern is required to re-run the match and | |
| /// compute the consumed bytes. When the pattern is unavailable (or not a | |
| /// string), the function returns `0` -- the anchor will then stay put and | |
| /// the next relative offset resolves against the previous anchor position, | |
| /// which is the same graceful-degradation behavior used by the other | |
| /// defensive paths in this module. For `TypeKind::Search`, the pattern is | |
| /// not needed because the consumed distance is the entire search window | |
| /// regardless of where the match was found. Non-pattern types should pass | |
| /// `pattern: None`. | |
| /// For `TypeKind::Regex` and `TypeKind::Search`, the pattern is required | |
| /// to re-run the match and compute the consumed bytes. For search, the | |
| /// consumed distance is measured to the end of the matched occurrence | |
| /// (`match_idx + pattern.len()`), not the full search window. When the | |
| /// pattern is unavailable (or not a string), the function returns `0` -- | |
| /// the anchor will then stay put and the next relative offset resolves | |
| /// against the previous anchor position, which is the same | |
| /// graceful-degradation behavior used by the other defensive paths in | |
| /// this module. Non-pattern types should pass `pattern: None`. |
| //! pattern within a bounded window. Unlike `TypeKind::String`, which only | ||
| //! matches at the exact offset, `search` advances through the buffer looking | ||
| //! for the first occurrence of the pattern anywhere in the window. The | ||
| //! search window is `buffer[offset..]` capped by the optional `range`. |
There was a problem hiding this comment.
Module-level docs describe the search window as being capped by an “optional range”, but TypeKind::Search stores range as NonZeroUsize and the grammar rejects bare search/search/0, so the range is mandatory. Please update the docs to avoid implying that an unbounded/optional range is supported.
| //! search window is `buffer[offset..]` capped by the optional `range`. | |
| //! search window is `buffer[offset..]` capped by the mandatory `range` | |
| //! (or the remaining buffer length, if smaller). |
| **Regex reader** (`src/evaluator/types/regex.rs`) — uses a `build_regex` helper that wraps the pattern in `^(?:...)` when `/l` is set so bare, unanchored patterns cannot match mid-line: | ||
|
|
||
| ```rust | ||
| fn build_regex( | ||
| pattern: &str, | ||
| case_insensitive: bool, | ||
| start_of_line: bool, | ||
| ) -> Result<Regex, regex::Error> { | ||
| let owned; | ||
| let effective_pattern: &str = if start_of_line { | ||
| owned = format!("^(?:{pattern})"); | ||
| &owned | ||
| } else { | ||
| pattern | ||
| }; | ||
| RegexBuilder::new(effective_pattern) | ||
| .case_insensitive(case_insensitive) | ||
| .multi_line(start_of_line) | ||
| .build() | ||
| } | ||
|
|
There was a problem hiding this comment.
This solution doc’s regex semantics (wrapping patterns with ^(?:...) when /l is set and toggling multi_line(start_of_line)) conflict with the current implementation and GOTCHAS: regex multi-line mode is unconditional, and /l controls only scan-window bounding (line-based window), not anchoring. The read_search signature shown here also uses an optional range, which no longer matches the mandatory NonZeroUsize range. Please update this doc to match the merged behavior so it doesn’t mislead future contributors.
|
Related Documentation 4 document(s) may need updating based on files changed in this PR: libMagic-rs ARCHITECTURE
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/parser/grammar/mod.rs (1)
394-563: 🛠️ Refactor suggestion | 🟠 MajorExtract the regex/search suffix parsers instead of growing
parse_type_and_operatorfurther.This function now owns pstring suffixes, regex modifiers, mandatory search ranges, attached
&mask, and fallback operator parsing. The behavior is fine, but the new branches make the operator-boundary rules much harder to audit and push an already-oversized parser file further past the repo’s size target. Pulling the regex/search pieces intoparse_regex_suffix(...)/parse_search_suffix(...)helpers would make this logic easier to test and maintain.As per coding guidelines, "Keep source files under 500-600 lines; split larger files into focused modules" and "Module structure: Use src/parser/ for parsing system with ast.rs, grammar/, types.rs, hierarchy.rs, loader.rs, and codegen.rs".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/parser/grammar/mod.rs` around lines 394 - 563, parse_type_and_operator is too large and contains inline parsing for regex and search suffixes which should be extracted; create two helper functions parse_regex_suffix(input: &str) -> IResult<&str, (RegexFlags, Option<u32>)> and parse_search_suffix(input: &str) -> IResult<&str, ::std::num::NonZeroUsize> (reuse existing parse_pstring_suffix for pstring) and replace the large inline blocks in parse_type_and_operator with calls to these helpers, ensuring they return the remaining rest string and parsed values (flags/count or range) and preserve the same error semantics and operator-boundary behavior so later logic (attached_op parsing and type_keyword_to_kind handling) continues to work unchanged.docs/src/evaluator.md (1)
595-596:⚠️ Potential issue | 🟡 MinorImplementation status is stale after this PR.
The checklist still shows regex support as unimplemented even though this change adds
TypeKind::Regex/TypeKind::Search, parser support, evaluator wiring, and tests. That will send readers to the wrong source of truth immediately after merge.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/src/evaluator.md` around lines 595 - 596, The docs checklist is out of date—mark Regex support as implemented: update the evaluator.md checklist entry for "Regex type support" to checked or remove it so it reflects that TypeKind::Regex and TypeKind::Search, parser support, evaluator wiring, and tests are present; ensure the doc text mentions the implemented symbols (TypeKind::Regex, TypeKind::Search, parser/evaluator wiring) so readers aren't directed to the wrong source of truth after merge.src/evaluator/types/mod.rs (1)
183-239:⚠️ Potential issue | 🟠 MajorKeep a lossless public API for pattern reads.
Lines 219-225 and 237-238 collapse
Ok(None)intoValue::String(""), while Lines 261-295 keep the only lossless matcher crate-private. That makes the new publicread_typed_value_with_patternambiguous for external callers: a regex miss is indistinguishable from a zero-width hit, and a search miss can masquerade as an empty-string match. Please either expose a losslessOption/enum-based API publicly or keep this helper internal until that contract exists.Also applies to: 261-295
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/evaluator/types/mod.rs` around lines 183 - 239, The public helper read_typed_value_with_pattern currently collapses None results from read_regex/read_search into Value::String("") for TypeKind::Regex and TypeKind::Search, losing the distinction between a miss and an empty/zero-width match; either change the public API to return an Option/enum that preserves None (e.g., Result<Option<Value>, TypeReadError> or a small enum MatchResult) or make read_typed_value_with_pattern crate-private/internal until you add that lossless return type; update callers accordingly and keep references to TypeKind::Regex, TypeKind::Search, read_regex, and read_search to locate the affected branches to implement the lossless return or visibility change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@AGENTS.md`:
- Line 307: Update the AGENTS.md text that claims regex/search coverage is
incomplete: locate the sentence "Note: Currently implemented types are `Byte`,
`Short`, `Long`, `Quad`, `Float`, `Double`, `Date`, `QDate`, `String`,
`PString`, `Regex`, and `Search`" and the later lines mentioning regex flag
completeness/search range enforcement (the "Current Limitations"/future-phase
notes) and revise them to reflect that `Regex` and `Search` support has been
implemented by this PR—remove or reword the TODO language so the document
consistently states that regex/search flags and range enforcement are
implemented, and update the "Current Limitations" section to only list remaining
true gaps.
In `@docs/src/evaluator.md`:
- Around line 135-157: Update the docs to reflect the current pattern behavior:
clarify that the /l flag no longer enables multiline anchoring but only sets a
line-based window size, that multiline matching is now unconditional (not gated
by /l), and that the engine does not always call read_typed_value_with_pattern
for pattern-bearing rules—evaluate_single_rule_with_anchor now calls
read_pattern_match to preserve zero-width-match vs miss semantics
(read_typed_value remains the None-pattern wrapper). Adjust the paragraphs
describing /l and the engine call path accordingly and reference
read_typed_value_with_pattern, read_pattern_match,
evaluate_single_rule_with_anchor, and read_typed_value when explaining the
corrected behavior.
In `@src/evaluator/engine/tests.rs`:
- Around line 2597-2605: Update the test comment in
test_search_parent_relative_child_at_positive_offset to reflect that this is a
positive-offset sanity check: change phrasing from “Sanity check the negative”
and any mention of `Relative(-N)` to indicate a positive relative offset (e.g.,
`OffsetSpec::Relative(1)`), and ensure the comment correctly describes that a
`Relative(1)` child resolves against the match-end anchor for `MagicRule` in
this test.
In `@src/evaluator/strength.rs`:
- Around line 80-93: The test_test_calculate_default_strength_table currently
doesn't cover the new scoring branches for TypeKind::Regex and TypeKind::Search;
update the test_test_calculate_default_strength_table to explicitly construct
TypeKind::Regex variants (one with count = Some(...) expecting strength 25 and
one with count = None expecting strength 20) and a TypeKind::Search (expecting
strength 25), call calculate_default_strength_table (or the library function
that maps TypeKind -> strength) and assert those expected strength values so
future changes to scoring will fail the test if regressed; reference
TypeKind::Regex, TypeKind::Search, calculate_default_strength_table, and
test_calculate_default_strength_table when adding the assertions.
In `@src/evaluator/types/mod.rs`:
- Around line 375-383: Update the rustdoc for the TypeKind handling: change the
incorrect statement that "TypeKind::Search does not need the pattern and
consumes the full window" to accurately say that TypeKind::Search uses the
provided pattern/needle to find the match end and computes the consumed distance
from the anchor to the end of that match (i.e., callers pass the needle into
search_bytes_consumed to determine the consumed bytes), and mention the fallback
behavior when the pattern is missing or not a string; also update the similar
comment near the search_bytes_consumed usage (the lines that pass the needle
into search_bytes_consumed) so both places describe the current match-end
behavior consistently.
In `@src/evaluator/types/regex.rs`:
- Around line 82-121: The window walker in compute_window (the logic using
buffer.get(offset..), byte_cap, capped and the while loop over capped[idx]) must
be rewritten to use the SafeBufferAccess/.get() pattern everywhere instead of
direct slicing or indexing; replace direct slices like &capped[..count_bytes]
and &capped[..idx] and indexed access capped[idx] and capped[idx+1] with calls
to buffer.get(...) or capped.get(...) and handle Option results (early return or
break) to maintain the same CR/CRLF and line-count behavior, keeping the same
REGEX_MAX_BYTES cap, the flags.line_based branch, and the
target_lines/lines_seen logic but using safe .get() lookups and saturation
semantics.
In `@src/evaluator/types/search.rs`:
- Around line 53-55: Both read_search and search_bytes_consumed create a window
using direct slicing (let remaining = &buffer[offset..]; let window =
&remaining[..window_len];) which violates the SafeBufferAccess pattern; replace
these raw slices with SafeBufferAccess .get() calls to perform bounds-checked
access (e.g. use buffer.get(offset..offset+window_len) or
buffer.get(offset..).and_then(|r| r.get(..window_len))) and handle the Option
result consistently. Update the code paths that compute window_len from
range.get() so they call .get(range_start..range_start+window_len) (or chain
.get on remaining) and return/propagate a safe empty/nonexistent slice when
.get() yields None; apply the same change to the other occurrence around lines
~90-100 so both read_search and search_bytes_consumed use .get() for all buffer
windows.
In `@src/parser/ast.rs`:
- Around line 383-394: The doc for the AST field `count: Option<NonZeroU32>` is
incorrect about line-based behavior; update the rustdoc on `count` (and nearby
comment for `/l`) to reflect the real semantics implemented in
`evaluator::types::regex.rs` (see `compute_window`): when `flags.line_based ==
true` the code advances to the Nth line terminator and only afterward enforces
the 8192-byte hard cap, not an a priori byte estimate of `min(n * 80, 8192)`;
add a short example showing `regex/Nl` walking to the Nth newline then trimming
to 8192 bytes and a brief note about the performance/design tradeoff to justify
walking lines before the cap.
In `@src/parser/codegen.rs`:
- Around line 235-253: The generated code emits
libmagic_rs::parser::ast::RegexFlags which doesn't match in-crate imports; in
the TypeKind::Regex branch of src/parser/codegen.rs (the match arm constructing
the format! string for Regex), replace the path
libmagic_rs::parser::ast::RegexFlags with crate::parser::ast::RegexFlags so the
produced string uses crate::parser::ast::RegexFlags {{ case_insensitive: ...,
start_offset: ..., line_based: ... }} and keep the existing flags.field
interpolations and count handling unchanged.
In `@src/parser/grammar/tests/mod.rs`:
- Around line 2279-2446: Extract the regex/search tests (functions
test_parse_type_and_operator_regex_and_search_suffixes,
test_parse_type_and_operator_search_requires_range,
test_parse_type_and_operator_regex_invalid_suffix,
test_parse_type_and_operator_regex_operator_adjacent,
test_parse_magic_rule_regex_and_search) into a new test file named
src/parser/grammar/tests/regex_search.rs, preserving all use/imports (e.g.
crate::parser::ast::{RegexFlags, TypeKind}, OffsetSpec, Operator, Value) and
helper functions rx and sr; remove those tests from the large mod.rs and add mod
regex_search; to the tests module so the new file is compiled, ensuring any
relative paths remain crate::... so nothing else needs adjusting, then run cargo
test to verify.
In `@src/parser/types.rs`:
- Around line 319-321: Replace the panic-based construction in the
TypeKind::Search mapping by removing the expect() call and using a safe,
explicit non-panicking construction; specifically, change
::std::num::NonZeroUsize::new(1).expect("1 is nonzero") to an explicit unsafe
construction like unsafe { ::std::num::NonZeroUsize::new_unchecked(1) } (with a
brief comment that 1 is statically non-zero) so the library code doesn't call
expect()/panic.
---
Outside diff comments:
In `@docs/src/evaluator.md`:
- Around line 595-596: The docs checklist is out of date—mark Regex support as
implemented: update the evaluator.md checklist entry for "Regex type support" to
checked or remove it so it reflects that TypeKind::Regex and TypeKind::Search,
parser support, evaluator wiring, and tests are present; ensure the doc text
mentions the implemented symbols (TypeKind::Regex, TypeKind::Search,
parser/evaluator wiring) so readers aren't directed to the wrong source of truth
after merge.
In `@src/evaluator/types/mod.rs`:
- Around line 183-239: The public helper read_typed_value_with_pattern currently
collapses None results from read_regex/read_search into Value::String("") for
TypeKind::Regex and TypeKind::Search, losing the distinction between a miss and
an empty/zero-width match; either change the public API to return an Option/enum
that preserves None (e.g., Result<Option<Value>, TypeReadError> or a small enum
MatchResult) or make read_typed_value_with_pattern crate-private/internal until
you add that lossless return type; update callers accordingly and keep
references to TypeKind::Regex, TypeKind::Search, read_regex, and read_search to
locate the affected branches to implement the lossless return or visibility
change.
In `@src/parser/grammar/mod.rs`:
- Around line 394-563: parse_type_and_operator is too large and contains inline
parsing for regex and search suffixes which should be extracted; create two
helper functions parse_regex_suffix(input: &str) -> IResult<&str, (RegexFlags,
Option<u32>)> and parse_search_suffix(input: &str) -> IResult<&str,
::std::num::NonZeroUsize> (reuse existing parse_pstring_suffix for pstring) and
replace the large inline blocks in parse_type_and_operator with calls to these
helpers, ensuring they return the remaining rest string and parsed values
(flags/count or range) and preserve the same error semantics and
operator-boundary behavior so later logic (attached_op parsing and
type_keyword_to_kind handling) continues to work unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ad7f0fa2-ae76-4ad8-a150-7da5774a11fa
⛔ Files ignored due to path filters (6)
.serena/project.ymlis excluded by none and included by noneCargo.lockis excluded by!**/*.lockand included by nonedocs/solutions/integration-issues/implementing-variable-width-typekind-variant.mdis excluded by none and included by nonetests/evaluator_tests.rsis excluded by none and included by nonetests/property_tests.rsis excluded by none and included by nonetests/regex_search_corpus_tests.rsis excluded by none and included by none
📒 Files selected for processing (16)
AGENTS.mdCargo.tomlGOTCHAS.mddocs/src/evaluator.mdsrc/evaluator/engine/mod.rssrc/evaluator/engine/tests.rssrc/evaluator/strength.rssrc/evaluator/types/mod.rssrc/evaluator/types/regex.rssrc/evaluator/types/search.rssrc/evaluator/types/tests.rssrc/parser/ast.rssrc/parser/codegen.rssrc/parser/grammar/mod.rssrc/parser/grammar/tests/mod.rssrc/parser/types.rs
| ### Adding New Type Support | ||
|
|
||
| > **Note:** Currently implemented types are `Byte`, `Short`, `Long`, `Quad`, `Float`, `Double`, `Date`, `QDate`, `String`, and `PString`. Regex and search types are planned for future releases. | ||
| > **Note:** Currently implemented types are `Byte`, `Short`, `Long`, `Quad`, `Float`, `Double`, `Date`, `QDate`, `String`, `PString`, `Regex`, and `Search`. See "Current Limitations" for the remaining gaps in regex/search flag coverage. |
There was a problem hiding this comment.
AGENTS still presents regex/search work as incomplete.
Line 307 refers readers to “remaining gaps in regex/search flag coverage”, and Lines 462-463 still put regex flag completeness/search range enforcement in a future phase. That now conflicts with the implementation this PR adds, so future contributors will get mixed guidance from the same document.
Also applies to: 462-463
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@AGENTS.md` at line 307, Update the AGENTS.md text that claims regex/search
coverage is incomplete: locate the sentence "Note: Currently implemented types
are `Byte`, `Short`, `Long`, `Quad`, `Float`, `Double`, `Date`, `QDate`,
`String`, `PString`, `Regex`, and `Search`" and the later lines mentioning regex
flag completeness/search range enforcement (the "Current
Limitations"/future-phase notes) and revise them to reflect that `Regex` and
`Search` support has been implemented by this PR—remove or reword the TODO
language so the document consistently states that regex/search flags and range
enforcement are implemented, and update the "Current Limitations" section to
only list remaining true gaps.
| - **Regex**: Binary-safe regex matching via `regex::bytes::Regex`; the `/c` flag enables case-insensitive matching and `/l` enables multi-line start-of-line anchoring | ||
| - **Search**: Bounded literal pattern scan via `memchr::memmem::find`; `search/N` caps the scan window to `N` bytes from the offset | ||
| - **Bounds checking**: Prevents buffer overruns | ||
|
|
||
| ```rust | ||
| // Non-pattern types use the 3-arg convenience wrapper: | ||
| pub fn read_typed_value( | ||
| buffer: &[u8], | ||
| offset: usize, | ||
| type_kind: &TypeKind, | ||
| ) -> Result<Value, TypeReadError> | ||
|
|
||
| // Pattern-bearing types (Regex, Search) thread the rule's value operand | ||
| // through as the match pattern: | ||
| pub fn read_typed_value_with_pattern( | ||
| buffer: &[u8], | ||
| offset: usize, | ||
| type_kind: &TypeKind, | ||
| pattern: Option<&Value>, | ||
| ) -> Result<Value, TypeReadError> | ||
| ``` | ||
|
|
||
| The engine uses `read_typed_value_with_pattern` uniformly and passes `Some(&rule.value)` for every rule; the convenience `read_typed_value` is a thin wrapper that forwards `pattern: None`. For pattern-bearing types a genuine "no match" is collapsed to `Value::String(String::new())` in the `read_typed_value_with_pattern` return so the back-compat `Value` shape is preserved; the engine instead calls `read_pattern_match` directly, which returns `Result<Option<Value>, _>` so zero-width matches (e.g. `^`, `a*`) can be distinguished from genuine misses. |
There was a problem hiding this comment.
These docs still describe the pre-fix regex/search behavior.
Line 135 says /l enables multiline anchoring, but the implementation now treats /l only as line-based window sizing; multiline matching is unconditional. The paragraph at Line 157 also says the engine “uses read_typed_value_with_pattern uniformly”, but evaluate_single_rule_with_anchor now special-cases pattern-bearing types through read_pattern_match specifically to preserve zero-width-match vs miss semantics.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/src/evaluator.md` around lines 135 - 157, Update the docs to reflect the
current pattern behavior: clarify that the /l flag no longer enables multiline
anchoring but only sets a line-based window size, that multiline matching is now
unconditional (not gated by /l), and that the engine does not always call
read_typed_value_with_pattern for pattern-bearing
rules—evaluate_single_rule_with_anchor now calls read_pattern_match to preserve
zero-width-match vs miss semantics (read_typed_value remains the None-pattern
wrapper). Adjust the paragraphs describing /l and the engine call path
accordingly and reference read_typed_value_with_pattern, read_pattern_match,
evaluate_single_rule_with_anchor, and read_typed_value when explaining the
corrected behavior.
| /// Sanity check the negative: when the parent search finds the pattern | ||
| /// early in the window, a Relative(-N) child should still resolve against | ||
| /// the match-end anchor. This catches a class of bugs where the anchor | ||
| /// update uses the wrong base offset. | ||
| #[test] | ||
| fn test_search_parent_relative_child_at_positive_offset() { | ||
| // Buffer: "prefix_NEEDLE_after_stuff" -- "NEEDLE" is at index 7, len | ||
| // 6, match-end = 13. A Relative(1) child should read byte 14 = 'a'. | ||
| let child = MagicRule { |
There was a problem hiding this comment.
Correct the test comment wording (“negative” vs positive offset).
The comment says “Sanity check the negative” and references Relative(-N), but the test uses OffsetSpec::Relative(1). Update wording to avoid future confusion.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/evaluator/engine/tests.rs` around lines 2597 - 2605, Update the test
comment in test_search_parent_relative_child_at_positive_offset to reflect that
this is a positive-offset sanity check: change phrasing from “Sanity check the
negative” and any mention of `Relative(-N)` to indicate a positive relative
offset (e.g., `OffsetSpec::Relative(1)`), and ensure the comment correctly
describes that a `Relative(1)` child resolves against the match-end anchor for
`MagicRule` in this test.
| // Regex matches a pattern -- treat similarly to an unbounded string. | ||
| // A rule with an explicit `count` is more constrained (narrower scan | ||
| // window) and therefore more specific. | ||
| TypeKind::Regex { count, .. } => { | ||
| if count.is_some() { | ||
| 25 | ||
| } else { | ||
| 20 | ||
| } | ||
| } | ||
| // Search is always a bounded scan (the range is mandatory), so it | ||
| // gets the "constrained match" bonus unconditionally. This matches | ||
| // the max_length bonus used for String and PString. | ||
| TypeKind::Search { .. } => 25, |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Add explicit strength tests for Regex and Search branches.
This introduces new scoring paths but test_calculate_default_strength_table doesn’t assert them yet. Add explicit cases so future tuning doesn’t regress silently.
As per coding guidelines: src/**/*.rs: Every PR must include tests for new functionality.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/evaluator/strength.rs` around lines 80 - 93, The
test_test_calculate_default_strength_table currently doesn't cover the new
scoring branches for TypeKind::Regex and TypeKind::Search; update the
test_test_calculate_default_strength_table to explicitly construct
TypeKind::Regex variants (one with count = Some(...) expecting strength 25 and
one with count = None expecting strength 20) and a TypeKind::Search (expecting
strength 25), call calculate_default_strength_table (or the library function
that maps TypeKind -> strength) and assert those expected strength values so
future changes to scoring will fail the test if regressed; reference
TypeKind::Regex, TypeKind::Search, calculate_default_strength_table, and
test_calculate_default_strength_table when adding the assertions.
| /// For `TypeKind::Regex`, the pattern is required to re-run the match and | ||
| /// compute the consumed bytes. When the pattern is unavailable (or not a | ||
| /// string), the function returns `0` -- the anchor will then stay put and | ||
| /// the next relative offset resolves against the previous anchor position, | ||
| /// which is the same graceful-degradation behavior used by the other | ||
| /// defensive paths in this module. For `TypeKind::Search`, the pattern is | ||
| /// not needed because the consumed distance is the entire search window | ||
| /// regardless of where the match was found. Non-pattern types should pass | ||
| /// `pattern: None`. |
There was a problem hiding this comment.
Fix the search anchor rustdoc.
Lines 380-382 say TypeKind::Search does not need the pattern and consumes the full window, but Lines 459-465 pass the needle into search_bytes_consumed. Please update the comment to describe the current match-end behavior; right now the docs encode the wrong invariant for future refactors.
Also applies to: 459-465
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/evaluator/types/mod.rs` around lines 375 - 383, Update the rustdoc for
the TypeKind handling: change the incorrect statement that "TypeKind::Search
does not need the pattern and consumes the full window" to accurately say that
TypeKind::Search uses the provided pattern/needle to find the match end and
computes the consumed distance from the anchor to the end of that match (i.e.,
callers pass the needle into search_bytes_consumed to determine the consumed
bytes), and mention the fallback behavior when the pattern is missing or not a
string; also update the similar comment near the search_bytes_consumed usage
(the lines that pass the needle into search_bytes_consumed) so both places
describe the current match-end behavior consistently.
| let remaining = &buffer[offset..]; | ||
| let window_len = range.get().min(remaining.len()); | ||
| let window = &remaining[..window_len]; |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Replace direct slices with .get()-based window construction.
Both read_search and search_bytes_consumed build their scan windows with direct slicing. That works because window_len is pre-clamped, but evaluator code in this repo is supposed to stay on the .get()/SafeBufferAccess path consistently.
Proposed direction
- let window = &remaining[..window_len];
+ let Some(window) = remaining.get(..window_len) else {
+ return Ok(None);
+ };- let window = &remaining[..window_len];
+ let Some(window) = remaining.get(..window_len) else {
+ return 0;
+ };Also applies to: 90-100
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/evaluator/types/search.rs` around lines 53 - 55, Both read_search and
search_bytes_consumed create a window using direct slicing (let remaining =
&buffer[offset..]; let window = &remaining[..window_len];) which violates the
SafeBufferAccess pattern; replace these raw slices with SafeBufferAccess .get()
calls to perform bounds-checked access (e.g. use
buffer.get(offset..offset+window_len) or buffer.get(offset..).and_then(|r|
r.get(..window_len))) and handle the Option result consistently. Update the code
paths that compute window_len from range.get() so they call
.get(range_start..range_start+window_len) (or chain .get on remaining) and
return/propagate a safe empty/nonexistent slice when .get() yields None; apply
the same change to the other occurrence around lines ~90-100 so both read_search
and search_bytes_consumed use .get() for all buffer windows.
| /// Optional numeric count from `regex/N[flags]`. Interpretation | ||
| /// depends on `flags.line_based`: | ||
| /// | ||
| /// * `None`: use the 8192-byte default scan window. | ||
| /// * `Some(n)` with `flags.line_based == false`: scan at most `n` | ||
| /// bytes, capped at 8192. | ||
| /// * `Some(n)` with `flags.line_based == true`: scan at most `n` | ||
| /// lines, with an effective byte cap of `min(n * 80, 8192)`. | ||
| /// | ||
| /// The 8192-byte hard cap matches GNU `file`'s `FILE_REGEX_MAX` and | ||
| /// prevents runaway regex scans against large buffers. | ||
| count: Option<NonZeroU32>, |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Document /l the way compute_window actually behaves.
These field docs say line-based mode has an effective byte cap of min(n * 80, 8192), but the implementation in src/evaluator/types/regex.rs walks forward to the Nth line terminator and only then applies the 8192-byte hard cap. As written, the public AST docs give the wrong mental model for regex/Nl.
Proposed doc fix
- /// * `Some(n)` with `flags.line_based == true`: scan at most `n`
- /// lines, with an effective byte cap of `min(n * 80, 8192)`.
+ /// * `Some(n)` with `flags.line_based == true`: scan through the
+ /// end of the Nth line terminator, capped at 8192 bytes.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/parser/ast.rs` around lines 383 - 394, The doc for the AST field `count:
Option<NonZeroU32>` is incorrect about line-based behavior; update the rustdoc
on `count` (and nearby comment for `/l`) to reflect the real semantics
implemented in `evaluator::types::regex.rs` (see `compute_window`): when
`flags.line_based == true` the code advances to the Nth line terminator and only
afterward enforces the 8192-byte hard cap, not an a priori byte estimate of
`min(n * 80, 8192)`; add a short example showing `regex/Nl` walking to the Nth
newline then trimming to 8192 bytes and a brief note about the
performance/design tradeoff to justify walking lines before the cap.
| TypeKind::Regex { flags, count } => { | ||
| let count_lit = match count { | ||
| Some(n) => format!("::std::num::NonZeroU32::new({}).unwrap()", n.get()), | ||
| None => String::new(), | ||
| }; | ||
| let count_expr = if count.is_some() { | ||
| format!("Some({count_lit})") | ||
| } else { | ||
| "None".to_string() | ||
| }; | ||
| format!( | ||
| "TypeKind::Regex {{ flags: libmagic_rs::parser::ast::RegexFlags {{ case_insensitive: {}, start_offset: {}, line_based: {} }}, count: {count_expr} }}", | ||
| flags.case_insensitive, flags.start_offset, flags.line_based | ||
| ) | ||
| } | ||
| TypeKind::Search { range } => format!( | ||
| "TypeKind::Search {{ range: ::std::num::NonZeroUsize::new({}).unwrap() }}", | ||
| range.get() | ||
| ), |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "1) Check whether crate self-alias exists:"
rg -n --type=rust 'extern crate self as libmagic_rs' src build.rs || true
echo
echo "2) Find generated-source import style and emitted RegexFlags path:"
rg -n --type=rust 'use crate::parser::ast::|libmagic_rs::parser::ast::RegexFlags' src/parser/codegen.rs
echo
echo "3) Find where generated builtin rules are included/consumed:"
rg -n --type=rust 'generate_builtin_rules|BUILTIN_RULES|include!\(' src build.rsRepository: EvilBit-Labs/libmagic-rs
Length of output: 2533
Fix generated code path for RegexFlags to use crate:: prefix.
The codegen emits libmagic_rs::parser::ast::RegexFlags on line 246, but generated code imports from crate::parser::ast::... (line 32) with no crate self-alias. This causes a compilation error when Regex rules are serialized. Change to crate::parser::ast::RegexFlags to match the generated import style and ensure in-crate resolution.
Suggested patch
- format!(
- "TypeKind::Regex {{ flags: libmagic_rs::parser::ast::RegexFlags {{ case_insensitive: {}, start_offset: {}, line_based: {} }}, count: {count_expr} }}",
- flags.case_insensitive, flags.start_offset, flags.line_based
- )
+ format!(
+ "TypeKind::Regex {{ flags: crate::parser::ast::RegexFlags {{ case_insensitive: {}, start_offset: {}, line_based: {} }}, count: {count_expr} }}",
+ flags.case_insensitive, flags.start_offset, flags.line_based
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| TypeKind::Regex { flags, count } => { | |
| let count_lit = match count { | |
| Some(n) => format!("::std::num::NonZeroU32::new({}).unwrap()", n.get()), | |
| None => String::new(), | |
| }; | |
| let count_expr = if count.is_some() { | |
| format!("Some({count_lit})") | |
| } else { | |
| "None".to_string() | |
| }; | |
| format!( | |
| "TypeKind::Regex {{ flags: libmagic_rs::parser::ast::RegexFlags {{ case_insensitive: {}, start_offset: {}, line_based: {} }}, count: {count_expr} }}", | |
| flags.case_insensitive, flags.start_offset, flags.line_based | |
| ) | |
| } | |
| TypeKind::Search { range } => format!( | |
| "TypeKind::Search {{ range: ::std::num::NonZeroUsize::new({}).unwrap() }}", | |
| range.get() | |
| ), | |
| TypeKind::Regex { flags, count } => { | |
| let count_lit = match count { | |
| Some(n) => format!("::std::num::NonZeroU32::new({}).unwrap()", n.get()), | |
| None => String::new(), | |
| }; | |
| let count_expr = if count.is_some() { | |
| format!("Some({count_lit})") | |
| } else { | |
| "None".to_string() | |
| }; | |
| format!( | |
| "TypeKind::Regex {{ flags: crate::parser::ast::RegexFlags {{ case_insensitive: {}, start_offset: {}, line_based: {} }}, count: {count_expr} }}", | |
| flags.case_insensitive, flags.start_offset, flags.line_based | |
| ) | |
| } | |
| TypeKind::Search { range } => format!( | |
| "TypeKind::Search {{ range: ::std::num::NonZeroUsize::new({}).unwrap() }}", | |
| range.get() | |
| ), |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/parser/codegen.rs` around lines 235 - 253, The generated code emits
libmagic_rs::parser::ast::RegexFlags which doesn't match in-crate imports; in
the TypeKind::Regex branch of src/parser/codegen.rs (the match arm constructing
the format! string for Regex), replace the path
libmagic_rs::parser::ast::RegexFlags with crate::parser::ast::RegexFlags so the
produced string uses crate::parser::ast::RegexFlags {{ case_insensitive: ...,
start_offset: ..., line_based: ... }} and keep the existing flags.field
interpolations and count handling unchanged.
| #[test] | ||
| fn test_parse_type_and_operator_regex_and_search_suffixes() { | ||
| use crate::parser::ast::{RegexFlags, TypeKind}; | ||
| use std::num::{NonZeroU32, NonZeroUsize}; | ||
|
|
||
| fn rx(case: bool, start: bool, line: bool, count: Option<u32>) -> TypeKind { | ||
| TypeKind::Regex { | ||
| flags: RegexFlags { | ||
| case_insensitive: case, | ||
| start_offset: start, | ||
| line_based: line, | ||
| }, | ||
| count: count.and_then(NonZeroU32::new), | ||
| } | ||
| } | ||
| fn sr(n: usize) -> TypeKind { | ||
| TypeKind::Search { | ||
| range: NonZeroUsize::new(n).unwrap(), | ||
| } | ||
| } | ||
|
|
||
| let cases: &[(&str, TypeKind, &str)] = &[ | ||
| ("regex", rx(false, false, false, None), ""), | ||
| ("regex/c", rx(true, false, false, None), ""), | ||
| ("regex/l", rx(false, false, true, None), ""), | ||
| ("regex/s", rx(false, true, false, None), ""), | ||
| ("regex/cl", rx(true, false, true, None), ""), | ||
| ("regex/lc", rx(true, false, true, None), ""), | ||
| ("regex/cs", rx(true, true, false, None), ""), | ||
| ("regex/csl", rx(true, true, true, None), ""), | ||
| ("regex/1l", rx(false, false, true, Some(1)), ""), | ||
| ("regex/l1", rx(false, false, true, Some(1)), ""), | ||
| ("regex/1c", rx(true, false, false, Some(1)), ""), | ||
| ("regex/256", rx(false, false, false, Some(256)), ""), | ||
| ("regex/c =", rx(true, false, false, None), "="), | ||
| ("search/256", sr(256), ""), | ||
| ("search/1", sr(1), ""), | ||
| ("search/256 =", sr(256), "="), | ||
| ]; | ||
| for &(input, ref expected_kind, expected_rest) in cases { | ||
| let (rest, (kind, op)) = parse_type_and_operator(input).expect(input); | ||
| assert_eq!(rest, expected_rest, "rest for input: {input}"); | ||
| assert!(op.is_none(), "operator for input: {input}"); | ||
| assert_eq!(&kind, expected_kind, "kind for input: {input}"); | ||
| } | ||
| } | ||
|
|
||
| #[test] | ||
| fn test_parse_type_and_operator_search_requires_range() { | ||
| // Bare `search` (no /N suffix) is a hard parse error per GNU `file`. | ||
| assert!(parse_type_and_operator("search").is_err()); | ||
| // `search/0` is also rejected -- `NonZeroUsize` makes a zero-width | ||
| // scan unrepresentable. | ||
| assert!(parse_type_and_operator("search/0").is_err()); | ||
| } | ||
|
|
||
| #[test] | ||
| fn test_parse_type_and_operator_regex_invalid_suffix() { | ||
| // Bare slash with no flags or count | ||
| assert!(parse_type_and_operator("regex/").is_err()); | ||
| // Unrecognized flag letter | ||
| assert!(parse_type_and_operator("regex/z").is_err()); | ||
| // Non-operator trailing character is still rejected | ||
| assert!(parse_type_and_operator("regex/cz").is_err()); | ||
| // regex/0 is rejected because a zero count has no valid semantics | ||
| // (our parser uses NonZeroU32 to express "user specified a count"). | ||
| assert!(parse_type_and_operator("regex/0").is_err()); | ||
| } | ||
|
|
||
| #[test] | ||
| fn test_parse_type_and_operator_regex_operator_adjacent() { | ||
| use crate::parser::ast::{Operator, RegexFlags, TypeKind}; | ||
|
|
||
| // `regex/c=` should leave `=` for parse_operator, matching the `regex/c =` | ||
| // (space-separated) behavior and mirroring `search/256=`. | ||
| let (rest, (kind, op)) = parse_type_and_operator("regex/c=").expect("regex/c="); | ||
| assert_eq!(rest, "="); | ||
| assert!(op.is_none()); | ||
| assert_eq!( | ||
| kind, | ||
| TypeKind::Regex { | ||
| flags: RegexFlags { | ||
| case_insensitive: true, | ||
| ..RegexFlags::default() | ||
| }, | ||
| count: None, | ||
| } | ||
| ); | ||
|
|
||
| // `regex/l!=` should leave `!=` for parse_operator. | ||
| let (rest, (kind, op)) = parse_type_and_operator("regex/l!=").expect("regex/l!="); | ||
| assert_eq!(rest, "!="); | ||
| assert!(op.is_none()); | ||
| assert_eq!( | ||
| kind, | ||
| TypeKind::Regex { | ||
| flags: RegexFlags { | ||
| line_based: true, | ||
| ..RegexFlags::default() | ||
| }, | ||
| count: None, | ||
| } | ||
| ); | ||
|
|
||
| // Confirm the full pipeline parses the operator correctly through | ||
| // parse_type_and_operator + parse_operator chaining. | ||
| let (rest, (_, _)) = parse_type_and_operator("regex/c=foo").expect("regex/c=foo"); | ||
| let (rest_after_op, op) = crate::parser::grammar::parse_operator(rest).expect("operator"); | ||
| assert_eq!(op, Operator::Equal); | ||
| assert_eq!(rest_after_op, "foo"); | ||
| } | ||
|
|
||
| #[test] | ||
| fn test_parse_magic_rule_regex_and_search() { | ||
| use crate::parser::ast::RegexFlags; | ||
| use std::num::{NonZeroU32, NonZeroUsize}; | ||
|
|
||
| // regex/c: case-insensitive flag | ||
| let input = r#"0 regex/c "hello" case-insensitive match"#; | ||
| let (remaining, rule) = parse_magic_rule(input).unwrap(); | ||
| assert_eq!(remaining, ""); | ||
| assert_eq!(rule.offset, OffsetSpec::Absolute(0)); | ||
| assert_eq!( | ||
| rule.typ, | ||
| TypeKind::Regex { | ||
| flags: RegexFlags { | ||
| case_insensitive: true, | ||
| ..RegexFlags::default() | ||
| }, | ||
| count: None, | ||
| } | ||
| ); | ||
| assert_eq!(rule.op, Operator::Equal); | ||
| assert_eq!(rule.value, Value::String("hello".to_string())); | ||
| assert_eq!(rule.message, "case-insensitive match"); | ||
|
|
||
| // search/256 | ||
| let input = r#"0 search/256 "MZ" DOS executable"#; | ||
| let (remaining, rule) = parse_magic_rule(input).unwrap(); | ||
| assert_eq!(remaining, ""); | ||
| assert_eq!( | ||
| rule.typ, | ||
| TypeKind::Search { | ||
| range: NonZeroUsize::new(256).unwrap(), | ||
| } | ||
| ); | ||
| assert_eq!(rule.op, Operator::Equal); | ||
| assert_eq!(rule.value, Value::String("MZ".to_string())); | ||
| assert_eq!(rule.message, "DOS executable"); | ||
|
|
||
| // regex/1l: line-based with a count of 1 (mirrors regex-eol.magic | ||
| // syntax). The count is now preserved, not discarded. | ||
| let input = r#">1 regex/1l "[0-9]+" version line"#; | ||
| let (remaining, rule) = parse_magic_rule(input).unwrap(); | ||
| assert_eq!(remaining, ""); | ||
| assert_eq!(rule.level, 1); | ||
| assert_eq!( | ||
| rule.typ, | ||
| TypeKind::Regex { | ||
| flags: RegexFlags { | ||
| line_based: true, | ||
| ..RegexFlags::default() | ||
| }, | ||
| count: NonZeroU32::new(1), | ||
| } | ||
| ); | ||
| assert_eq!(rule.message, "version line"); | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Split new regex/search tests into a dedicated test module.
Coverage is solid, but this file is already far beyond the repository size guideline. Moving this block to src/parser/grammar/tests/regex_search.rs will keep future maintenance manageable.
As per coding guidelines: src/**/*.rs: Keep source files under 500-600 lines; split larger files into focused modules.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/parser/grammar/tests/mod.rs` around lines 2279 - 2446, Extract the
regex/search tests (functions
test_parse_type_and_operator_regex_and_search_suffixes,
test_parse_type_and_operator_search_requires_range,
test_parse_type_and_operator_regex_invalid_suffix,
test_parse_type_and_operator_regex_operator_adjacent,
test_parse_magic_rule_regex_and_search) into a new test file named
src/parser/grammar/tests/regex_search.rs, preserving all use/imports (e.g.
crate::parser::ast::{RegexFlags, TypeKind}, OffsetSpec, Operator, Value) and
helper functions rx and sr; remove those tests from the large mod.rs and add mod
regex_search; to the tests module so the new file is compiled, ensuring any
relative paths remain crate::... so nothing else needs adjusting, then run cargo
test to verify.
| "search" => TypeKind::Search { | ||
| range: ::std::num::NonZeroUsize::new(1).expect("1 is nonzero"), | ||
| }, |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Remove expect() from library mapping path.
This is a constant path; avoid panic-based construction here.
Suggested patch
"search" => TypeKind::Search {
- range: ::std::num::NonZeroUsize::new(1).expect("1 is nonzero"),
+ range: ::std::num::NonZeroUsize::MIN,
},As per coding guidelines: src/**/*.rs: No unwrap() or expect() in library code.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "search" => TypeKind::Search { | |
| range: ::std::num::NonZeroUsize::new(1).expect("1 is nonzero"), | |
| }, | |
| "search" => TypeKind::Search { | |
| range: ::std::num::NonZeroUsize::MIN, | |
| }, |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/parser/types.rs` around lines 319 - 321, Replace the panic-based
construction in the TypeKind::Search mapping by removing the expect() call and
using a safe, explicit non-panicking construction; specifically, change
::std::num::NonZeroUsize::new(1).expect("1 is nonzero") to an explicit unsafe
construction like unsafe { ::std::num::NonZeroUsize::new_unchecked(1) } (with a
brief comment that 1 is statically non-zero) so the library code doesn't call
expect()/panic.
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/src/ast-structures.md`:
- Around line 454-456: Update the documentation entry describing search anchor
behavior so it accurately states that the anchor advances to the end of the
matched pattern (match-end) rather than by the full search window; specifically
modify the bullet that currently reads "Anchor advances by the entire search
window" to "Anchor advances to the end of the matched pattern (match-end)".
Ensure the description still notes the return type `Value::String` for matched
bytes and that only `Equal` and `NotEqual` operators are supported so consumers
of the Search/anchor semantics have correct expectations.
In `@docs/src/parser.md`:
- Line 353: The docs sentence incorrectly lists "lookaheads" as an example of
zero-width matches; update the wording near the `Value::String("")` example so
it only mentions supported constructs (e.g., `^`, `$`, `a*`) and remove any
reference to look-around syntax, since `regex::bytes::Regex` does not support
lookaheads; ensure the sentence still explains that zero-width matches are
preserved as `Value::String("")` and distinguish them from genuine misses.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 947f1be6-454e-4b11-9a53-23e230972195
⛔ Files ignored due to path filters (1)
docs/ARCHITECTURE.mdis excluded by none and included by none
📒 Files selected for processing (3)
docs/src/ast-structures.mddocs/src/compatibility.mddocs/src/parser.md
| - Returns `Value::String` containing the matched bytes if found within range | ||
| - Anchor advances by the entire search window regardless of where the match was found | ||
| - Only supports `Equal` and `NotEqual` operators |
There was a problem hiding this comment.
Fix Search anchor advancement semantics.
The docs currently say the anchor advances by the full search window. This should state it advances to the end of the matched pattern (match-end), which is the implemented behavior.
As per coding guidelines: "docs/**: Review documentation quality, completeness, and accuracy. Ensure comprehensive API documentation, usage examples, and migration guides. Focus on user experience and developer onboarding."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/src/ast-structures.md` around lines 454 - 456, Update the documentation
entry describing search anchor behavior so it accurately states that the anchor
advances to the end of the matched pattern (match-end) rather than by the full
search window; specifically modify the bullet that currently reads "Anchor
advances by the entire search window" to "Anchor advances to the end of the
matched pattern (match-end)". Ensure the description still notes the return type
`Value::String` for matched bytes and that only `Equal` and `NotEqual` operators
are supported so consumers of the Search/anchor semantics have correct
expectations.
|
|
||
| - Patterns are compiled with multi-line mode always enabled (matching libmagic's unconditional `REG_NEWLINE`), so `^` and `$` match at line boundaries and `.` does not match `\n`. | ||
| - The scan window is always capped at 8192 bytes regardless of the `count` value. | ||
| - Zero-width matches (`^`, `a*`, lookaheads) are preserved as `Value::String("")` and distinguished from genuine misses. |
There was a problem hiding this comment.
Remove unsupported “lookaheads” from regex semantics examples.
regex::bytes::Regex does not support look-around syntax, so documenting lookaheads here is inaccurate and can mislead rule authors. Keep zero-width examples to supported constructs (e.g., ^, $, a*).
As per coding guidelines: "docs/**: Review documentation quality, completeness, and accuracy. Ensure comprehensive API documentation, usage examples, and migration guides. Focus on user experience and developer onboarding."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/src/parser.md` at line 353, The docs sentence incorrectly lists
"lookaheads" as an example of zero-width matches; update the wording near the
`Value::String("")` example so it only mentions supported constructs (e.g., `^`,
`$`, `a*`) and remove any reference to look-around syntax, since
`regex::bytes::Regex` does not support lookaheads; ensure the sentence still
explains that zero-width matches are preserved as `Value::String("")` and
distinguish them from genuine misses.
Post-review fixes from PR #214 internal review round 3. Grammar: * `src/parser/grammar/mod.rs`: reject a second numeric count in a regex suffix as a hard parse error (e.g., `regex/1l2l`, `regex/1c2l`, `regex/l1l2`). Libmagic accepts these with a "multiple ranges" stderr warning; we prefer a hard error so magic-file bugs surface at parse time rather than silently using the last-seen count. Tests: * `src/parser/grammar/tests/mod.rs`: add `test_parse_type_and_operator_regex_rejects_duplicate_count` covering the three duplicate-count forms above plus a `regex/l0` rejection. * `src/evaluator/types/regex.rs`: add 8192-byte boundary regression tests: - `test_read_regex_pattern_ending_exactly_at_byte_8192_matches` -- pattern at indices 8186..8192 must match (last byte is 8191, inside the cap). - `test_read_regex_pattern_starting_at_byte_8192_misses` -- pattern at index 8192 (one byte past cap) must miss. - `test_read_regex_pattern_straddling_8192_boundary_misses` -- pattern at indices 8190..8196 (first 2 bytes inside cap, last 4 past) must miss. - `test_read_regex_line_based_respects_8192_cap` -- line mode with count=None on a 9000-byte buffer must cap at 8192. * `src/evaluator/types/regex.rs`: add classic-Mac CR terminator regression tests: - `test_read_regex_line_based_bare_cr_terminator` -- bare `\r` alone counts as a single line terminator (matches GNU `file`'s memchr2('\n', '\r') loop). - `test_read_regex_line_based_mixed_terminators` -- buffer with LF, CRLF, and CR each counting as one line; count=2 stops after the second terminator (a CRLF) so line 3 is not scanned. These close the "8192 boundary untested" and "bare-CR untested" coverage gaps flagged in round 3 review. The duplicate-count rejection closes the "silent last-wins" divergence from libmagic that was also flagged. Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Post-merge cleanup of doc drift found in PR #214 review round 3. src/parser/ast.rs: * Remove the "effective byte cap of `min(n * 80, 8192)`" clause from the `TypeKind::Regex::count` rustdoc. The `n * 80` heuristic does not exist in `compute_window`; the code applies only the flat 8192 cap and walks line terminators. Reword to describe the actual behavior (8192 cap always applies regardless of count). src/evaluator/types/mod.rs: * Rewrite the `bytes_consumed_with_pattern` docstring section on pattern-bearing types. The old doc claimed Search's consumed distance "is the entire search window regardless of where the match was found" -- this is the pre-fix window-end bug, not the current match-end behavior. Rewrite to describe GNU `file` semantics: regex advances by `m.end()` (or `m.start()` with `/s`), search advances by `match_idx + pattern.len()`. Both arms fire `debug_assert!` on engine-invariant violations. * Rewrite the `TypeReadError::UnsupportedType` variant doc. The old text listed regex/date/timestamp as "reserved for future types not yet evaluatable" -- all three are now implemented, and the variant is actively used for regex pattern compile errors, missing pattern operands, and non-equality operators on pattern-bearing types. Rewrite to describe the current reporting contract. No code changes; this commit only corrects documentation that contradicts the shipped implementation. Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Post-merge coverage fix from PR #214 review round 3. All 8 corpus tests in tests/regex_search_corpus_tests.rs previously used programmatic rule construction (`regex_rule`/`search_rule`), so the grammar-to-evaluator pipeline for the new regex/search flag syntax was never tested end-to-end via `parse_text_magic_file`. New tests: * `test_json1_corpus_parser_roundtrip` — parses a plain `0 regex "^\\s*[\\{\\[]" JSON text data` magic rule via `parse_text_magic_file` and evaluates it against `json1.testfile`. Exercises the default-flags, no-count regex path from text through to the evaluator. * `test_regex_flag_parser_roundtrip_case_insensitive` — same pattern with `regex/c` instead of `regex`. Confirms the `/c` flag threads through `parse_type_and_operator` into `RegexFlags::case_insensitive`. * `test_search_parser_roundtrip_with_range` — `search/32 "ABC"` parsed from text and evaluated against `searchbug.testfile`. Exercises the `NonZeroUsize` range path through the grammar layer. * `test_search_parser_rejects_bare_search` — magic-file text `search "ABC"` (no /N) must be a parse error, confirming the `NonZeroUsize` constraint is enforced at parse time, not just at programmatic construction time. These close the "parser round-trip untested for new flag syntax" coverage gap flagged in review. Existing `regex_rule`/`search_rule` programmatic tests are kept for the searchbug hierarchical scan which still depends on features the parser doesn't yet support (`use`/`name`/`offset` directives). Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
…sign Post-merge doc cleanup from PR #214 review round 3. docs/solutions/.../implementing-variable-width-typekind-variant.md: * Rewrite the `## Solution` section so the code blocks describe the final (V5) design instead of the obsolete V0 signatures. The old snippets showed `read_regex` with `start_of_line: bool`, `build_regex` with `^(?:{pattern})` wrapping, and `read_search` with `Option<usize>` — all three were removed in the completion pass but the solution doc was never updated. The `# Prevention` section was already correct; only the code blocks in `## Solution` drifted. * New code blocks describe `build_regex` (unconditional multi-line + no dot-newline), `compute_window` (8192 cap + line-terminator walk), `read_regex`/`read_search` with their current signatures (`RegexFlags`, `Option<NonZeroU32>`, `NonZeroUsize`), and the `/s`-flag anchor-advance in `regex_bytes_consumed`. * Rewrite the `## Testing` section to list the actual test coverage: line-based window tests, 8192 boundary guards, zero-width match distinction, `/s` flag tests, non-equality operator rejection, parser last-wins rejection, corpus integration tests. AGENTS.md: * Development Phases v0.3 entry used to list `/s`, `/l` line-count, `regex/Nl`, search range enforcement, and 8192 default as planned work. All are now implemented. Replace with the actual v0.3 scope: Aho-Corasick multi-pattern optimization, compiled-regex caching, and `!:mime`/`!:ext` directive evaluation. docs/src/evaluator.md: * The Type Reading section described `/l` as "multi-line start-of-line anchoring" — this is the pre-fix semantic, not the current one. Rewrite to list all three regex flags (`/c`, `/s`, `/l`) with correct meanings, note that multi-line mode is unconditional, and describe the 8192-byte hard cap. Also update the Search bullet to mention the mandatory `NonZeroUsize` range. .github/copilot-instructions.md: * Remove "regex type is planned for future releases and is not yet implemented (#39)" note — regex and search are both implemented. * Remove the `BinaryRegex` trait sketch (binary-safe matching is done via `regex::bytes::Regex` directly, not through a wrapper trait). * Update the Currently Implemented list to mention Regex and Search with their flag sets. * Update the v1.0+ Planned Features list to drop regex/search items and add Aho-Corasick + named tests. No code changes; this commit only brings documentation in line with the shipped implementation. Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Closes deferred review suggestion #22 from PR #214 round-3. Before: `type_keyword_to_kind` returned a placeholder `TypeKind::Search { range: NonZeroUsize::new(1).unwrap() }` for the `search` keyword and a bare `TypeKind::Regex { .. }` for `regex`, both of which were "safe" only because grammar/mod.rs always overwrites them via direct construction before the value escapes. That's an invariant-by-convention, which is exactly the anti-pattern this PR has been trying to remove via `NonZeroUsize` and `NonZeroU32` in the AST. After: the function returns `Option<TypeKind>` and yields `None` for both `regex` and `search`. Grammar's `_ => type_keyword_to_kind(type_name)` branch now uses `.expect(...)` with an explanatory message that documents the "regex/search handled directly in grammar arms above" invariant — so a future caller who adds a new suffix-required type without wiring the grammar arm first will fail loudly at the `expect`, rather than silently producing a placeholder that gets eaten by downstream code. Tests updated: - `test_type_keyword_to_kind_*` tests wrap each expected `TypeKind` in `Some(...)`. - `test_type_keyword_to_kind_regex_and_search_return_none` is new — it pins both halves of the split invariant (parse_type_keyword recognizes the keyword, type_keyword_to_kind returns None). - `test_pstring_keyword_defaults_to_one_byte_width` uses `.expect()` on the Option return. - `test_roundtrip_all_keywords` is split into two loops: one for convertible keywords (must produce `Some`) and one for regex/search (must produce `None`), so both sides of the contract are asserted explicitly. Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Closes deferred review suggestions #19 and #20 from PR #214 round-3. ## #19: RegexCount enum replaces line_based bool + count Option The previous design encoded the three regex scan-window modes across two separate fields: RegexFlags { line_based: bool, .. } TypeKind::Regex { count: Option<NonZeroU32>, .. } This admitted four `(line_based, count)` combinations while libmagic only has three semantic modes, leaving a degenerate `line_based: true, count: None` state that in practice collapsed to the same behavior as `line_based: false, count: None`. It also forced readers to remember that `count` means bytes in one branch and lines in another. Replace both fields with a single `RegexCount` enum: ```rust pub enum RegexCount { Default, // plain `regex` (8192-byte default) Bytes(NonZeroU32), // `regex/N` (byte-bounded) Lines(Option<NonZeroU32>), // `regex/l` or `regex/Nl` } ``` - The three libmagic scan modes map 1:1 to the three variants. - The degenerate `line_based: true, count: None` state is now unrepresentable: line-mode is exclusively the `Lines(_)` variant. - The byte-vs-line distinction is encoded at the type level, so match arms at every call site (`compute_window`, strength calculation, codegen, property tests) read exhaustively rather than relying on a sibling `bool`. - `Lines(None)` preserves the `regex/l` shorthand (no explicit line count), which walks terminators to the end of the 8192-byte capped window. `RegexFlags` is now just `{ case_insensitive, start_offset }`. The `/l` modifier moves from being a flag into being the `Lines` variant of `RegexCount`. Ripple: - `compute_window` (`src/evaluator/types/regex.rs`): rewrite the branch dispatch to match on `RegexCount` instead of `flags.line_based`. - `read_regex` and `regex_bytes_consumed`: take `RegexCount` instead of `(flags, Option<NonZeroU32>)`. - `bytes_consumed_with_pattern`: pass `*count` through via `Copy`. - `strength.rs`: match on `RegexCount` for the "explicit count = more specific = +5 bonus" logic. - `parser/codegen.rs::serialize_type_kind`: emit the three variants explicitly with `NonZeroU32::new(n).expect("nonzero")`. - `parser/grammar/mod.rs`: accumulate `line_based: bool` and `count_value: Option<NonZeroU32>` in the suffix loop, then collapse to the appropriate `RegexCount` variant at the end. - `tests/property_tests.rs::arb_type_kind`: generator now produces all three variants with weighted odds. - All test constructions updated across engine tests, types tests, grammar tests, corpus tests, and the evaluator_tests hierarchical rule fixture. ## #20: REGEX_MAX_BYTES relocated to evaluator::types::regex The constant is runtime evaluation policy, not AST shape — it lives where `compute_window` reads it. Moving it out of `parser::ast` also removes a latent coupling to the build-script compilation unit (`ast.rs` is shared with `build.rs` per GOTCHAS S1.1). The constant is now `pub(crate)` in `src/evaluator/types/regex.rs` with an in-comment explanation of the rationale. The rustdoc on `TypeKind::Regex::count` used to link to `[REGEX_MAX_BYTES]` via intra-doc link; rewritten to cite the "8192-byte cap" inline so the AST module has no dependency on the evaluator constant. Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
Closes deferred review suggestion #23 from PR #214 round-3. Before: grammar/mod.rs was 951 lines, past the 800-line hard cap documented in AGENTS.md coding-style.md. The bulk of the growth was the `parse_type_and_operator` function (over 250 lines) which handled pstring, regex, search, and the attached bitwise-AND operator inline. After: grammar/mod.rs is 793 lines. The extracted suffix parsers live in a new `grammar/type_suffix.rs` submodule (232 lines) with four functions: - `parse_pstring_suffix(input)` — the `/[BHhLl][J]?` or bare `/J` pstring length-width parser. Previously inline as a standalone fn; now lives next to its peers. - `parse_regex_suffix(input, suffix_rest)` — the "any order, last count wins but duplicates rejected" regex modifier parser. Returns `(RegexFlags, RegexCount)` with the tri-state collapse inlined. - `parse_search_suffix(input, suffix_rest)` — the mandatory `NonZeroUsize` range parser for `search/N`. - `parse_attached_operator(input)` — the post-type `&<mask>` / `&` / `&&` handler. This is the new extraction; `parse_type_and_operator` previously had ~20 lines of inline match-on-`strip_prefix('&')` logic. `parse_type_and_operator` in grammar/mod.rs is now a thin orchestrator: call `parse_type_keyword`, dispatch to the appropriate suffix parser based on `type_name`, then `parse_attached_operator`, then build the final `TypeKind` in a short match. No behavior change. The old `parse_type_and_operator` had `#[allow(clippy::too_many_lines)]`; that attribute is now removed since the function fits under the default threshold. Refs #39 Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
## Summary Follow-up cleanup for [PR #214](#214) (which closed #39). The review-round-3 findings on #214 landed after it merged, so this PR carries the fixes as a separate unit of work. All fixes are internal review findings — no user-visible behavior change beyond the grammar now hard-rejecting duplicate regex counts. ## Changes ### Parser: reject duplicate regex counts (`5ca9a6f`) - `src/parser/grammar/mod.rs`: `regex/1l2l`, `regex/1c2l`, `regex/l1l2` are now hard parse errors. Libmagic accepts them with a `"multiple ranges"` stderr warning; we prefer a hard error so magic-file bugs surface at parse time rather than silently using the last-seen count. - Test: `test_parse_type_and_operator_regex_rejects_duplicate_count`. ### Tests: 8192-byte boundary + bare-CR line terminator (`5ca9a6f`) - Exact-boundary regression tests for the `FILE_REGEX_MAX` cap: pattern ending at byte 8191 (must match), starting at 8192 (must miss), straddling the boundary (must miss), line-based scan respecting the cap. - Bare-CR (classic Mac) line terminator tests: `\r` alone counts as a single terminator, mixed LF/CRLF/CR buffer terminator walking. ### Rustdoc: fix `RegexFlags::count` and `bytes_consumed_with_pattern` drift (`5b9df07`) - `src/parser/ast.rs`: remove the "effective byte cap of `min(n * 80, 8192)`" clause from `TypeKind::Regex::count` rustdoc. No such heuristic exists in `compute_window`. - `src/evaluator/types/mod.rs`: rewrite the `bytes_consumed_with_pattern` docstring section on pattern-bearing types. The old wording claimed Search advances by the window size — that was the pre-fix bug, not current behavior. - `src/evaluator/types/mod.rs`: rewrite the `TypeReadError::UnsupportedType` variant doc. Old text listed regex/date/timestamp as "reserved for future types not yet evaluatable"; all three are now implemented. ### Tests: parser round-trip corpus tests (`648f1a8`) - `tests/regex_search_corpus_tests.rs`: 4 new tests that exercise the grammar → evaluator pipeline for regex/search flag syntax via `parse_text_magic_file`. Previously all corpus tests used programmatic rule construction, so the parser-side of the new flag syntax was never tested end-to-end. - Covers: plain `regex` rule, `regex/c` case-insensitive flag, `search/N` range, and bare `search` parse rejection. ### Docs: rewrite solution doc and clean up stale references (`aeb61d6`) - `docs/solutions/integration-issues/implementing-variable-width-typekind-variant.md`: rewrite the `## Solution` section so the code blocks describe the final (V5) design. Old snippets showed `read_regex` with `start_of_line: bool`, `build_regex` with `^(?:{pattern})` wrapping, and `read_search` with `Option<usize>` — all three were removed in the completion pass but the doc was never updated. - `AGENTS.md`: Development Phases v0.3 entry listed items that are already implemented. Replaced with actual v0.3 scope (Aho-Corasick, regex caching, `!:mime`/`!:ext` directives). - `docs/src/evaluator.md`: Type Reading section described `/l` as "multi-line start-of-line anchoring". Rewritten to list all three regex flags correctly and note the 8192-byte cap. - `.github/copilot-instructions.md`: remove "regex type is planned for future releases" note; remove `BinaryRegex` trait sketch; update Currently Implemented and Planned Features lists. ## Test Plan - [x] `mise exec -- cargo nextest run` — 1218/1218 pass (+11 new regression tests) - [x] `mise exec -- cargo clippy --all-targets -- -D warnings` — clean - [x] `mise exec -- cargo test --doc` — 167/167 doctests pass - [x] `mise exec -- cargo fmt --check` — clean - [x] `mise exec -- just ci-check` — full CI suite green locally - [ ] CodeRabbit review clean - [ ] Manual smoke: verify `regex/1l2l` returns a parse error (hit by `test_parse_type_and_operator_regex_rejects_duplicate_count`) - [ ] Manual smoke: verify a regex pattern past byte 8192 is NOT matched by an un-counted `regex` rule (hit by the three 8192 boundary tests) ## Review Findings Addressed From the round-3 review on PR #214 (after merge): | # | Finding | Status | |---|---|---| | C1 | bare-CR line terminator doc + test | ✅ docstring updated, tests added (`5ca9a6f`) | | C2 | `n * 80` claim in `TypeKind::Regex::count` rustdoc | ✅ removed (`5b9df07`) | | C3 | `bytes_consumed_with_pattern` docstring drift | ✅ rewritten (`5b9df07`) | | C4 | Solution doc `## Solution` section drift | ✅ rewritten (`aeb61d6`) | | C5 | Silent "last count wins" on `regex/1l2l` | ✅ hard parse error (`5ca9a6f`) | | I7 | Missing 8192-byte boundary tests | ✅ 4 new tests (`5ca9a6f`) | | I12 | Parser→evaluator round-trip untested for new flag syntax | ✅ 4 round-trip tests (`648f1a8`) | | I13 | `/l` = "multi-line start-of-line anchoring" in docs/src/evaluator.md | ✅ fixed (`aeb61d6`) | | I14 | AGENTS.md v0.3 roadmap stale | ✅ fixed (`aeb61d6`) | | I15 | `TypeReadError::UnsupportedType` variant doc | ✅ rewritten (`5b9df07`) | | I16 | `.github/copilot-instructions.md` stale regex reference | ✅ fixed (`aeb61d6`) | Deferred (design-level, warrant their own discussion): - Suggestion #19: `RegexCount { Default, Bytes, Lines }` enum refactor - Suggestion #20: move `REGEX_MAX_BYTES` from `ast.rs` to `evaluator::types::regex` - Suggestion #21: rename `line_based` → `line_counted` - Suggestion #22: `type_keyword_to_kind` signature change - Suggestion #23: split `src/parser/grammar/mod.rs` (now 914 lines) Refs #39 --------- Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io> Co-authored-by: dosubot[bot] <131922026+dosubot[bot]@users.noreply.github.com>
## 🤖 New release
* `libmagic-rs`: 0.5.0 -> 0.6.0 (⚠ API breaking changes)
### ⚠ `libmagic-rs` breaking changes
```text
--- failure constructible_struct_adds_field: externally-constructible struct adds field ---
Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/constructible_struct_adds_field.ron
Failed in:
field MagicRule.value_transform in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:1189
field MagicRule.value_transform in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:1189
field MagicRule.value_transform in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:1189
--- failure copy_impl_added: type now implements Copy ---
Description:
A public type now implements Copy, causing non-move closures to capture it by reference instead of moving it.
ref: rust-lang/rust#100905
impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/copy_impl_added.ron
Failed in:
libmagic_rs::mime::MimeMapper in /tmp/.tmpwFvgw1/libmagic-rs/src/mime.rs:98
--- failure enum_marked_non_exhaustive: enum marked #[non_exhaustive] ---
Description:
A public enum has been marked #[non_exhaustive]. Pattern-matching on it outside of its crate must now include a wildcard pattern like `_`, or it will fail to compile.
ref: https://doc.rust-lang.org/cargo/reference/semver.html#attr-adding-non-exhaustive
impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/enum_marked_non_exhaustive.ron
Failed in:
enum OffsetSpec in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:198
enum OffsetSpec in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:198
enum OffsetSpec in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:198
enum LibmagicError in /tmp/.tmpwFvgw1/libmagic-rs/src/error.rs:15
enum LibmagicError in /tmp/.tmpwFvgw1/libmagic-rs/src/error.rs:15
enum IoError in /tmp/.tmpwFvgw1/libmagic-rs/src/io/mod.rs:26
enum Operator in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:838
enum Operator in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:838
enum Operator in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:838
enum TypeReadError in /tmp/.tmpwFvgw1/libmagic-rs/src/evaluator/types/mod.rs:56
enum ParseError in /tmp/.tmpwFvgw1/libmagic-rs/src/error.rs:74
enum ParseError in /tmp/.tmpwFvgw1/libmagic-rs/src/error.rs:74
enum Value in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:965
enum Value in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:965
enum Value in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:965
enum TypeKind in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:398
enum TypeKind in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:398
enum TypeKind in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:398
enum EvaluationError in /tmp/.tmpwFvgw1/libmagic-rs/src/error.rs:148
enum EvaluationError in /tmp/.tmpwFvgw1/libmagic-rs/src/error.rs:148
--- failure enum_struct_variant_field_added: pub enum struct variant field added ---
Description:
An enum's exhaustive struct variant has a new field, which has to be included when constructing or matching on this variant.
ref: https://doc.rust-lang.org/reference/attributes/type_system.html#the-non_exhaustive-attribute
impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/enum_struct_variant_field_added.ron
Failed in:
field base_relative of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:251
field adjustment_op of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:266
field result_relative of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:272
field base_relative of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:251
field adjustment_op of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:266
field result_relative of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:272
field base_relative of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:251
field adjustment_op of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:266
field result_relative of variant OffsetSpec::Indirect in /tmp/.tmpwFvgw1/libmagic-rs/src/parser/ast.rs:272
--- failure function_missing: pub fn removed or renamed ---
Description:
A publicly-visible function cannot be imported by its prior path. A `pub use` may have been removed, or the function itself may have been renamed or removed entirely.
ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/function_missing.ron
Failed in:
function libmagic_rs::parser::grammar::is_empty_line, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:1025
function libmagic_rs::parser::grammar::parse_strength_directive, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:846
function libmagic_rs::parser::grammar::parse_type_and_operator, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:683
function libmagic_rs::parser::grammar::parse_offset, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:179
function libmagic_rs::parser::parse_offset, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:179
function libmagic_rs::parser::grammar::parse_comment, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:1004
function libmagic_rs::parser::grammar::parse_message, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:810
function libmagic_rs::parser::grammar::parse_value, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:633
function libmagic_rs::parser::grammar::parse_number, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:133
function libmagic_rs::parser::parse_number, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:133
function libmagic_rs::parser::grammar::has_continuation, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:1060
function libmagic_rs::parser::grammar::parse_magic_rule, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:946
function libmagic_rs::parser::grammar::parse_rule_offset, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:779
function libmagic_rs::parser::grammar::is_comment_line, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:1042
function libmagic_rs::parser::grammar::is_strength_directive, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:902
function libmagic_rs::parser::grammar::parse_type, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:749
function libmagic_rs::parser::grammar::parse_operator, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:227
--- failure function_parameter_count_changed: pub fn parameter count changed ---
Description:
A publicly-visible function now takes a different number of parameters.
ref: https://doc.rust-lang.org/cargo/reference/semver.html#fn-change-arity
impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/function_parameter_count_changed.ron
Failed in:
libmagic_rs::evaluator::evaluate_single_rule now takes 3 parameters instead of 2, in /tmp/.tmpwFvgw1/libmagic-rs/src/evaluator/engine/mod.rs:196
--- failure inherent_method_missing: pub method removed or renamed ---
Description:
A publicly-visible method or associated fn is no longer available under its prior name. It may have been renamed or removed entirely.
ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/inherent_method_missing.ron
Failed in:
FileBuffer::create_symlink, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/io/mod.rs:326
EvaluationContext::increment_recursion_depth, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/evaluator/mod.rs:114
EvaluationContext::decrement_recursion_depth, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/evaluator/mod.rs:130
EvaluationContext::increment_recursion_depth, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/evaluator/mod.rs:114
EvaluationContext::decrement_recursion_depth, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/evaluator/mod.rs:130
--- failure module_missing: pub module removed or renamed ---
Description:
A publicly-visible module cannot be imported by its prior path. A `pub use` may have been removed, or the module may have been renamed, removed, or made non-public.
ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/module_missing.ron
Failed in:
mod libmagic_rs::parser::grammar, previously in file /tmp/.tmphvgzOh/libmagic-rs/src/parser/grammar/mod.rs:4
--- failure struct_marked_non_exhaustive: struct marked #[non_exhaustive] ---
Description:
A public struct has been marked #[non_exhaustive], which will prevent it from being constructed using a struct literal outside of its crate. It previously had no private fields, so a struct literal could be used to construct it outside its crate.
ref: https://doc.rust-lang.org/cargo/reference/semver.html#attr-adding-non-exhaustive
impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.46.0/src/lints/struct_marked_non_exhaustive.ron
Failed in:
struct EvaluationConfig in /tmp/.tmpwFvgw1/libmagic-rs/src/config.rs:42
```
<details><summary><i><b>Changelog</b></i></summary><p>
<blockquote>
## [0.6.0] - 2026-04-25
### Features
- **parser**: Add Date and QDate types with serialization support
([#165](#165))
- **parser**: Implement pstring (Pascal string) type
([#170](#170))
- **parser**: Implement pstring multi-byte length prefix variants (/B,
/H, /h, /L, /l, /J)
([#183](#183))
- **evaluator**: Add debug-level tracing for skipped rules
([#184](#184))
- **evaluator**: Implement indirect offset resolution
([#37](#37))
([#199](#199))
- **evaluator**: Implement relative offset resolution
([#38](#38))
([#211](#211))
- **deps**: Add new skills to actionbook/rust-skills and
trailofbits/skills
- **evaluator**: Regex and search types (closes #39)
([#214](#214))
- Implement libmagic meta-type directives and format substitution
([#42](#42))
([#230](#230))
### Bug Fixes
- **regex**: PR #214 follow-up review findings
([#215](#215))
- Load and correctly evaluate /usr/share/file/magic/filesystems and
adjacent magic files
([#233](#233))
### Documentation
- **gotchas**: Clarify requirements for adding TypeKind variants
### Miscellaneous Tasks
- Rename .coderabbitai.yaml to .coderabbit.yaml
- **Mergify**: Configuration update
([#173](#173))
- Update .gitignore to exclude local AI assistant files
- **mergify**: Upgrade configuration to current format
([#205](#205))
- Resolve all pending TODO items
([#212](#212))
- **mergify**: Upgrade configuration to current format
([#231](#231))
<!-- generated by git-cliff -->
### Security
- **io**: Close TOCTOU race in `FileBuffer::new` metadata validation
(CWE-367). `validate_file_metadata` now uses `File::metadata()` on the
open descriptor instead of re-canonicalizing the path, so an attacker
cannot swap the path between `open_file` and validation. Error paths now
report the caller-supplied path rather than the canonicalized variant.
- **cli**: Remove relative-path fallbacks from `default_magic_file_path`
(CWE-426). `./missing.magic`, `./third_party/magic.mgc`, and the
`CI`/`GITHUB_ACTIONS` env-var branch no longer resolve against the
process cwd. CI pipelines must pass `--magic <path>` explicitly.
- **evaluator**: `build_regex` now bounds `size_limit` and
`dfa_size_limit` to 1 MiB (`REGEX_COMPILE_SIZE_LIMIT`) to reject
compile-time DoS patterns (CWE-1333) from adversarial magic files.
### Features
- **parser**: Implement meta-type directives: `name`/`use` subroutines,
`default`/`clear` per-level fallback, and `indirect` re-evaluation.
`parse_text_magic_file` now returns `ParsedMagic { rules, name_table }`
(breaking change from `Vec<MagicRule>`). Named subroutines are hoisted
into `NameTable` at load time and dispatched via `RuleEnvironment` in
the evaluator. Recursion is bounded by
`EvaluationConfig::max_recursion_depth`. Resolves
[#42](#42).
- **evaluator**: Thread-local regex compile cache eliminates the
double-compile paid by every successful regex match.
`regex_bytes_consumed` now reuses the compiled `Regex` from `read_regex`
instead of recompiling the pattern to derive the anchor advance. The
cache is reset at the start of every `evaluate_rules_with_config` call,
bounding memory to one evaluation.
- **config**: `EvaluationConfig` is now `#[non_exhaustive]`; new
builder-style setters (`with_max_recursion_depth`,
`with_max_string_length`, `with_stop_at_first_match`, `with_mime_types`,
`with_timeout_ms`) let external crates construct configurations without
struct literals.
- **parser**: `MagicRule::new()` smart constructor with
`::with_children()`, `::with_strength_modifier()`, `::with_level()`
builder methods and a `::validate()` method enforcing structural
invariants (non-empty message, `level <= MAX_LEVEL`, children nested
strictly deeper than parent). New `MagicRuleValidationError` error type.
- **parser**: `RegexFlags::with_case_insensitive()` and
`::with_start_offset()` builder methods.
### Refactor
- **engine**: Extract `evaluate_pattern_rule()` and
`evaluate_value_rule()` helpers from
`evaluate_single_rule_with_anchor`'s 90-line body. Dispatch is now a
two-arm type-category split; each helper has focused rustdoc on
semantics and invariants.
- **types**: Replace the `_ =>` catch-all in
`bytes_consumed_with_pattern` with an explicit listing of the
fixed-width `TypeKind` variants. Adding a new variable-width variant
without updating this match is now a compile error instead of a silent
relative-offset anchor corruption in release builds.
- **parser**: Split the 185-line `type_keyword_to_kind` match into
per-family helpers (`byte_family`, `short_family`, `long_family`,
`quad_family`, `float_family`, `double_family`, `date_family`,
`qdate_family`, `string_family`). Drops the
`#[allow(clippy::too_many_lines)]` attribute.
- **main**: `main()` returns `std::process::ExitCode` instead of calling
`process::exit`, so destructors run on the happy path. Ctrl-C
`AtomicBool` flag uses `Ordering::Relaxed` instead of `SeqCst`.
- **grammar**: `parse_strength_directive` uses nom 8's `preceded` +
`Parser::map` instead of the legacy `map(pair(char(...), parse_number),
|(_, n)| ...)` pattern.
- **output**: Add `#[serde(skip_serializing_if = "Option::is_none",
default)]` to public `Option<T>` fields so JSON output no longer emits
`"field": null` for unset optional values.
### Documentation
- **lib**: Add `# Security` sections to
`MagicDatabase::with_builtin_rules`, `::with_builtin_rules_and_config`,
`::load_from_file`, and `::load_from_file_with_config` warning about the
unbounded default timeout and recommending
`EvaluationConfig::performance()` for untrusted input.
- **lib**: Document `MagicDatabase: Send + Sync` for parallel scanning.
- **README**: Update `TypeKind` enum example to match the current AST,
add `regex` and `search/N` to the supported types table, add pre-1.0 API
stability warning, correct the roadmap to mark v0.2-v0.4 as shipped.
- **AGENTS.md**: Relabel "Currently Implemented (v0.1.0)" and "Current
Limitations (v0.1.0)" to v0.5.0 and rewrite the Development Phases
section to reflect actual shipped scope.
### Testing
- Security regression tests for S-H1 (planted-magic-file in cwd), S-H2
(TOCTOU path-swap contract), S-M2 (pathological regex bounded runtime),
S-L2 (codegen message escape round-trip), and GOTCHAS S13.1
(`EvaluationConfig::default()` unbounded timeout invariant).
- Backspace message concatenation regression tests for first-match,
consecutive, and empty-rest edge cases.
- `MagicRule::validate()` tests covering empty message, child level
invariant, and max-depth rejection.
- `RegexCache` population/clear/reuse tests.
### Breaking Changes
- **parser**: `parse_text_magic_file` return type changed from
`Result<Vec<MagicRule>, ParseError>` to `Result<ParsedMagic,
ParseError>`. Callers must destructure `ParsedMagic { rules, name_table
}`. Low-level callers that only need the rule list can use
`parsed.rules`. `load_magic_file` and `load_magic_directory` return the
same new type.
</blockquote>
</p></details>
---
This PR was generated with
[release-plz](https://github.com/release-plz/release-plz/).
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Summary
Implements
TypeKind::RegexandTypeKind::Searchfor libmagic rule evaluation, closing the full acceptance criteria for #39. This is the complete implementation including the spec-compliance follow-up work that was flagged in pre-PR review.Highlights:
regex::bytes::Regexwith the full/c,/s,/lflag set plus optional numeric count. Scan window capped at 8192 bytes (FILE_REGEX_MAX). Multi-line mode is unconditional (matching libmagic'sREG_NEWLINE).memchr::memmem::findwith mandatoryNonZeroUsizerange (per GNUfilemagic(5)). Anchor advances to match-end, matchingsoftmagic.c::FILE_SEARCH.RegexFlagsstruct:case_insensitive,start_offset,line_based— designed as a struct so future flag additions don't cascade through 10+ exhaustive-match sites.apply_operatorand drive Equal/NotEqual from a structuredOption<Value>returned byread_pattern_match. Non-equality operators on pattern types are rejected asTypeReadError::UnsupportedTyperather than silently producing lexicographic comparisons against the pattern literal.^,a*, lookaheads) are preserved asSome(Value::String(""))and distinguished from genuine misses (None).Ground truth for every non-obvious semantic was verified against GNU
file'ssrc/softmagic.cbefore implementing (FILE_REGEX,FILE_SEARCH,moffset(),alloc_regex,parse_string_modifier,FILE_REGEX_MAX).Issue #39 Acceptance Criteria
TypeKindextended withRegex/Searchvariants/c,/s,/l, count)regex::bytes::Regexfor binary-safe matchingNonZeroUsize/c+/l+/sTest Plan
mise exec -- just ci-checkgreen (run locally: 1207/1207 tests pass, clippy clean, fmt clean, doctests pass)echo '{"ok": true}' | cargo run -- --use-builtin -still detects JSON via built-ins (regex rules not added to built-in set, so this should still fall back to string match behavior)third_party/testsfixtures on a fresh checkoutregex-eol.magicscenario passes end-to-end (exercised bytests/evaluator_tests.rs::test_regex_eol_corpuswhich was rewritten for the new flag semantics)regex/1landregex/l1both parse identically (per GNUfile"interleaved order, last range wins")search/0and baresearchare parse errors (not silent acceptance)regex/Ns-style non-equality operators are rejected (regex < "[0-9]+"should error, not match via lexicographic comparison)Notes for Reviewers
cb02472) followed by the completion commit (7294fd6). The partial commit shipped a working shell with known spec gaps; the completion commit closes those gaps (theV1–V5items flagged in internal review).filesource locations. Start there when reviewing semantics questions.compute_windowhelper insrc/evaluator/types/regex.rsis the load-bearing piece of V1 — it's where the 8192-byte cap, line-based counting, and CRLF handling all converge. The unit tests at the bottom of that file cover each code path explicitly.read_pattern_matchis the engine-facing entry point for pattern types;read_typed_value_with_patternis kept as a back-compat wrapper that collapsesNonetoValue::String(""). Future pattern-type work should go throughread_pattern_match.Closes #39