Implement normalization into canonical Formula with semantic checks by leynos · Pull Request #104 · leynos/weaver

leynos · 2026-04-11T00:06:16Z

Summary

Introduces a canonical Formula model in sempai-core and a normalization pipeline to unify legacy pattern* syntax and v2 match syntax into a single representation in sempai_core.
Adds semantic validation for invalid constructs (e.g. InvalidNotInOr, MissingPositiveTermInAnd).
Replaces the NOT_IMPLEMENTED placeholder in Engine::compile_yaml with real normalization, validation, and QueryPlan construction.
Aligns tooling, tests, and docs toward a live normalization path across core, YAML parsing, and engine layers.

Changes

Core modelling

Add canonical Formula types in sempai-core:
- Formula enum with variants: Atom, Not, Inside, Anywhere, And, Or.
- Atom enum: Pattern, Regex, TreeSitterQuery.
- Lightweight leaf wrappers: PatternAtom, RegexAtom, TreeSitterQueryAtom.
- Decorated<T> wrapper carrying where_clauses, as_name, fix, and span.
- WhereClause as an opaque container (initially storing a raw JSON value).
Expose and re-export new types from public API (e.g. pub mod formula; and re-export from sempai-core lib).

Normalization layer

Introduce normalization entry points to convert both legacy formulas and v2 match formulas into Decorated<Formula>:
- Legacy normalization maps pattern, pattern-regex, patterns, pattern-either, pattern-inside, etc. to canonical Formula structures.
- v2 normalization maps match forms (all, any, not, inside, anywhere, etc.) to canonical shapes.
- Attach WhereClauses to the enclosing Decorated<Formula> for constraints carried over from legacy forms.
- Propagate Decorated metadata (where, as, fix) during normalization.

Semantic validation

Add semantic checks for canonical formulas:
- InvalidNotInOr: reject negated branches inside disjunctions.
- MissingPositiveTermInAnd: require at least one positive term in an And.
Implement recursive validation across nested formula trees and emit deterministic diagnostics (codes like E_SEMPAI_INVALID_NOT_IN_OR, E_SEMPAI_MISSING_POSITIVE_TERM_IN_AND).

Engine integration

Wire normalization into Engine::compile_yaml:
1. Parse rules as before.
2. For each rule, normalize its principal into Decorated<Formula>.
3. Run semantic validation on the normalized formula.
4. Build and return a QueryPlan containing the normalized formula and rule metadata.
Update QueryPlan to store the Decorated<Formula> instead of the prior placeholder.

Testing

Add unit tests in:
- sempai-core for construction and equality of Formula variants.
- sempai for normalization and semantic validation scenarios.
Extend/align BDD tests to cover happy paths, negative paths (invalid-not-in-or, missing-positive-term-in-and), and parity between legacy and v2 fixtures.

Documentation

Documentation updates to reflect normalization as the live path:
- docs/sempai-query-language-design.md (note canonical Formula model and WhereClause handling).
- docs/users-guide.md (update to reflect compile_yaml returning compiled plans for valid rules).
- docs/roadmap.md (mark 4.1.5 as done).

Validation and acceptance criteria

Engine::compile_yaml returns QueryPlans with normalized Decorated<Formula> for valid rules (no more NOT_IMPLEMENTED).
Legacy and v2 fixtures normalize to structurally equivalent Formula values.
Semantic diagnostics are emitted for invalid forms (e.g. invalid not-in-or, missing positive term).
Tests pass for new normalization and semantic checks; existing tests adapt to new behavior where necessary.
Documentation and roadmap reflect the completed milestone.

How to test locally

Build docs and run checks:
- cargo doc -p sempai_core
- make lint
- make check-fmt
- make markdownlint
Run tests:
- cargo test -p sempai_core
- cargo test -p sempai
- cargo test -p sempai_yaml
Exercise engine path:
- cargo test -p sempai

Notes on compatibility and scope

Public API changes are limited to internal canonical-form representations; the existing sempai_yaml::parse_rule_file signature remains unchanged.
The new modules are designed to be incremental (Stage B–D) with Stage E–H focused on wiring, testing, and documentation.
New functionality should be covered by targeted tests and BDD scenarios before release.

📎 Task: https://www.devboxer.com/task/795de59f-09d8-463e-9ef9-350b92fa57be

This adds a new normalization step that converts legacy 'pattern*' and v2 'match' syntax into a unified canonical 'Formula' model defined in sempai_core. It includes semantic validation for structural constraints like InvalidNotInOr and MissingPositiveTermInAnd, replacing the previous NOT_IMPLEMENTED stub in Engine::compile_yaml with real normalized query plans. Legacy and v2 rules now produce structurally equivalent formulas, and invalid formulas emit deterministic diagnostics. Additional changes include new formula-related types and validation logic in sempai_core and wiring normalization into the compile pipeline. Extensive tests and documentation updates accompany this change to ensure correctness and guide users. No breaking API changes or new external dependencies introduced. Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>

coderabbitai · 2026-04-11T00:06:26Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7eb5caf1-b51b-4f68-a175-4300b18f0c6c

📥 Commits

Reviewing files that changed from the base of the PR and between ed1c94e and 5f21759.

📒 Files selected for processing (1)

crates/sempai/src/tests/engine_tests.rs

Canonical Formula Model with Normalization and Semantic Validation

This PR implements a canonical Formula model, an infallible normalization pipeline that unifies legacy pattern* and v2 match syntaxes into a single Decorated representation, and deterministic semantic checks. See design/execplan: docs/execplans/4-1-5-normalization-into-canonical-formula-model.md (new).

Core Changes

crates/sempai-core
- New module crates/sempai-core/src/formula.rs exporting the canonical model:
  - Formula enum: Atom, Not, Inside, Anywhere, And, Or.
  - Atom enum: Pattern, Regex, TreeSitterQuery with leaf structs PatternAtom, RegexAtom, TreeSitterQueryAtom.
  - Generic Decorated wrapper (where_clauses, as_name, fix, span).
  - WhereClause as an opaque container holding raw serde_json::Value.
- crates/sempai-core/src/lib.rs: exposes pub mod formula.
- crates/sempai-core/Cargo.toml: added serde_json as a regular dependency (it remains in dev-dependencies too).
Normalization (crates/sempai/src/normalize.rs)
- Added pub(crate) fn normalize_search_principal(principal: &SearchQueryPrincipal, rule_span: Option<&SourceSpan>) -> Decorated.
- Lowers legacy pattern/regex/constraint/negation constructs and v2 match forms into Decorated, attaches legacy patterns constraints as WhereClause on enclosing Decorated nodes, and propagates v2 decoration metadata (where_clauses, as_name, fix, span).
- Includes helper bare(...) to reduce duplication and a compatibility path mapping r2c-internal-project-depends-on to a degenerate Tree-sitter query atom.
Semantic validation (crates/sempai/src/semantic_check.rs)
- Added pub(crate) fn validate_formula(formula: &Decorated) -> Result<(), DiagnosticReport>.
- Enforces:
  - E_SEMPAI_INVALID_NOT_IN_OR: disallow Not directly under Or (anchored to offending span when available).
  - E_SEMPAI_MISSING_POSITIVE_TERM_IN_AND: require at least one positive term in And (anchored to branch or And span).
- Validation is recursive and deterministic.
Engine integration (crates/sempai/src/engine.rs)
- QueryPlan now stores formula: Decorated (replacing prior placeholder).
- QueryPlan::new updated (crate-visible) to accept the formula; added accessor QueryPlan::formula().
- Engine::compile_yaml now parses → normalizes → validates → constructs per-language QueryPlan(s) for search principals; unsupported languages produce schema diagnostics. Compatibility-only principals are normalized rather than short-circuited to NOT_IMPLEMENTED.

Tests and Behaviour

Added/updated tests:
- Normalization tests: crates/sempai/src/tests/normalization_tests.rs covering legacy ↔ v2 parity, unary/branch forms, constraint propagation, decoration preservation and span retention.
- Semantic validation tests: crates/sempai/src/tests/semantic_validation_tests.rs covering acceptance and the two failure modes.
- Engine and behaviour tests updated to use new QueryPlan constructor/signature and to assert compiled plans for valid search rules.
- BDD feature updated to assert successful compilation and plan counts for relevant fixtures.
Review feedback identified test and implementation duplication; small refactors and helpers (normalize_legacy/normalize_v2, assert_compile_yaml_semantic_error, bare()) were proposed and a test refactor commit was recorded to DRY engine tests.

Documentation

New execplan: docs/execplans/4-1-5-normalization-into-canonical-formula-model.md (added).
Updated docs/sempai-query-language-design.md and docs/users-guide.md to describe normalization location (crates/sempai/src/normalize.rs), semantic checks (crates/sempai/src/semantic_check.rs), diagnostic codes, and updated compile_yaml behaviour.
docs/roadmap.md: milestone 4.1.5 marked done.

Public API Surface and Compatibility

Public additions: sempai_core::formula module and types (Formula, Decorated, Atom variants) exported from sempai-core.
QueryPlan constructor signature change is crate-visible (no public API break for external crates); QueryPlan::formula() accessor added.
Normalization and validation functions remain pub(crate); parse_rule_file signature unchanged.
Behavioural change visible to users: Engine::compile_yaml returns Ok(Vec) for valid search rules (previously held a NOT_IMPLEMENTED placeholder path for some compatibility-only rules).

Diagnostics

Deterministic diagnostic codes and semantics:
- E_SEMPAI_INVALID_NOT_IN_OR — Not directly under Or (primary span when available).
- E_SEMPAI_MISSING_POSITIVE_TERM_IN_AND — And without any positive term (primary span when available).

For Reviewers

Focus review on canonical model correctness, normalization semantics (legacy → canonical and v2 → canonical), semantic-check recursion and positive-term logic and span anchoring, engine wiring in compile_yaml, and test coverage/parity with legacy fixtures. Relevant design/execplan: docs/execplans/4-1-5-normalization-into-canonical-formula-model.md.

Walkthrough

Introduce a public canonical Formula AST and Decorated wrapper; add normalization from legacy pattern* and v2 match principals into Decorated<Formula>; add semantic validation; and wire normalization/validation into Engine::compile_yaml so it returns per-language QueryPlan(s) instead of the previous NOT_IMPLEMENTED path. (50 words)

Changes

Cohort / File(s)	Summary
Core Formula Model `crates/sempai-core/Cargo.toml`, `crates/sempai-core/src/formula.rs`, `crates/sempai-core/src/lib.rs`	Add `serde_json` dependency; introduce public canonical `Formula` AST, `Atom` variants (`Pattern`/`Regex`/`TreeSitterQuery`), leaf atom structs, `Decorated<T>` wrapper with metadata, and `WhereClause`; expose `formula` module publicly.
Normalisation Module `crates/sempai/src/normalize.rs`	Add `normalize_search_principal` to lower legacy `pattern*` and v2 `match` principals into `Decorated<Formula>`, map combinators/unaries, propagate where-clauses and metadata, and represent `ProjectDependsOn` as a degenerate Tree-sitter atom.
Semantic Validation `crates/sempai/src/semantic_check.rs`	Add `validate_formula` enforcing: disallow `Not` directly in an `Or` (`ESempaiInvalidNotInOr`) and require at least one positive term per `And` (`ESempaiMissingPositiveTermInAnd`); produce diagnostics anchored to spans.
Engine Integration `crates/sempai/src/engine.rs`, `crates/sempai/src/lib.rs`	Change `QueryPlan` to store `Decorated<Formula>`; update constructor and add `formula()` accessor; implement `Engine::compile_yaml` to parse YAML, normalise search principals, validate formulas and build per-language `QueryPlan`s; unsupported languages produce schema diagnostics instead of NOT_IMPLEMENTED.
Tests `crates/sempai/src/tests/...` `crates/sempai/src/tests/normalization_tests.rs`, `crates/sempai/src/tests/semantic_validation_tests.rs`, `crates/sempai/src/tests/engine_tests.rs`, `crates/sempai/src/tests/behaviour.rs`, `crates/sempai/src/tests/mod.rs`	Add comprehensive normalisation and semantic-validation tests; update behaviour and engine tests to expect successful compilation and to construct `QueryPlan::new` with a `Decorated<Formula>` dummy where needed; assert new diagnostic codes and plan counts.
Documentation & Roadmap `docs/execplans/4-1-5-normalization-into-canonical-formula-model.md`, `docs/roadmap.md`, `docs/sempai-query-language-design.md`, `docs/users-guide.md`	Add execplan for milestone 4.1.5 and mark it complete; document where normalisation and validation live, list validation error codes, describe `r2c-internal-project-depends-on` normalisation, and update user guide to reflect compile_yaml returning QueryPlan(s).

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant Engine as Engine
    participant Parser as Parser
    participant Normaliser as Normaliser
    participant Validator as Validator
    participant Compiler as Compiler
    participant QueryPlan as QueryPlan

    Client->>Engine: compile_yaml(yaml)
    Engine->>Parser: parse YAML -> Vec<Rule>
    Parser-->>Engine: rules with SearchQueryPrincipal

    loop For each search-mode Rule
        Engine->>Normaliser: normalize_search_principal(principal, span)
        Normaliser-->>Engine: Decorated<Formula>

        Engine->>Validator: validate_formula(Decorated<Formula>)
        alt Validation passes
            Validator-->>Engine: Ok(())
            Engine->>Compiler: compile_rule_plans(rule, formula)
            Compiler->>QueryPlan: new(rule_id, language, formula)
            QueryPlan-->>Compiler: QueryPlan instance
            Compiler-->>Engine: Vec(QueryPlan)
        else Validation fails
            Validator-->>Engine: Err(DiagnosticReport)
            Engine-->>Client: Err(DiagnosticReport)
        end
    end

    Engine-->>Client: Ok(Vec(QueryPlan))

Possibly related PRs

Add sempai_yaml: YAML parser front-end, engine wiring, and tests #85: Added YAML parsing and engine wiring with a NOT_IMPLEMENTED placeholder; this change replaces that placeholder by normalising into the canonical Formula and producing real QueryPlan objects.

Poem

🧩 Legacy patterns and v2 match weave tight,
Normalise and validate, send plans into flight,
Ban NOT inside OR, and demand a positive spark,
Engine compiles rules and lights up the dark,
Pipelines hum steady — the formula's alight! ✨

🚥 Pre-merge checks | ✅ 7 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Title check	⚠️ Warning	Title aligns with the main change (normalization into canonical Formula with semantic validation) but omits the roadmap reference (4.1.5) required by the title requirements.	Revise title to include roadmap reference: 'Implement normalization into canonical Formula with semantic checks (4.1.5)'
Developer Documentation	⚠️ Warning	No developers-guide.md exists to document new internal APIs (Formula, normalize.rs, semantic_check.rs), architectural boundaries, or the normalisation pipeline.	Create docs/developers-guide.md documenting the canonical Formula model, normalisation pipeline, semantic validation logic, module responsibilities, public API boundaries, and integration within the compilation flow.

✅ Passed checks (7 passed)

Check name	Status	Explanation
Description check	✅ Passed	Description thoroughly documents the summary, changes, validation criteria, and testing instructions, directly relating to the canonical Formula model and normalization pipeline introduced.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Testing	✅ Passed	All new functionality is comprehensively covered by unit tests (16 normalisation tests, 7 validation tests), integration tests, and 9 BDD scenarios exercising legacy/v2 syntax variants, constraint handling, and end-to-end compilation workflows.
User-Facing Documentation	✅ Passed	The docs/users-guide.md file comprehensively documents all new and changed user-facing functionality including the updated compile_yaml() behaviour, semantic validation, and QueryPlan returns.
Module-Level Documentation	✅ Passed	All three new modules contain comprehensive module-level docstrings with overviews and examples. All publicly exported types and functions are fully documented with descriptions and error details.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch implement-normalization-canonical-formula-6op9mx

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

sourcery-ai · 2026-04-11T00:06:48Z

Reviewer's Guide

Implements a canonical Formula model in sempai-core and wires a full normalization + semantic validation pipeline from YAML search rules through to Engine::compile_yaml, replacing the NOT_IMPLEMENTED placeholder with real QueryPlan construction and updating tests and docs accordingly.

Sequence diagram for Engine::compile_yaml normalization and planning

sequenceDiagram
    actor User
    participant Engine
    participant YamlParser as sempai_yaml_parse_rule_file
    participant ModeValidator as validate_supported_modes
    participant Normalizer as normalize_search_principal
    participant Semantic as validate_formula
    participant Planner as compile_rule_plans

    User->>Engine: compile_yaml(yaml)
    Engine->>YamlParser: parse_rule_file(yaml, None)
    YamlParser-->>Engine: RuleFile or DiagnosticReport
    alt parse_error
        Engine-->>User: Err(DiagnosticReport)
    else parse_ok
        Engine->>ModeValidator: validate_supported_modes(file)
        ModeValidator-->>Engine: Ok or DiagnosticReport
        alt mode_error
            Engine-->>User: Err(DiagnosticReport)
        else modes_ok
            loop for each Rule in file.rules()
                Engine->>Engine: principal = rule.principal()
                alt principal_is_search
                    Engine->>Normalizer: normalize_search_principal(principal, rule.rule_span())
                    Normalizer-->>Engine: Decorated_Formula or DiagnosticReport
                    alt normalization_error
                        Engine-->>User: Err(DiagnosticReport)
                    else normalized
                        Engine->>Semantic: validate_formula(&formula)
                        Semantic-->>Engine: Ok or DiagnosticReport
                        alt semantic_error
                            Engine-->>User: Err(DiagnosticReport)
                        else validated
                            Engine->>Planner: compile_rule_plans(rule, &formula)
                            Planner-->>Engine: List~QueryPlan~ or DiagnosticReport
                            alt plan_error
                                Engine-->>User: Err(DiagnosticReport)
                            else plans_ok
                                Engine->>Engine: append plans to result
                            end
                        end
                    end
                else non_search_principal
                    Engine->>Engine: skip or unsupported mode handled earlier
                end
            end
            Engine-->>User: Ok(List~QueryPlan~)
        end
    end

Class diagram for canonical Formula model and decorators

classDiagram
    class Formula {
        +Atom atom
        +Not not
        +Inside inside
        +Anywhere anywhere
        +And and
        +Or or
    }

    class Atom {
        +Pattern pattern
        +Regex regex
        +TreeSitterQuery tree_sitter_query
    }

    class PatternAtom {
        +String text
    }

    class RegexAtom {
        +String pattern
    }

    class TreeSitterQueryAtom {
        +String query
    }

    class Decorated_Formula {
        +Formula node
        +List~WhereClause~ where_clauses
        +Option~String~ as_name
        +Option~String~ fix
        +Option~SourceSpan~ span
    }

    class WhereClause {
        +Value raw
    }

    class SourceSpan

    Formula --> Atom : uses
    Atom --> PatternAtom : has
    Atom --> RegexAtom : has
    Atom --> TreeSitterQueryAtom : has
    Decorated_Formula --> Formula : wraps
    Decorated_Formula --> WhereClause : contains
    Decorated_Formula --> SourceSpan : optional

Class diagram for Engine normalization and QueryPlan

classDiagram
    class Engine {
        -EngineConfig config
        +compile_yaml(yaml : &str) Result~List~QueryPlan~~, DiagnosticReport~
    }

    class QueryPlan {
        -String rule_id
        -Language language
        -Decorated_Formula formula
        +new(rule_id : String, language : Language, formula : Decorated_Formula) QueryPlan
        +rule_id() String
        +language() Language
        +formula() &Decorated_Formula
    }

    class Decorated_Formula {
        +Formula node
        +List~WhereClause~ where_clauses
        +Option~String~ as_name
        +Option~String~ fix
        +Option~SourceSpan~ span
    }

    class Formula
    class WhereClause
    class SourceSpan
    class EngineConfig
    class DiagnosticReport
    class Language
    class Rule {
        +id() &str
        +languages() &List~String~
        +rule_span() Option~SourceSpan~
        +principal() &RulePrincipal
    }

    class RulePrincipal {
        +Search search
        +Other other
    }

    class SearchQueryPrincipal

    class NormalizeModule {
        +normalize_search_principal(principal : &SearchQueryPrincipal, rule_span : Option~&SourceSpan~) Result~Decorated_Formula, DiagnosticReport~
    }

    class SemanticCheckModule {
        +validate_formula(formula : &Decorated_Formula) Result~(), DiagnosticReport~
    }

    Engine --> EngineConfig : has
    Engine ..> QueryPlan : creates
    Engine ..> Rule : uses
    Engine ..> NormalizeModule : calls
    Engine ..> SemanticCheckModule : calls
    Engine ..> DiagnosticReport : returns

    QueryPlan --> Language : has
    QueryPlan --> Decorated_Formula : has

    Rule --> RulePrincipal : has
    RulePrincipal --> SearchQueryPrincipal : may_wrap

    NormalizeModule ..> Decorated_Formula : returns
    NormalizeModule ..> DiagnosticReport : error

    SemanticCheckModule ..> DiagnosticReport : error

File-Level Changes

Change	Details	Files
Introduce canonical Formula AST and decoration model in sempai-core for use by normalization and engine layers.	Add formula.rs defining Formula, Atom, PatternAtom, RegexAtom, TreeSitterQueryAtom, Decorated, and WhereClause types with serde_json-backed where clauses and optional spans. Expose the new formula module from sempai-core by adding a public mod and re-exporting it. Add serde_json as a dependency of sempai-core for WhereClause raw JSON storage.	`crates/sempai-core/src/formula.rs` `crates/sempai-core/src/lib.rs` `crates/sempai-core/Cargo.toml`
Implement normalization from legacy and v2 search principals into Decorated and semantic validation over the canonical formula tree, then wire this into Engine::compile_yaml and QueryPlan.	Add normalize.rs implementing normalize_search_principal plus helpers to map LegacyFormula/MatchFormula/ProjectDependsOn into Decorated, including handling of patterns, regexes, boolean combinators, context operators, constraints as WhereClause, and a degenerate dependency formula. Add semantic_check.rs with validate_formula enforcing InvalidNotInOr and MissingPositiveTermInAnd via recursive checks that emit structured DiagnosticReports with the appropriate DiagnosticCode and spans. Refactor Engine::compile_yaml to parse rules, validate modes, normalize each Search principal to Decorated, run validate_formula, and then build per-language QueryPlans via a new compile_rule_plans helper; replace the previous NOT_IMPLEMENTED diagnostic return path. Change QueryPlan to hold a Decorated instead of the unit _plan placeholder, add a formula() accessor, and update its constructor signature and all call sites (including tests and BDD world) to pass a dummy formula where needed. Import the new normalization and semantic validation modules from lib.rs so they are available within the sempai crate.	`crates/sempai/src/normalize.rs` `crates/sempai/src/semantic_check.rs` `crates/sempai/src/engine.rs` `crates/sempai/src/lib.rs` `crates/sempai/src/tests/behaviour.rs` `crates/sempai/src/tests/engine_tests.rs`
Add focused unit tests for normalization and semantic validation behaviour over the canonical Formula model.	Create normalization_tests.rs to assert that various LegacyFormula and MatchFormula inputs normalize into the expected Formula shapes (Atoms, And/Or, Not/Inside/Anywhere, compound legacy operators like pattern-not-inside and pattern-not-regex). Create semantic_validation_tests.rs to exercise validate_formula success and failure paths, including Or with Not branches, And without positive terms, and nested Or-in-And cases. Wire the new test modules into the sempai test suite via mod declarations in tests/mod.rs and adjust existing engine tests to use dummy Decorated instances when constructing QueryPlan. Add helper functions in tests (e.g., dummy_formula) to construct minimal Decorated required by the new QueryPlan API.	`crates/sempai/src/tests/normalization_tests.rs` `crates/sempai/src/tests/semantic_validation_tests.rs` `crates/sempai/src/tests/mod.rs` `crates/sempai/src/tests/engine_tests.rs`
Update documentation and roadmap to describe the live normalization path, semantic checks, and completed milestone, and add an execution plan document for this work.	Revise the users guide to describe compile_yaml as producing canonical Formula-based QueryPlans for valid search rules, detailing semantic diagnostics and behaviour for unsupported modes and project-depends-on rules. Add an implementation note to sempai-query-language-design.md explaining where normalization and semantic validation live in the codebase, how project-depends-on is represented, and how WhereClause is currently handled as opaque JSON. Mark roadmap milestone 4.1.5 as completed and add a new execplans/4-1-5-normalization-into-canonical-formula-model.md document laying out constraints, risks, stages, interfaces, and validation criteria for this normalization work. Ensure references in docs to NOT_IMPLEMENTED placeholders are updated to reflect the new behaviour and error modes.	`docs/users-guide.md` `docs/sempai-query-language-design.md` `docs/roadmap.md` `docs/execplans/4-1-5-normalization-into-canonical-formula-model.md`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

…model - Introduce a canonical `Formula` enum in sempai-core for normalized queries. - Implement normalization from legacy (`pattern*`) and v2 (`match`) syntaxes into the shared formula model. - Add semantic validation for formulas enforcing constraints on `Or` and `And` compositions. - Update engine to compile normalized formulas into query plans per language. - Normalize compatibility-only `r2c-internal-project-depends-on` rules into a degenerate formula. - Enhance diagnostics with semantic validation errors and maintain accurate error spans. - Add extensive tests covering normalization and semantic validation. - Update documentation to reflect the new normalization and validation implementation. Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>

leynos · 2026-04-13T08:52:30Z

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix:

crates/sempai/src/tests/normalization_tests.rs

Comment on file

fn legacy_pattern_normalizes_to_atom() {
    let legacy = LegacyFormula::Pattern(String::from("foo($X)"));
    let principal = SearchQueryPrincipal::Legacy(legacy);
    let result = normalize_search_principal(&principal, None).expect("should normalize");
    assert_eq!(
        result.node,
        Formula::Atom(Atom::Pattern(PatternAtom {
            text: String::from("foo($X)")
        }))
    );
}

❌ New issue: Code Duplication
The module contains 6 functions with similar structure: legacy_pattern_normalizes_to_atom,legacy_pattern_regex_normalizes_to_regex_atom,v2_all_normalizes_to_and,v2_any_normalizes_to_or and 2 more functions

leynos · 2026-04-13T08:52:41Z

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix:

crates/sempai/src/normalize.rs

Comment on file

fn normalize_match(
    formula: &MatchFormula,
    fallback_span: Option<SourceSpan>,
) -> Decorated<Formula> {
    match formula {
        MatchFormula::Pattern(text) | MatchFormula::PatternObject(text) => Decorated {
            node: Formula::Atom(Atom::Pattern(PatternAtom { text: text.clone() })),
            where_clauses: vec![],
            as_name: None,
            fix: None,
            span: fallback_span,
        },
        MatchFormula::Regex(pattern) => Decorated {
            node: Formula::Atom(Atom::Regex(RegexAtom {
                pattern: pattern.clone(),
            })),
            where_clauses: vec![],
            as_name: None,
            fix: None,
            span: fallback_span,
        },
        MatchFormula::All(branches) => {
            let normalized_branches = branches
                .iter()
                .map(|branch| normalize_match(branch, fallback_span.clone()))
                .collect();
            Decorated {
                node: Formula::And(normalized_branches),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        MatchFormula::Any(branches) => {
            let normalized_branches = branches
                .iter()
                .map(|branch| normalize_match(branch, fallback_span.clone()))
                .collect();
            Decorated {
                node: Formula::Or(normalized_branches),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        MatchFormula::Not(inner) => {
            let normalized_inner = normalize_match(inner, fallback_span.clone());
            Decorated {
                node: Formula::Not(Box::new(normalized_inner)),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        MatchFormula::Inside(inner) => {
            let normalized_inner = normalize_match(inner, fallback_span.clone());
            Decorated {
                node: Formula::Inside(Box::new(normalized_inner)),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        MatchFormula::Anywhere(inner) => {
            let normalized_inner = normalize_match(inner, fallback_span.clone());
            Decorated {
                node: Formula::Anywhere(Box::new(normalized_inner)),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        MatchFormula::Decorated {
            formula: inner_formula,
            where_clauses: raw_where,
            as_name,
            fix,
        } => {
            let mut normalized = normalize_match(inner_formula, fallback_span);
            normalized.where_clauses = raw_where
                .iter()
                .map(|raw| WhereClause { raw: raw.clone() })
                .collect();
            normalized.as_name.clone_from(as_name);
            normalized.fix.clone_from(fix);
            normalized
        }
    }
}

❌ New issue: Large Method
normalize_match has 94 lines, threshold = 70

leynos · 2026-04-13T08:52:56Z

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix:

crates/sempai/src/normalize.rs

Comment on file

fn normalize_legacy(
    formula: &LegacyFormula,
    fallback_span: Option<SourceSpan>,
) -> Decorated<Formula> {
    match formula {
        LegacyFormula::Pattern(text) => Decorated {
            node: Formula::Atom(Atom::Pattern(PatternAtom { text: text.clone() })),
            where_clauses: vec![],
            as_name: None,
            fix: None,
            span: fallback_span,
        },
        LegacyFormula::PatternRegex(pattern) => Decorated {
            node: Formula::Atom(Atom::Regex(RegexAtom {
                pattern: pattern.clone(),
            })),
            where_clauses: vec![],
            as_name: None,
            fix: None,
            span: fallback_span,
        },
        LegacyFormula::Patterns(clauses) => normalize_legacy_patterns(clauses, fallback_span),
        LegacyFormula::PatternEither(branches) => {
            let normalized_branches = branches
                .iter()
                .map(|branch| normalize_legacy(branch, fallback_span.clone()))
                .collect();
            Decorated {
                node: Formula::Or(normalized_branches),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        LegacyFormula::PatternNot(value) => {
            let inner = normalize_legacy_value(value, fallback_span.clone());
            Decorated {
                node: Formula::Not(Box::new(inner)),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        LegacyFormula::PatternInside(value) => {
            let inner = normalize_legacy_value(value, fallback_span.clone());
            Decorated {
                node: Formula::Inside(Box::new(inner)),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        LegacyFormula::PatternNotInside(value) => {
            let inner = normalize_legacy_value(value, fallback_span.clone());
            let inside = Decorated {
                node: Formula::Inside(Box::new(inner)),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span.clone(),
            };
            Decorated {
                node: Formula::Not(Box::new(inside)),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        LegacyFormula::PatternNotRegex(pattern) => {
            let regex_atom = Decorated {
                node: Formula::Atom(Atom::Regex(RegexAtom {
                    pattern: pattern.clone(),
                })),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span.clone(),
            };
            Decorated {
                node: Formula::Not(Box::new(regex_atom)),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        LegacyFormula::Anywhere(value) => {
            let inner = normalize_legacy_value(value, fallback_span.clone());
            Decorated {
                node: Formula::Anywhere(Box::new(inner)),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
    }
}

❌ New issue: Large Method
normalize_legacy has 102 lines, threshold = 70

…n length Add a private `bare()` helper function to construct Decorated nodes without metadata, replacing repeated struct literals throughout normalize.rs. This reduces `normalize_legacy` from 102 to 51 lines and `normalize_match` from 94 to 58 lines, bringing both under the 70-line clippy threshold. Changes: - Add `bare()` helper function with clear documentation - Replace all inline Decorated struct literals in normalize_legacy - Replace all inline Decorated struct literals in normalize_match - Apply bare() to normalize_legacy_value and normalize_dependency_principal - Remove #[expect(clippy::too_many_lines)] attributes no longer needed - Add #[expect(clippy::missing_const_for_fn)] to bare() (Vec requires runtime) - Fix import ordering in engine.rs per cargo fmt No logic changes. All tests pass. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Introduce helper functions to reduce repetition and improve code quality: normalize.rs changes: - Add normalize_unary() helper for Not/Inside/Anywhere arms - Add normalize_branches() helper for All/Any arms - Refactor normalize_match to use helpers, reducing from 58 to 43 lines - All helper functions properly handle Option<SourceSpan> ergonomics normalization_tests.rs changes: - Add normalize_legacy() and normalize_v2() test helpers - Eliminate repeated 3-line boilerplate from all 11 tests - Replace verbose match/panic! assertions with concise matches! macro - Reduce file from 160 to 141 lines No logic changes. All 33 tests pass. Clippy and formatting checks pass. Addresses CodeScene Large Method and code duplication violations. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

sourcery-ai

Hey - I've found 9 issues, and left some high level feedback:

In semantic_check.rs, the MissingPositiveTermInAnd diagnostic always uses the ambient span rather than the most specific available branch span, which makes error locations less precise than they could be; consider preferring the And node’s own span or one of its children’s spans when available.
In normalize_search_principal and the normalization helpers you currently never return an error but still wrap results in Result with #[expect(clippy::unnecessary_wraps)]; it would be cleaner either to introduce at least one real normalization error path now (e.g., for obviously malformed payloads) or to keep the API simple and return the bare Decorated<Formula> until you actually need fallible normalization.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- In `semantic_check.rs`, the `MissingPositiveTermInAnd` diagnostic always uses the ambient `span` rather than the most specific available branch span, which makes error locations less precise than they could be; consider preferring the `And` node’s own span or one of its children’s spans when available.
- In `normalize_search_principal` and the normalization helpers you currently never return an error but still wrap results in `Result` with `#[expect(clippy::unnecessary_wraps)]`; it would be cleaner either to introduce at least one real normalization error path now (e.g., for obviously malformed payloads) or to keep the API simple and return the bare `Decorated<Formula>` until you actually need fallible normalization.

## Individual Comments

### Comment 1
<location path="crates/sempai-core/src/formula.rs" line_range="40-49" />
<code_context>
+///     raw: json!({"metavariable-regex": {"metavariable": "$X", "regex": "foo.*"}}),
+/// };
+/// ```
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct WhereClause {
+    /// The raw JSON value of the constraint.
+    pub raw: Value,
</code_context>
<issue_to_address>
**issue (bug_risk):** Deriving `Eq` for `WhereClause` will fail because `serde_json::Value` does not implement `Eq`.

Because `serde_json::Value` only implements `PartialEq`, `#[derive(Eq)]` on `WhereClause` will not compile. Please either remove `Eq` (and any `WhereClause: Eq` bounds such as on `Decorated<T>`) or switch `raw` to a type that can soundly implement `Eq`.
</issue_to_address>

### Comment 2
<location path="crates/sempai/src/tests/engine_tests.rs" line_range="68" />
<code_context>

 #[test]
-fn compile_yaml_returns_not_implemented_after_successful_parse() {
+fn compile_yaml_normalizes_and_returns_query_plans() {
     let engine = default_engine();
     let result = engine.compile_yaml(
</code_context>
<issue_to_address>
**suggestion (testing):** Add tests that cover semantic validation errors surfaced through `compile_yaml` (e.g. invalid-not-in-or, missing-positive-term-in-and).

Current `compile_yaml_*` tests only cover successful cases. Since `compile_yaml` now calls `validate_formula`, please add tests that build YAML rules which normalize to:

- an `Or` with a negated branch, and
- an `And` with only constraint-style terms (`not`, `inside`, `anywhere`) and no positive term.

These should assert that `compile_yaml` returns an `Err`, and that the first diagnostic codes are `ESempaiInvalidNotInOr` and `ESempaiMissingPositiveTermInAnd`, to confirm semantic errors are exposed via the public API.

Suggested implementation:

```rust
#[test]
fn compile_yaml_normalizes_and_returns_query_plans() {
    let engine = default_engine();
    let result = engine.compile_yaml(
        "rules:\n  - id: demo.rule\n    message: oops\n    languages: [rust]\n    severity: ERROR\n    pattern: foo($X)\n",
    );
    let plans = result.expect("should successfully compile");
    assert_eq!(plans.len(), 1);
    let plan = plans.first().expect("should have first plan");
    assert_eq!(plan.rule_id(), "demo.rule");
    assert_eq!(plan.language(), Language::Rust);
}

#[test]
fn compile_yaml_returns_semantic_error_for_invalid_not_in_or() {
    let engine = default_engine();
    let result = engine.compile_yaml(
        "\
rules:
  - id: demo.invalid.not.in.or
    message: invalid not in or
    languages: [rust]
    severity: ERROR
    pattern: not foo($X) or bar($Y)
",
    );

    let error = result.expect_err("compile_yaml should surface semantic validation errors");
    let diagnostics = error.diagnostics();
    let first = diagnostics
        .first()
        .expect("semantic error should contain at least one diagnostic");
    assert_eq!(first.code(), "ESempaiInvalidNotInOr");
}

#[test]
fn compile_yaml_returns_semantic_error_for_missing_positive_term_in_and() {
    let engine = default_engine();
    let result = engine.compile_yaml(
        "\
rules:
  - id: demo.missing.positive.term.in.and
    message: missing positive term in and
    languages: [rust]
    severity: ERROR
    pattern: not foo($X) and inside(bar($Y))
",
    );

    let error = result.expect_err("compile_yaml should surface semantic validation errors");
    let diagnostics = error.diagnostics();
    let first = diagnostics
        .first()
        .expect("semantic error should contain at least one diagnostic");
    assert_eq!(first.code(), "ESempaiMissingPositiveTermInAnd");
}

```

To integrate this with the existing codebase you may need to:
1. Adjust how diagnostics are accessed from the error returned by `compile_yaml`. For example, replace `error.diagnostics()` and `first.code()` with the actual methods/fields used in your error/diagnostic types (e.g. `error.diagnostics`, `error.diagnostics()[0].code`, or `first.code.to_string()`).
2. Confirm the exact string representation or enum formatting of the diagnostic codes. If they are enums instead of strings, change the assertions to match the enum variant (e.g. `assert_eq!(first.code, DiagnosticCode::ESempaiInvalidNotInOr)`).
3. If your pattern language requires a different syntax to produce an `Or` with a negated branch or an `And` with only constraint-style terms, tweak the `pattern:` strings in the YAML to align with how `validate_formula` actually interprets formulas in your project.
</issue_to_address>

### Comment 3
<location path="crates/sempai/src/tests/normalization_tests.rs" line_range="9" />
<code_context>
+use crate::normalize::normalize_search_principal;
+
+/// Helper to normalize a legacy formula and extract the node.
+fn normalize_legacy(formula: LegacyFormula) -> Formula {
+    let principal = SearchQueryPrincipal::Legacy(formula);
+    normalize_search_principal(&principal, None)
</code_context>
<issue_to_address>
**suggestion (testing):** Add normalization tests for `patterns` (legacy `LegacyFormula::Patterns`) including constraint mapping into `where_clauses`.

Existing legacy normalization tests only cover simple `pattern` / `pattern-regex` and some negated/inside variants; they don’t exercise `LegacyFormula::Patterns`, where constraints are attached as `where_clauses` on the enclosing `And`. Please add a test that builds a `LegacyFormula::Patterns` with both formula clauses and one or more constraint clauses, and asserts that:

- the normalized node is a `Formula::And` with the expected child formulas, and
- the `Decorated` wrapper holds the expected `where_clauses` while all child formulas have empty `where_clauses`.

This will verify the constraint propagation logic in `normalize_legacy_patterns`.

Suggested implementation:

```rust
use sempai_core::formula::{Atom, Decorated, Formula, PatternAtom, RegexAtom};

```

```rust
/// Helper to normalize a legacy formula and extract the node.
fn normalize_legacy(formula: LegacyFormula) -> Formula {
    let principal = SearchQueryPrincipal::Legacy(formula);
    normalize_search_principal(&principal, None)
        .expect("should normalize")
        .node
}

/// Helper to normalize a legacy formula and keep the decorated wrapper.
///
/// This is useful for tests that need to inspect `where_clauses` or other
/// metadata attached to the normalized formula.
fn normalize_legacy_decorated(formula: LegacyFormula) -> Decorated<Formula> {
    let principal = SearchQueryPrincipal::Legacy(formula);
    normalize_search_principal(&principal, None).expect("should normalize")
}

/// Helper to normalize a v2 match formula and extract the node.
fn normalize_v2(formula: MatchFormula) -> Formula {
    let principal = SearchQueryPrincipal::Match(formula);
    normalize_search_principal(&principal, None)
        .expect("should normalize")
        .node
}

/// Normalization of `LegacyFormula::Patterns` should:
/// - produce an outer `Formula::And` node combining the pattern formulas, and
/// - attach any constraints as `where_clauses` on the outer `Decorated` node,
///   while all child formulas have empty `where_clauses`.
#[test]
fn normalize_legacy_patterns_propagates_constraints_to_where_clauses() {
    // NOTE: The exact constructors for `LegacyFormula::Pattern` / constraint
    // variants should match the rest of the tests in this module. If the
    // field names differ, adjust accordingly.
    //
    // Two simple pattern formulas:
    let pattern_formula_1 = LegacyFormula::Pattern {
        // Example pattern string; align with existing tests if they use
        // different patterns or additional fields.
        pattern: "foo($x)".to_string(),
        // Fill in any additional fields that the real `Pattern` variant
        // expects (e.g. language, is_regexp, etc.).
        ..Default::default()
    };

    let pattern_formula_2 = LegacyFormula::Pattern {
        pattern: "bar($x)".to_string(),
        ..Default::default()
    };

    // One or more constraint clauses that should end up as `where_clauses`
    // on the enclosing `And` in the normalized representation.
    //
    // This assumes there is some constraint-like legacy variant; adjust the
    // variant / fields to whatever is used elsewhere in the codebase for
    // capture constraints.
    let constraint_formula = LegacyFormula::Constraint {
        // Example: `$x` is constrained to some value.
        // Replace these fields with the actual ones from the enum.
        variable: "x".to_string(),
        value: LegacyValue::String("constrained".to_string()),
    };

    let legacy = LegacyFormula::Patterns {
        // These are the pattern formulas that should become the children
        // of the normalized `Formula::And`.
        patterns: vec![pattern_formula_1, pattern_formula_2],
        // These are the constraints that `normalize_legacy_patterns` is
        // expected to map into `where_clauses` on the enclosing `And`.
        constraints: vec![constraint_formula],
    };

    let decorated = normalize_legacy_decorated(legacy);

    // Outer node must be an `And` combining the pattern formulas.
    let children = match &decorated.node {
        Formula::And(children) => children,
        other => panic!("expected normalized legacy Patterns to be And, got {:?}", other),
    };

    // We expect exactly the two pattern children we provided above.
    assert_eq!(
        children.len(),
        2,
        "expected two child formulas in normalized And for LegacyFormula::Patterns"
    );

    // All constraints should be attached as `where_clauses` on the outer
    // decorated node.
    assert!(
        !decorated.where_clauses.is_empty(),
        "expected where_clauses on outer Decorated<Formula> for LegacyFormula::Patterns constraints"
    );

    // Child formulas themselves must not carry any where_clauses; they should
    // all be attached to the enclosing `And`.
    for (idx, child) in children.iter().enumerate() {
        assert!(
            child.where_clauses.is_empty(),
            "expected child formula {} of Patterns And to have empty where_clauses",
            idx
        );
    }
}

```

The above test assumes specific shapes for the `LegacyFormula` variants and `Decorated` type that may differ slightly from your actual implementation. To integrate cleanly you should:

1. **Adjust the `LegacyFormula` constructors**:
   - Replace `LegacyFormula::Pattern { pattern, ..Default::default() }` with the exact syntax used in your existing legacy normalization tests (same fields, same variant name).
   - Replace `LegacyFormula::Constraint { variable, value }` with the actual constraint variant used for legacy pattern constraints (for example it might be something like `LegacyFormula::Where`, `LegacyFormula::CaptureConstraint`, or similar).
   - Update the `LegacyFormula::Patterns { patterns, constraints }` fields (`patterns`, `constraints`) to match the real field names for the `Patterns` variant.

2. **Update the `Decorated` type usage**:
   - If `Decorated` lives in a different module or is re-exported under a different path, update the import (`use sempai_core::formula::{Atom, Decorated, Formula, PatternAtom, RegexAtom};`) accordingly.
   - If the fields on `Decorated` are named differently (e.g. `node` vs `formula`, `where_clauses` vs `constraints`), adjust the helper `normalize_legacy_decorated` return type and the test field accesses to match.

3. **Ensure `Default::default()` is valid**:
   - If the `LegacyFormula::Pattern` variant does not support the struct-update `..Default::default()` (e.g. it’s not `#[derive(Default)]` or doesn’t use struct syntax), replace those with complete explicit construction as used in the existing tests in this file.

4. **Align where-clause type checks**:
   - If `where_clauses` is not a `Vec<_>` or has a different accessor, update the `assert!(!decorated.where_clauses.is_empty())` and child `where_clauses` assertions to match the real API.
</issue_to_address>

### Comment 4
<location path="crates/sempai/src/tests/normalization_tests.rs" line_range="80" />
<code_context>
+}
+
+#[test]
+fn v2_pattern_shorthand_normalizes_to_atom() {
+    let result = normalize_v2(MatchFormula::Pattern(String::from("bar($Y)")));
+    assert_eq!(
</code_context>
<issue_to_address>
**suggestion (testing):** Extend v2 normalization tests to cover `MatchFormula::Decorated` metadata (where/as/fix) and span propagation.

Current v2 tests only cover the structural mapping for `pattern`/`regex`/`all`/`any`/`not`/`inside`/`anywhere`, not the `MatchFormula::Decorated { .. }` variant. Since `normalize_match` maps `where` into `where_clauses` and copies `as_name`/`fix` onto `Decorated`, please add a test that builds a decorated formula with non-empty `where`/`as`/`fix` and asserts those fields are preserved. It would also be helpful to add a test that calls `normalize_search_principal` with a non-`None` `rule_span` and verifies the resulting `span` is set, to cover span propagation through normalization.

Suggested implementation:

```rust
#[test]
fn v2_decorated_metadata_is_preserved() {
    // Build a decorated match formula with non-empty where/as/fix metadata
    let inner = MatchFormula::Pattern(String::from("foo($X)"));
    let mf = MatchFormula::Decorated {
        formula: Box::new(inner),
        // Use a minimal where-clause that your existing types support
        // (e.g. WhereClause::Raw, WhereClause::Expr, etc.)
        where_clauses: vec![WhereClause::Raw(String::from("foo == bar"))],
        as_name: Some(String::from("my_capture")),
        fix: Some(String::from("apply_fix")),
        span: None,
    };

    let result = normalize_v2(mf);

    match result {
        Formula::Decorated(decorated) => {
            assert_eq!(decorated.as_name.as_deref(), Some("my_capture"));
            assert_eq!(decorated.fix.as_deref(), Some("apply_fix"));
            assert_eq!(decorated.where_clauses.len(), 1);
            // Adjust this assertion if WhereClause doesn't implement Display/to_string
            assert_eq!(decorated.where_clauses[0].to_string(), "foo == bar");
        }
        _ => panic!("expected Decorated formula"),
    }
}

#[test]
fn v2_span_is_propagated_from_search_principal() {
    // Build a simple principal with a pattern formula
    let principal = SearchPrincipal {
        formula: MatchFormula::Pattern(String::from("foo($X)")),
        span: None,
        // Add any additional required fields here to satisfy the struct definition.
    };

    // Use whatever span type/constructor your codebase exposes for rule spans
    let rule_span = Some(Span::from_offset(0, 10));
    let normalized = normalize_search_principal(rule_span.clone(), principal);

    assert_eq!(normalized.span, rule_span);
}

#[test]
fn legacy_pattern_regex_normalizes_to_regex_atom() {

```

Because only a fragment of the file and type definitions is visible, you will need to align the new tests with your existing types:

1. **WhereClause construction**
   - Replace `WhereClause::Raw(String::from("foo == bar"))` with the concrete variant/constructor your codebase uses to represent a simple `where` clause.  
   - If `WhereClause` does not implement `Display`, replace `decorated.where_clauses[0].to_string()` with a field-level assertion that matches how `normalize_match` stores the textual predicate (e.g. `decorated.where_clauses[0].expr.source` or similar).

2. **Formula / Decorated shape**
   - If your normalized type is something like `Formula::Decorated(DecoratedFormula { formula, where_clauses, as_name, fix, .. })`, update the pattern in the `match` arm to match the real struct/field names.  
   - If `normalize_v2` returns a wrapper (e.g. `NormalizedMatch`) instead of `Formula` directly, destructure that wrapper first and then match on its `formula` field.

3. **SearchPrincipal and span type**
   - Replace the `SearchPrincipal` literal with the real struct fields required in your codebase (e.g. `rule`, `language`, etc.).  
   - Replace `Span::from_offset(0, 10)` with the constructor or literal you normally use for rule spans (e.g. `Span { start: 0, end: 10 }`, `RuleSpan::new(0, 10)`, or `TextRange::new(...)`).  
   - Adjust the field name `span` on `SearchPrincipal` and `normalized` if your struct uses a different name (e.g. `rule_span` or `range`).

4. **Import paths**
   - Ensure the test module has `use` statements for `MatchFormula`, `Formula`, `WhereClause`, `SearchPrincipal`, `Span` (or your span type), `normalize_v2`, and `normalize_search_principal` from the correct modules.
</issue_to_address>

### Comment 5
<location path="crates/sempai/src/tests/normalization_tests.rs" line_range="61" />
<code_context>
+}
+
+#[test]
+fn legacy_pattern_not_inside_normalizes_to_not_inside() {
+    let result = normalize_legacy(LegacyFormula::PatternNotInside(Box::new(
+        LegacyValue::String(String::from("class Foo:")),
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding tests for legacy `pattern-either` and `semgrep-internal-pattern-anywhere` normalization.

Current legacy tests don’t cover `LegacyFormula::PatternEither` or `LegacyFormula::Anywhere` (for `semgrep-internal-pattern-anywhere`). Please add tests that
- verify `pattern-either: [...]` normalizes to `Formula::Or` with the correct children, and
- verify `semgrep-internal-pattern-anywhere` normalizes to `Formula::Anywhere` wrapping the child formula,
so the legacy-to-canonical mapping is fully exercised.

Suggested implementation:

```rust
#[test]
fn legacy_pattern_not_inside_normalizes_to_not_inside() {
    let result = normalize_legacy(LegacyFormula::PatternNotInside(Box::new(
        LegacyValue::String(String::from("class Foo:")),
    )));
    match result {
        Formula::Not(not_inner) => match not_inner.node {
            Formula::Inside(inside_inner) => match inside_inner.node {
                Formula::Atom(Atom::Pattern(pat)) => {
                    assert_eq!(pat.text, "class Foo:");
                }
                _ => panic!("expected Pattern atom"),
            },
            _ => panic!("expected Inside formula inside Not"),
        },
        _ => panic!("expected Not formula"),
    }
}

#[test]
fn legacy_pattern_either_normalizes_to_or() {
    let result = normalize_legacy(LegacyFormula::PatternEither(vec![
        LegacyFormula::Pattern(Box::new(LegacyValue::String(String::from("first pattern")))),
        LegacyFormula::Pattern(Box::new(LegacyValue::String(String::from("second pattern")))),
    ]));

    match result {
        Formula::Or(children) => {
            assert_eq!(children.len(), 2, "expected two children in Or formula");

            match &children[0].node {
                Formula::Atom(Atom::Pattern(pat)) => {
                    assert_eq!(pat.text, "first pattern");
                }
                _ => panic!("expected first child of Or to be a Pattern atom"),
            }

            match &children[1].node {
                Formula::Atom(Atom::Pattern(pat)) => {
                    assert_eq!(pat.text, "second pattern");
                }
                _ => panic!("expected second child of Or to be a Pattern atom"),
            }
        }
        _ => panic!("expected Or formula from legacy pattern-either"),
    }
}

#[test]
fn legacy_anywhere_normalizes_to_anywhere() {
    // This corresponds to legacy `semgrep-internal-pattern-anywhere` with a single pattern child.
    let result = normalize_legacy(LegacyFormula::Anywhere(Box::new(
        LegacyFormula::Pattern(Box::new(LegacyValue::String(String::from(
            "pattern anywhere",
        )))),
    )));

    match result {
        Formula::Anywhere(inner) => match inner.node {
            Formula::Atom(Atom::Pattern(pat)) => {
                assert_eq!(pat.text, "pattern anywhere");
            }
            _ => panic!("expected Pattern atom inside Anywhere"),
        },
        _ => panic!("expected Anywhere formula from semgrep-internal-pattern-anywhere"),

```

The above edits assume the following shapes, which you should confirm in your codebase and adjust if necessary:
1. `LegacyFormula::PatternEither` takes a `Vec<LegacyFormula>`. If it instead takes a `Vec<LegacyValue>` or a different wrapper type, wrap the elements accordingly in the test.
2. `LegacyFormula::Pattern` is constructed as `LegacyFormula::Pattern(Box<LegacyValue::String(...)>)`. If the constructor signature differs, update the test calls.
3. `LegacyFormula::Anywhere` is defined as wrapping a `Box<LegacyFormula>`. If it instead wraps a `LegacyValue` directly, change the argument to `Anywhere` to match.
4. `Formula::Or` is assumed to contain a `Vec<Spanned<Formula>>` (or similar) where you access the formula via `.node`. If its inner type is different, update the pattern match on `Formula::Or(children)` and indexing of `children[0]` / `children[1]` accordingly.
Once these shapes are confirmed and aligned, the tests will fully cover normalization from legacy `pattern-either` and `semgrep-internal-pattern-anywhere` to `Formula::Or` and `Formula::Anywhere`, respectively.
</issue_to_address>

### Comment 6
<location path="crates/sempai/src/tests/semantic_validation_tests.rs" line_range="21" />
<code_context>
+}
+
+#[test]
+fn valid_or_with_positive_branches_passes() {
+    let formula = Decorated {
+        node: Formula::Or(vec![make_pattern("foo"), make_pattern("bar")]),
</code_context>
<issue_to_address>
**suggestion (testing):** Add a sanity test that a single positive atom formula passes validation untouched.

Current tests cover `And`/`Or` combinations, but not the simplest case. Please add a test where the root is a single `Atom` (e.g. `make_pattern("foo")`) and assert that `validate_formula` returns `Ok(())` to guard against future regressions on trivially valid formulas.
</issue_to_address>

### Comment 7
<location path="docs/execplans/4-1-5-normalization-into-canonical-formula-model.md" line_range="168" />
<code_context>
+produces `LegacyFormula` and `MatchFormula` ASTs in `sempai_yaml`, which are
+structurally distinct but semantically equivalent for the subset they share.
+
+The normalisation step must lower both representations into the design
+document's canonical `Formula` enum, then run semantic constraint checks before
+constructing a `QueryPlan`.
</code_context>
<issue_to_address>
**nitpick (typo):** Align spelling of "normalize/normalisation" within this document.

This section uses "normalisation" while earlier parts (including the title) use "normalize/normalization". Please choose one variant (likely the one from the title) and use it consistently in this file.

Suggested implementation:

```
The normalization step must lower both representations into the design

```

Search the rest of `docs/execplans/4-1-5-normalization-into-canonical-formula-model.md` for other variants like `normalisation`, `normalise`, and replace them with `normalization`, `normalize` respectively to keep spelling consistent with the title and earlier sections.
</issue_to_address>

### Comment 8
<location path="crates/sempai/src/normalize.rs" line_range="1" />
<code_context>
+//! Normalization of parsed query syntax into canonical formula model.
+//!
+//! This module provides the transformation from parsed
</code_context>
<issue_to_address>
**issue (review_instructions):** Add unit and behavioural tests for the normalization pipeline; the new normalization logic is currently untested.

This module implements the core normalization logic from `SearchQueryPrincipal` to the canonical `Formula` model, including legacy/v2 mappings and special handling for `r2c-internal-project-depends-on`. There are no unit tests in `crates/sempai/src/tests/normalization_tests.rs` (the file is empty in this diff) and no visible BDD coverage for the happy, unhappy, and edge paths you describe in the ExecPlan. Add unit tests that cover all branches of `normalize_legacy`, `normalize_legacy_patterns`, `normalize_match`, and `normalize_dependency_principal`, and add `rstest-bdd` feature scenarios that exercise normalization end-to-end via the public engine API.

<details>
<summary>Review instructions:</summary>

**Path patterns:** `**/*`

**Instructions:**
- For any new feature or change to an existing feature, both behavioural *and* unit tests are required.
- Tests must use `rstest-bdd` v0.5.0 for BDD scenarios covering happy, unhappy, and edge paths.

</details>
</issue_to_address>

### Comment 9
<location path="docs/execplans/4-1-5-normalization-into-canonical-formula-model.md" line_range="163" />
<code_context>
+
+## Context and orientation
+
+The Sempai query pipeline currently stops at a `NOT_IMPLEMENTED` placeholder
+after successfully parsing and mode-validating a YAML rule file.  The parser
+produces `LegacyFormula` and `MatchFormula` ASTs in `sempai_yaml`, which are
</code_context>
<issue_to_address>
**suggestion (review_instructions):** ASTs are referenced later in this section without defining the acronym.

In the paragraph beginning "The parser produces `LegacyFormula` and `MatchFormula` ASTs", please expand AST on first use, for example "abstract syntax trees (ASTs)".

<details>
<summary>Review instructions:</summary>

**Path patterns:** `**/*.md`

**Instructions:**
Define uncommon acronyms on first use.

</details>
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2026-04-22T08:58:37Z

+#[derive(Debug, Clone, PartialEq, Eq)]
+pub enum Formula {
+    /// A leaf pattern or regex atom.
+    Atom(Atom),
+    /// Negation: the inner formula must not match.
+    Not(Box<Decorated<Formula>>),
+    /// Context constraint: the anchor must be inside a match of the inner.
+    Inside(Box<Decorated<Formula>>),
+    /// Context constraint: the inner must match somewhere in scope.
+    Anywhere(Box<Decorated<Formula>>),


issue (bug_risk): Deriving Eq for WhereClause will fail because serde_json::Value does not implement Eq.

Because serde_json::Value only implements PartialEq, #[derive(Eq)] on WhereClause will not compile. Please either remove Eq (and any WhereClause: Eq bounds such as on Decorated<T>) or switch raw to a type that can soundly implement Eq.

sourcery-ai · 2026-04-22T08:58:37Z


 #[test]
-fn compile_yaml_returns_not_implemented_after_successful_parse() {
+fn compile_yaml_normalizes_and_returns_query_plans() {


suggestion (testing): Add tests that cover semantic validation errors surfaced through compile_yaml (e.g. invalid-not-in-or, missing-positive-term-in-and).

Current compile_yaml_* tests only cover successful cases. Since compile_yaml now calls validate_formula, please add tests that build YAML rules which normalize to:

an Or with a negated branch, and

an And with only constraint-style terms (not, inside, anywhere) and no positive term.

These should assert that compile_yaml returns an Err, and that the first diagnostic codes are ESempaiInvalidNotInOr and ESempaiMissingPositiveTermInAnd, to confirm semantic errors are exposed via the public API.

Suggested implementation:

#[test] fn compile_yaml_normalizes_and_returns_query_plans() { let engine = default_engine(); let result = engine.compile_yaml( "rules:\n - id: demo.rule\n message: oops\n languages: [rust]\n severity: ERROR\n pattern: foo($X)\n", ); let plans = result.expect("should successfully compile"); assert_eq!(plans.len(), 1); let plan = plans.first().expect("should have first plan"); assert_eq!(plan.rule_id(), "demo.rule"); assert_eq!(plan.language(), Language::Rust); } #[test] fn compile_yaml_returns_semantic_error_for_invalid_not_in_or() { let engine = default_engine(); let result = engine.compile_yaml( "\ rules: - id: demo.invalid.not.in.or message: invalid not in or languages: [rust] severity: ERROR pattern: not foo($X) or bar($Y) ", ); let error = result.expect_err("compile_yaml should surface semantic validation errors"); let diagnostics = error.diagnostics(); let first = diagnostics .first() .expect("semantic error should contain at least one diagnostic"); assert_eq!(first.code(), "ESempaiInvalidNotInOr"); } #[test] fn compile_yaml_returns_semantic_error_for_missing_positive_term_in_and() { let engine = default_engine(); let result = engine.compile_yaml( "\ rules: - id: demo.missing.positive.term.in.and message: missing positive term in and languages: [rust] severity: ERROR pattern: not foo($X) and inside(bar($Y)) ", ); let error = result.expect_err("compile_yaml should surface semantic validation errors"); let diagnostics = error.diagnostics(); let first = diagnostics .first() .expect("semantic error should contain at least one diagnostic"); assert_eq!(first.code(), "ESempaiMissingPositiveTermInAnd"); }

To integrate this with the existing codebase you may need to:

Adjust how diagnostics are accessed from the error returned by compile_yaml. For example, replace error.diagnostics() and first.code() with the actual methods/fields used in your error/diagnostic types (e.g. error.diagnostics, error.diagnostics()[0].code, or first.code.to_string()).

Confirm the exact string representation or enum formatting of the diagnostic codes. If they are enums instead of strings, change the assertions to match the enum variant (e.g. assert_eq!(first.code, DiagnosticCode::ESempaiInvalidNotInOr)).

If your pattern language requires a different syntax to produce an Or with a negated branch or an And with only constraint-style terms, tweak the pattern: strings in the YAML to align with how validate_formula actually interprets formulas in your project.

sourcery-ai · 2026-04-22T08:58:37Z

+use crate::normalize::normalize_search_principal;
+
+/// Helper to normalize a legacy formula and extract the node.
+fn normalize_legacy(formula: LegacyFormula) -> Formula {


suggestion (testing): Add normalization tests for patterns (legacy LegacyFormula::Patterns) including constraint mapping into where_clauses.

Existing legacy normalization tests only cover simple pattern / pattern-regex and some negated/inside variants; they don’t exercise LegacyFormula::Patterns, where constraints are attached as where_clauses on the enclosing And. Please add a test that builds a LegacyFormula::Patterns with both formula clauses and one or more constraint clauses, and asserts that:

the normalized node is a Formula::And with the expected child formulas, and

the Decorated wrapper holds the expected where_clauses while all child formulas have empty where_clauses.

This will verify the constraint propagation logic in normalize_legacy_patterns.

Suggested implementation:

use sempai_core::formula::{Atom, Decorated, Formula, PatternAtom, RegexAtom};

/// Helper to normalize a legacy formula and extract the node. fn normalize_legacy(formula: LegacyFormula) -> Formula { let principal = SearchQueryPrincipal::Legacy(formula); normalize_search_principal(&principal, None) .expect("should normalize") .node } /// Helper to normalize a legacy formula and keep the decorated wrapper. /// /// This is useful for tests that need to inspect `where_clauses` or other /// metadata attached to the normalized formula. fn normalize_legacy_decorated(formula: LegacyFormula) -> Decorated<Formula> { let principal = SearchQueryPrincipal::Legacy(formula); normalize_search_principal(&principal, None).expect("should normalize") } /// Helper to normalize a v2 match formula and extract the node. fn normalize_v2(formula: MatchFormula) -> Formula { let principal = SearchQueryPrincipal::Match(formula); normalize_search_principal(&principal, None) .expect("should normalize") .node } /// Normalization of `LegacyFormula::Patterns` should: /// - produce an outer `Formula::And` node combining the pattern formulas, and /// - attach any constraints as `where_clauses` on the outer `Decorated` node, /// while all child formulas have empty `where_clauses`. #[test] fn normalize_legacy_patterns_propagates_constraints_to_where_clauses() { // NOTE: The exact constructors for `LegacyFormula::Pattern` / constraint // variants should match the rest of the tests in this module. If the // field names differ, adjust accordingly. // // Two simple pattern formulas: let pattern_formula_1 = LegacyFormula::Pattern { // Example pattern string; align with existing tests if they use // different patterns or additional fields. pattern: "foo($x)".to_string(), // Fill in any additional fields that the real `Pattern` variant // expects (e.g. language, is_regexp, etc.). ..Default::default() }; let pattern_formula_2 = LegacyFormula::Pattern { pattern: "bar($x)".to_string(), ..Default::default() }; // One or more constraint clauses that should end up as `where_clauses` // on the enclosing `And` in the normalized representation. // // This assumes there is some constraint-like legacy variant; adjust the // variant / fields to whatever is used elsewhere in the codebase for // capture constraints. let constraint_formula = LegacyFormula::Constraint { // Example: `$x` is constrained to some value. // Replace these fields with the actual ones from the enum. variable: "x".to_string(), value: LegacyValue::String("constrained".to_string()), }; let legacy = LegacyFormula::Patterns { // These are the pattern formulas that should become the children // of the normalized `Formula::And`. patterns: vec![pattern_formula_1, pattern_formula_2], // These are the constraints that `normalize_legacy_patterns` is // expected to map into `where_clauses` on the enclosing `And`. constraints: vec![constraint_formula], }; let decorated = normalize_legacy_decorated(legacy); // Outer node must be an `And` combining the pattern formulas. let children = match &decorated.node { Formula::And(children) => children, other => panic!("expected normalized legacy Patterns to be And, got {:?}", other), }; // We expect exactly the two pattern children we provided above. assert_eq!( children.len(), 2, "expected two child formulas in normalized And for LegacyFormula::Patterns" ); // All constraints should be attached as `where_clauses` on the outer // decorated node. assert!( !decorated.where_clauses.is_empty(), "expected where_clauses on outer Decorated<Formula> for LegacyFormula::Patterns constraints" ); // Child formulas themselves must not carry any where_clauses; they should // all be attached to the enclosing `And`. for (idx, child) in children.iter().enumerate() { assert!( child.where_clauses.is_empty(), "expected child formula {} of Patterns And to have empty where_clauses", idx ); } }

The above test assumes specific shapes for the LegacyFormula variants and Decorated type that may differ slightly from your actual implementation. To integrate cleanly you should:

Adjust the LegacyFormula constructors:

Replace LegacyFormula::Pattern { pattern, ..Default::default() } with the exact syntax used in your existing legacy normalization tests (same fields, same variant name).

Replace LegacyFormula::Constraint { variable, value } with the actual constraint variant used for legacy pattern constraints (for example it might be something like LegacyFormula::Where, LegacyFormula::CaptureConstraint, or similar).

Update the LegacyFormula::Patterns { patterns, constraints } fields (patterns, constraints) to match the real field names for the Patterns variant.

Update the Decorated type usage:

If Decorated lives in a different module or is re-exported under a different path, update the import (use sempai_core::formula::{Atom, Decorated, Formula, PatternAtom, RegexAtom};) accordingly.

If the fields on Decorated are named differently (e.g. node vs formula, where_clauses vs constraints), adjust the helper normalize_legacy_decorated return type and the test field accesses to match.

Ensure Default::default() is valid:

If the LegacyFormula::Pattern variant does not support the struct-update ..Default::default() (e.g. it’s not #[derive(Default)] or doesn’t use struct syntax), replace those with complete explicit construction as used in the existing tests in this file.

Align where-clause type checks:

If where_clauses is not a Vec<_> or has a different accessor, update the assert!(!decorated.where_clauses.is_empty()) and child where_clauses assertions to match the real API.

sourcery-ai · 2026-04-22T08:58:38Z

+}
+
+#[test]
+fn v2_pattern_shorthand_normalizes_to_atom() {


suggestion (testing): Extend v2 normalization tests to cover MatchFormula::Decorated metadata (where/as/fix) and span propagation.

Current v2 tests only cover the structural mapping for pattern/regex/all/any/not/inside/anywhere, not the MatchFormula::Decorated { .. } variant. Since normalize_match maps where into where_clauses and copies as_name/fix onto Decorated, please add a test that builds a decorated formula with non-empty where/as/fix and asserts those fields are preserved. It would also be helpful to add a test that calls normalize_search_principal with a non-None rule_span and verifies the resulting span is set, to cover span propagation through normalization.

Suggested implementation:

#[test] fn v2_decorated_metadata_is_preserved() { // Build a decorated match formula with non-empty where/as/fix metadata let inner = MatchFormula::Pattern(String::from("foo($X)")); let mf = MatchFormula::Decorated { formula: Box::new(inner), // Use a minimal where-clause that your existing types support // (e.g. WhereClause::Raw, WhereClause::Expr, etc.) where_clauses: vec![WhereClause::Raw(String::from("foo == bar"))], as_name: Some(String::from("my_capture")), fix: Some(String::from("apply_fix")), span: None, }; let result = normalize_v2(mf); match result { Formula::Decorated(decorated) => { assert_eq!(decorated.as_name.as_deref(), Some("my_capture")); assert_eq!(decorated.fix.as_deref(), Some("apply_fix")); assert_eq!(decorated.where_clauses.len(), 1); // Adjust this assertion if WhereClause doesn't implement Display/to_string assert_eq!(decorated.where_clauses[0].to_string(), "foo == bar"); } _ => panic!("expected Decorated formula"), } } #[test] fn v2_span_is_propagated_from_search_principal() { // Build a simple principal with a pattern formula let principal = SearchPrincipal { formula: MatchFormula::Pattern(String::from("foo($X)")), span: None, // Add any additional required fields here to satisfy the struct definition. }; // Use whatever span type/constructor your codebase exposes for rule spans let rule_span = Some(Span::from_offset(0, 10)); let normalized = normalize_search_principal(rule_span.clone(), principal); assert_eq!(normalized.span, rule_span); } #[test] fn legacy_pattern_regex_normalizes_to_regex_atom() {

Because only a fragment of the file and type definitions is visible, you will need to align the new tests with your existing types:

WhereClause construction

Replace WhereClause::Raw(String::from("foo == bar")) with the concrete variant/constructor your codebase uses to represent a simple where clause.

If WhereClause does not implement Display, replace decorated.where_clauses[0].to_string() with a field-level assertion that matches how normalize_match stores the textual predicate (e.g. decorated.where_clauses[0].expr.source or similar).

Formula / Decorated shape

If your normalized type is something like Formula::Decorated(DecoratedFormula { formula, where_clauses, as_name, fix, .. }), update the pattern in the match arm to match the real struct/field names.

If normalize_v2 returns a wrapper (e.g. NormalizedMatch) instead of Formula directly, destructure that wrapper first and then match on its formula field.

SearchPrincipal and span type

Replace the SearchPrincipal literal with the real struct fields required in your codebase (e.g. rule, language, etc.).

Replace Span::from_offset(0, 10) with the constructor or literal you normally use for rule spans (e.g. Span { start: 0, end: 10 }, RuleSpan::new(0, 10), or TextRange::new(...)).

Adjust the field name span on SearchPrincipal and normalized if your struct uses a different name (e.g. rule_span or range).

Import paths

Ensure the test module has use statements for MatchFormula, Formula, WhereClause, SearchPrincipal, Span (or your span type), normalize_v2, and normalize_search_principal from the correct modules.

sourcery-ai · 2026-04-22T08:58:38Z

+}
+
+#[test]
+fn legacy_pattern_not_inside_normalizes_to_not_inside() {


suggestion (testing): Consider adding tests for legacy pattern-either and semgrep-internal-pattern-anywhere normalization.

Current legacy tests don’t cover LegacyFormula::PatternEither or LegacyFormula::Anywhere (for semgrep-internal-pattern-anywhere). Please add tests that

verify pattern-either: [...] normalizes to Formula::Or with the correct children, and

verify semgrep-internal-pattern-anywhere normalizes to Formula::Anywhere wrapping the child formula,
so the legacy-to-canonical mapping is fully exercised.

Suggested implementation:

#[test] fn legacy_pattern_not_inside_normalizes_to_not_inside() { let result = normalize_legacy(LegacyFormula::PatternNotInside(Box::new( LegacyValue::String(String::from("class Foo:")), ))); match result { Formula::Not(not_inner) => match not_inner.node { Formula::Inside(inside_inner) => match inside_inner.node { Formula::Atom(Atom::Pattern(pat)) => { assert_eq!(pat.text, "class Foo:"); } _ => panic!("expected Pattern atom"), }, _ => panic!("expected Inside formula inside Not"), }, _ => panic!("expected Not formula"), } } #[test] fn legacy_pattern_either_normalizes_to_or() { let result = normalize_legacy(LegacyFormula::PatternEither(vec![ LegacyFormula::Pattern(Box::new(LegacyValue::String(String::from("first pattern")))), LegacyFormula::Pattern(Box::new(LegacyValue::String(String::from("second pattern")))), ])); match result { Formula::Or(children) => { assert_eq!(children.len(), 2, "expected two children in Or formula"); match &children[0].node { Formula::Atom(Atom::Pattern(pat)) => { assert_eq!(pat.text, "first pattern"); } _ => panic!("expected first child of Or to be a Pattern atom"), } match &children[1].node { Formula::Atom(Atom::Pattern(pat)) => { assert_eq!(pat.text, "second pattern"); } _ => panic!("expected second child of Or to be a Pattern atom"), } } _ => panic!("expected Or formula from legacy pattern-either"), } } #[test] fn legacy_anywhere_normalizes_to_anywhere() { // This corresponds to legacy `semgrep-internal-pattern-anywhere` with a single pattern child. let result = normalize_legacy(LegacyFormula::Anywhere(Box::new( LegacyFormula::Pattern(Box::new(LegacyValue::String(String::from( "pattern anywhere", )))), ))); match result { Formula::Anywhere(inner) => match inner.node { Formula::Atom(Atom::Pattern(pat)) => { assert_eq!(pat.text, "pattern anywhere"); } _ => panic!("expected Pattern atom inside Anywhere"), }, _ => panic!("expected Anywhere formula from semgrep-internal-pattern-anywhere"),

The above edits assume the following shapes, which you should confirm in your codebase and adjust if necessary:

LegacyFormula::PatternEither takes a Vec<LegacyFormula>. If it instead takes a Vec<LegacyValue> or a different wrapper type, wrap the elements accordingly in the test.

LegacyFormula::Pattern is constructed as LegacyFormula::Pattern(Box<LegacyValue::String(...)>). If the constructor signature differs, update the test calls.

LegacyFormula::Anywhere is defined as wrapping a Box<LegacyFormula>. If it instead wraps a LegacyValue directly, change the argument to Anywhere to match.

Formula::Or is assumed to contain a Vec<Spanned<Formula>> (or similar) where you access the formula via .node. If its inner type is different, update the pattern match on Formula::Or(children) and indexing of children[0] / children[1] accordingly.
Once these shapes are confirmed and aligned, the tests will fully cover normalization from legacy pattern-either and semgrep-internal-pattern-anywhere to Formula::Or and Formula::Anywhere, respectively.

sourcery-ai · 2026-04-22T08:58:38Z

+}
+
+#[test]
+fn valid_or_with_positive_branches_passes() {


suggestion (testing): Add a sanity test that a single positive atom formula passes validation untouched.

Current tests cover And/Or combinations, but not the simplest case. Please add a test where the root is a single Atom (e.g. make_pattern("foo")) and assert that validate_formula returns Ok(()) to guard against future regressions on trivially valid formulas.

sourcery-ai · 2026-04-22T08:58:38Z

+
+## Context and orientation
+
+The Sempai query pipeline currently stops at a `NOT_IMPLEMENTED` placeholder


suggestion (review_instructions): ASTs are referenced later in this section without defining the acronym.

In the paragraph beginning "The parser produces LegacyFormula and MatchFormula ASTs", please expand AST on first use, for example "abstract syntax trees (ASTs)".

Review instructions:

Path patterns: **/*.md

Instructions:
Define uncommon acronyms on first use.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b41c038c28

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-22T09:01:08Z

+        file.rules()
+            .iter()
+            .filter_map(|rule| {
+                if let RulePrincipal::Search(principal) = rule.principal() {
+                    Some((rule, principal))


Align BDD scenarios with new compile_yaml success path

This change makes compile_yaml return Ok(Vec<QueryPlan>) for valid search rules, but the behavioural scenarios in crates/sempai/tests/features/sempai_engine.feature still expect NOT_IMPLEMENTED for valid YAML and r2c-internal-project-depends-on. With this new path, those scenarios now hit then_compilation_fails on successful results and fail the behaviour test suite, so CI will break unless the feature expectations are updated to assert successful compilation.

Useful? React with 👍 / 👎.

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/sempai/src/tests/engine_tests.rs (1)

68-98: 🛠️ Refactor suggestion | 🟠 Major

Collapse duplicated compile-yaml normalisation tests into one parameterized test.

Unify the two near-identical tests into a single #[rstest] case table and keep one assertion body.

♻️ Proposed refactor

+use rstest::rstest;
 use crate::engine::QueryPlan;
 use crate::{
     Diagnostic, DiagnosticCode, DiagnosticReport, Engine, EngineConfig, EngineLimits, Language,
 };
 use sempai_core::formula::{Atom, Decorated, Formula, PatternAtom};
@@
-#[test]
-fn compile_yaml_normalizes_and_returns_query_plans() {
-    let engine = default_engine();
-    let result = engine.compile_yaml(
-        "rules:\n  - id: demo.rule\n    message: oops\n    languages: [rust]\n    severity: ERROR\n    pattern: foo($X)\n",
-    );
-    let plans = result.expect("should successfully compile");
-    assert_eq!(plans.len(), 1);
-    let plan = plans.first().expect("should have first plan");
-    assert_eq!(plan.rule_id(), "demo.rule");
-    assert_eq!(plan.language(), Language::Rust);
-}
-
-#[test]
-fn compile_yaml_normalizes_project_depends_on_search_rule() {
-    let engine = default_engine();
-    let result = engine.compile_yaml(concat!(
-        "rules:\n",
-        "  - id: demo.depends\n",
-        "    message: detect vulnerable dependency\n",
-        "    languages: [python]\n",
-        "    severity: WARNING\n",
-        "    r2c-internal-project-depends-on:\n",
-        "      namespace: pypi\n",
-        "      package: requests\n",
-    ));
-    let plans = result.expect("should successfully compile");
-    assert_eq!(plans.len(), 1);
-    let plan = plans.first().expect("should have first plan");
-    assert_eq!(plan.rule_id(), "demo.depends");
-    assert_eq!(plan.language(), Language::Python);
-}
+#[rstest]
+#[case(
+    "rules:\n  - id: demo.rule\n    message: oops\n    languages: [rust]\n    severity: ERROR\n    pattern: foo($X)\n",
+    "demo.rule",
+    Language::Rust
+)]
+#[case(
+    concat!(
+        "rules:\n",
+        "  - id: demo.depends\n",
+        "    message: detect vulnerable dependency\n",
+        "    languages: [python]\n",
+        "    severity: WARNING\n",
+        "    r2c-internal-project-depends-on:\n",
+        "      namespace: pypi\n",
+        "      package: requests\n",
+    ),
+    "demo.depends",
+    Language::Python
+)]
+fn compile_yaml_normalizes_and_returns_query_plans(
+    #[case] yaml: &str,
+    #[case] expected_rule_id: &str,
+    #[case] expected_language: Language,
+) {
+    let engine = default_engine();
+    let plans = engine
+        .compile_yaml(yaml)
+        .expect("should successfully compile");
+    assert_eq!(plans.len(), 1);
+    let plan = plans.first().expect("should have first plan");
+    assert_eq!(plan.rule_id(), expected_rule_id);
+    assert_eq!(plan.language(), expected_language);
+}

As per coding guidelines, "Use rstest fixtures for shared setup in Rust tests" and "Replace duplicated tests with #[rstest(...)] parameterised cases in Rust."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/sempai/src/tests/engine_tests.rs` around lines 68 - 98, Combine the
two tests compile_yaml_normalizes_and_returns_query_plans and
compile_yaml_normalizes_project_depends_on_search_rule into one parameterized
rstest that iterates cases of (yaml, expected_rule_id, expected_language); reuse
the same default_engine(), call engine.compile_yaml(yaml) and the shared
assertions plan.rule_id() == expected_rule_id and plan.language() ==
expected_language (e.g., Language::Rust and Language::Python) inside the single
test body to remove duplication; annotate the new test with #[rstest] and
provide case table entries containing the two existing YAML strings and their
expected values, keeping the same calls to default_engine and
engine.compile_yaml.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/sempai/src/normalize.rs`:
- Around line 252-264: The current normalize_dependency_principal uses a
TreeSitterQueryAtom with query "(ERROR) `@_dependency_check`" which can match real
ERROR nodes; change the query to use a node type that can never exist (e.g.
"__NONEXISTENT_NODE__") so the pattern is truly non-matchable: in
normalize_dependency_principal replace the query string in TreeSitterQueryAtom
(the call inside bare(...) that constructs
Formula::Atom(Atom::TreeSitterQuery(TreeSitterQueryAtom { query: ... }))) with a
deliberately impossible node name like "(__NONEXISTENT_NODE__)
`@_dependency_check`" (and update the inline comment to explain why).

In `@crates/sempai/src/semantic_check.rs`:
- Around line 48-76: The current check_no_not_in_or() only rejects a
Formula::Not at the branch root, allowing nested negations like Or([And([Atom,
Not])]) to slip through; add a helper fn branch_contains_not(formula: &Formula)
-> bool that recursively returns true for Formula::Not and traverses
Formula::And, Formula::Or, Formula::Inside, and Formula::Anywhere branches
(false for Atom), then in check_no_not_in_or replace the matches!(branch.node,
Formula::Not(_)) test with if branch_contains_not(&branch.node) to reject any
branch that contains a negation anywhere and keep the existing error
construction using branch.span or the outer span.
- Around line 83-125: The current is_positive_term wrongly treats any And/Or as
positive by shape; change is_positive_term (and remove const) to recursively
inspect And/Or branches so it returns true only if an Atom exists in its
descendants (i.e., match on Formula::Atom => true; Formula::And(branches) |
Formula::Or(branches) => branches.iter().any(|b| is_positive_term(&b.node));
Formula::Not | Formula::Inside | Formula::Anywhere => false), leaving
check_positive_term_in_and as-is to call the new function.

In `@docs/execplans/4-1-5-normalization-into-canonical-formula-model.md`:
- Around line 56-66: The doc text is out of sync: update the ExecPlan
description to reflect that semantic validation now lives in
crates/sempai/src/semantic_check.rs (not in sempai_core), and change the
Decorated<T> field names to match the code: where_clauses, as_name, fix, and
span: Option<SourceSpan>; also mention that sempai_core still owns Formula,
Atom, Decorated, WhereClause types, sempai_yaml owns LegacyFormula/MatchFormula,
and sempai (facade) wires parsing→normalization→validation→plan construction;
apply the same corrections to the repeated block referenced around lines
212-218.

---

Outside diff comments:
In `@crates/sempai/src/tests/engine_tests.rs`:
- Around line 68-98: Combine the two tests
compile_yaml_normalizes_and_returns_query_plans and
compile_yaml_normalizes_project_depends_on_search_rule into one parameterized
rstest that iterates cases of (yaml, expected_rule_id, expected_language); reuse
the same default_engine(), call engine.compile_yaml(yaml) and the shared
assertions plan.rule_id() == expected_rule_id and plan.language() ==
expected_language (e.g., Language::Rust and Language::Python) inside the single
test body to remove duplication; annotate the new test with #[rstest] and
provide case table entries containing the two existing YAML strings and their
expected values, keeping the same calls to default_engine and
engine.compile_yaml.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0888edfc-62dc-453b-bade-07480e3c886a

📥 Commits

Reviewing files that changed from the base of the PR and between 90ba5e4 and b41c038.

📒 Files selected for processing (16)

crates/sempai-core/Cargo.toml
crates/sempai-core/src/formula.rs
crates/sempai-core/src/lib.rs
crates/sempai/src/engine.rs
crates/sempai/src/lib.rs
crates/sempai/src/normalize.rs
crates/sempai/src/semantic_check.rs
crates/sempai/src/tests/behaviour.rs
crates/sempai/src/tests/engine_tests.rs
crates/sempai/src/tests/mod.rs
crates/sempai/src/tests/normalization_tests.rs
crates/sempai/src/tests/semantic_validation_tests.rs
docs/execplans/4-1-5-normalization-into-canonical-formula-model.md
docs/roadmap.md
docs/sempai-query-language-design.md
docs/users-guide.md

coderabbitai · 2026-04-22T09:08:05Z

+    match formula {
+        Formula::Or(branches) => {
+            for branch in branches {
+                if matches!(branch.node, Formula::Not(_)) {
+                    return Err(DiagnosticReport::validation_error(
+                        DiagnosticCode::ESempaiInvalidNotInOr,
+                        String::from(
+                            "negated terms are not allowed inside disjunction (Or/pattern-either)",
+                        ),
+                        branch.span.clone().or_else(|| span.cloned()),
+                        vec![],
+                    ));
+                }
+                // Recursively check nested formulas
+                check_no_not_in_or(&branch.node, branch.span.as_ref().or(span))?;
+            }
+            Ok(())
+        }
+        Formula::And(branches) => {
+            for branch in branches {
+                check_no_not_in_or(&branch.node, branch.span.as_ref().or(span))?;
+            }
+            Ok(())
+        }
+        Formula::Not(inner) | Formula::Inside(inner) | Formula::Anywhere(inner) => {
+            check_no_not_in_or(&inner.node, inner.span.as_ref().or(span))
+        }
+        Formula::Atom(_) => Ok(()),
+    }


⚠️ Potential issue | 🟠 Major

Reject negation anywhere inside an Or branch.

Fail the whole branch, not just Formula::Not at the top level. Or([And([Atom(..), Not(..)]), Atom(..)]) currently passes because the recursion drops the disjunction context before it reaches the nested Not.

Patch sketch

+fn branch_contains_not(formula: &Formula) -> bool { + match formula { + Formula::Not(_) => true, + Formula::And(branches) | Formula::Or(branches) => { + branches.iter().any(|branch| branch_contains_not(&branch.node)) + } + Formula::Inside(inner) | Formula::Anywhere(inner) => { + branch_contains_not(&inner.node) + } + Formula::Atom(_) => false, + } +} + fn check_no_not_in_or( formula: &Formula, span: Option<&sempai_core::SourceSpan>, ) -> Result<(), DiagnosticReport> { match formula { Formula::Or(branches) => { for branch in branches { - if matches!(branch.node, Formula::Not(_)) { + if branch_contains_not(&branch.node) { return Err(DiagnosticReport::validation_error( DiagnosticCode::ESempaiInvalidNotInOr, String::from( "negated terms are not allowed inside disjunction (Or/pattern-either)", ), branch.span.clone().or_else(|| span.cloned()), vec![], )); } check_no_not_in_or(&branch.node, branch.span.as_ref().or(span))?; } Ok(()) }

Based on learnings: "Semantic validation of normalized formulas must emit E_SEMPAI_INVALID_NOT_IN_OR when negated terms appear in disjunction branches."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/sempai/src/semantic_check.rs` around lines 48 - 76, The current check_no_not_in_or() only rejects a Formula::Not at the branch root, allowing nested negations like Or([And([Atom, Not])]) to slip through; add a helper fn branch_contains_not(formula: &Formula) -> bool that recursively returns true for Formula::Not and traverses Formula::And, Formula::Or, Formula::Inside, and Formula::Anywhere branches (false for Atom), then in check_no_not_in_or replace the matches!(branch.node, Formula::Not(_)) test with if branch_contains_not(&branch.node) to reject any branch that contains a negation anywhere and keep the existing error construction using branch.span or the outer span.

Address code review comments on the normalization implementation: Improvements to semantic_check.rs: - MissingPositiveTermInAnd diagnostic now prefers the first branch's span for a more precise error location, falling back to the enclosing And's span only when no branch has one. New tests in engine_tests.rs: - compile_yaml_returns_semantic_error_for_invalid_not_in_or: verifies that pattern-either containing a pattern-not surfaces as ESempaiInvalidNotInOr through the public API. - compile_yaml_returns_semantic_error_for_missing_positive_term_in_and: verifies that patterns with only constraint-style terms surface as ESempaiMissingPositiveTermInAnd through the public API. New tests in normalization_tests.rs: - legacy_pattern_either_normalizes_to_or: exercises pattern-either legacy-to-canonical mapping. - legacy_anywhere_normalizes_to_anywhere: exercises semgrep-internal-pattern-anywhere legacy-to-canonical mapping. - legacy_patterns_propagates_constraints_to_where_clauses: verifies that LegacyFormula::Patterns constraints are mapped onto the enclosing And's where_clauses, with child formulas having none. - v2_decorated_preserves_where_as_and_fix_metadata: verifies that MatchFormula::Decorated metadata (where, as, fix) is preserved on the canonical Decorated wrapper. - span_propagates_from_search_principal_to_decorated: verifies that the rule_span is propagated through normalization. New test in semantic_validation_tests.rs: - single_positive_atom_passes_validation: sanity check that the simplest valid formula (a single Atom) passes validate_formula. Documentation fixes: - Normalize "normalisation" to "normalization" in the ExecPlan to match the document's title and surrounding text. - Expand ASTs acronym on first use. Test count: 33 -> 41 passing tests in sempai crate. All clippy, rustfmt, markdownlint, and cargo test checks pass. Note: the code review also flagged a potential Eq-derive issue on WhereClause; this is a false positive as serde_json::Value implements Eq. The existing derive compiles cleanly and all workspace lints pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/execplans/4-1-5-normalization-into-canonical-formula-model.md`:
- Around line 584-595: The docs show an outdated Decorated<T> interface; update
the "Interfaces and dependencies" section and any other occurrences to match the
real struct fields: ensure the struct is documented with fields node,
where_clauses, as_name, fix, and span (matching the implemented Decorated<T>),
and remove or rename any doc references to old field names so the documentation
reflects the shipped API exactly (search for Decorated<T>, where_clauses,
as_name, fix, span to locate all places to change).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 784bdbf5-23c5-466f-8d60-f529b92e5629

📥 Commits

Reviewing files that changed from the base of the PR and between b41c038 and 46b9eb7.

📒 Files selected for processing (5)

crates/sempai/src/semantic_check.rs
crates/sempai/src/tests/engine_tests.rs
crates/sempai/src/tests/normalization_tests.rs
crates/sempai/src/tests/semantic_validation_tests.rs
docs/execplans/4-1-5-normalization-into-canonical-formula-model.md

Normalization has no real error paths today, so wrapping the result in Result<_, DiagnosticReport> required an `#[expect(clippy::unnecessary_wraps)]` escape hatch and forced every caller to use `?`. Addressing review feedback, return a bare `Decorated<Formula>` instead and document that the signature should grow a Result only when a mapping first needs to signal a structural error. Callers in engine.rs and the normalization tests are updated to drop the `?` / `.expect(...)` noise. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 4

♻️ Duplicate comments (1)

crates/sempai/src/normalize.rs (1)

252-258: ⚠️ Potential issue | 🟠 Major

Replace the matchable (ERROR) dependency placeholder.

Fix Line 257. Stop emitting a query that can match parser recovery nodes in malformed source. Encode a guaranteed non-match placeholder path instead of a live query shape.

Safer placeholder strategy (non-match by contradiction)

-        Formula::Atom(Atom::TreeSitterQuery(TreeSitterQueryAtom {
-            query: String::from("(ERROR) `@_dependency_check`"),
-        })),
+        Formula::Atom(Atom::TreeSitterQuery(TreeSitterQueryAtom {
+            // Keep this syntactically valid while making it semantically impossible to match.
+            query: String::from(
+                "(ERROR) `@_dependency_check` (`#eq`? `@_dependency_check` \"__sempai_never__\") (`#eq`? `@_dependency_check` \"__sempai_also_never__\")",
+            ),
+        })),

In Tree-sitter query semantics, does `(ERROR) `@cap`` match parse-recovery ERROR nodes in malformed syntax trees?

Are multiple predicates on the same capture conjunctive, such that contradictory `#eq?` predicates force a no-match result?

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/sempai/src/normalize.rs` around lines 252 - 258, Replace the live
Tree-sitter query string "(ERROR) `@_dependency_check`" with a guaranteed
non-matching placeholder so the rule cannot accidentally match parser-recovery
nodes; update the TreeSitterQueryAtom creation in the
bare(Formula::Atom(Atom::TreeSitterQuery(TreeSitterQueryAtom { query: ... })))
expression to use a non-existent node name or an explicitly contradictory
predicate (for example a node type like "__never_node__" or a capture with a
contradiction) instead of "(ERROR) `@_dependency_check`" so the query is
impossible to satisfy at parse time.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/sempai/src/engine.rs`:
- Around line 17-39: The code currently clones Decorated<Formula> for each
language in compile_rule_plans; change QueryPlan to hold Arc<Decorated<Formula>>
and update compile_rule_plans to wrap the incoming formula in Arc::new(...) once
and clone the Arc for each QueryPlan instead of deep-cloning the formula. Update
other call sites mentioned (the blocks at the other two places referenced) to
construct QueryPlan with Arc<Decorated<Formula>> as well, and adjust
signature/usages of QueryPlan::new and any fields to accept
Arc<Decorated<Formula>> so the formula is shared across per-language plans.

In `@crates/sempai/src/normalize.rs`:
- Line 25: Update the malformed module doc bullet that currently reads //! -
`regex: "...` → `Formula::Atom(Atom::Regex(...))` by closing the inline code
literal (change it to //! - `regex: "..."` → `Formula::Atom(Atom::Regex(...))`)
so the backtick pair is balanced; locate the module-level doc comment containing
that snippet in normalize.rs and fix the punctuation to keep generated docs
syntactically valid.

In `@crates/sempai/src/tests/normalization_tests.rs`:
- Around line 260-267: Add a new unit test that constructs a
SearchQueryPrincipal::ProjectDependsOn variant and calls
normalize_search_principal with a non‑None SourceSpan to ensure the
ProjectDependsOn branch is exercised; assert the returned normalized principal
has the canonical shape you expect for ProjectDependsOn (match the same
structural form used elsewhere in tests) and that normalized.span == Some(span)
to lock metadata propagation for this branch. Locate the normalization
entrypoint normalize_search_principal and mirror the style of the existing
span_propagates_from_search_principal_to_decorated test: create a SourceSpan,
build a ProjectDependsOn principal value, call
normalize_search_principal(&principal, Some(&span)), then add explicit
assertions for both the normalized principal variant/contents and the span
propagation.
- Around line 37-154: Multiple small tests repeat the same
normalize_legacy/normalize_v2 assertions; replace these duplicated single-case
tests (e.g., legacy_pattern_normalizes_to_atom,
legacy_pattern_regex_normalizes_to_regex_atom,
v2_pattern_shorthand_normalizes_to_atom, v2_regex_normalizes_to_regex_atom,
v2_all_normalizes_to_and, v2_any_normalizes_to_or, v2_not_normalizes_to_not,
v2_inside_normalizes_to_inside, v2_anywhere_normalizes_to_anywhere) with
parameterized rstest table cases that iterate inputs
(LegacyFormula::Pattern/PatternRegex/... and
MatchFormula::Pattern/Regex/All/Any/Not/Inside/Anywhere) and expected normalized
predicate types from normalize_legacy and normalize_v2; keep the
nested-structure tests (like legacy_pattern_not_regex_normalizes_to_not_regex
and legacy_pattern_not_inside_normalizes_to_not_inside) as bespoke tests. Use
#[rstest] with case(...) entries and assert matches!(result, ...) or equality as
appropriate, referencing normalize_legacy and normalize_v2 in the fixture.

---

Duplicate comments:
In `@crates/sempai/src/normalize.rs`:
- Around line 252-258: Replace the live Tree-sitter query string "(ERROR)
`@_dependency_check`" with a guaranteed non-matching placeholder so the rule
cannot accidentally match parser-recovery nodes; update the TreeSitterQueryAtom
creation in the bare(Formula::Atom(Atom::TreeSitterQuery(TreeSitterQueryAtom {
query: ... }))) expression to use a non-existent node name or an explicitly
contradictory predicate (for example a node type like "__never_node__" or a
capture with a contradiction) instead of "(ERROR) `@_dependency_check`" so the
query is impossible to satisfy at parse time.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c1d318bc-36dd-49fe-aafc-3c221f1544a6

📥 Commits

Reviewing files that changed from the base of the PR and between 46b9eb7 and bd3f5ee.

📒 Files selected for processing (3)

crates/sempai/src/engine.rs
crates/sempai/src/normalize.rs
crates/sempai/src/tests/normalization_tests.rs

coderabbitai · 2026-04-23T23:13:48Z

+fn compile_rule_plans(
+    rule: &Rule,
+    formula: &Decorated<Formula>,
+) -> Result<Vec<QueryPlan>, DiagnosticReport> {
+    rule.languages()
+        .iter()
+        .map(|lang_str| {
+            let language = lang_str.parse::<Language>().map_err(|e| {
+                DiagnosticReport::validation_error(
+                    DiagnosticCode::ESempaiSchemaInvalid,
+                    format!("unsupported language '{lang_str}': {e}"),
+                    rule.rule_span().cloned(),
+                    vec![],
+                )
+            })?;
+            Ok(QueryPlan::new(
+                rule.id().to_owned(),
+                language,
+                formula.clone(),
+            ))
+        })
+        .collect()
+}


🧹 Nitpick | 🔵 Trivial

Share canonical formula instances across per-language plans.

Stop deep-cloning Decorated<Formula> for every language. Store the formula as Arc<Decorated<Formula>> in QueryPlan and clone the Arc in compile_rule_plans.

Also applies to: 58-63, 72-77

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/sempai/src/engine.rs` around lines 17 - 39, The code currently clones Decorated<Formula> for each language in compile_rule_plans; change QueryPlan to hold Arc<Decorated<Formula>> and update compile_rule_plans to wrap the incoming formula in Arc::new(...) once and clone the Arc for each QueryPlan instead of deep-cloning the formula. Update other call sites mentioned (the blocks at the other two places referenced) to construct QueryPlan with Arc<Decorated<Formula>> as well, and adjust signature/usages of QueryPlan::new and any fields to accept Arc<Decorated<Formula>> so the formula is shared across per-language plans.

coderabbitai · 2026-04-23T23:13:48Z

+//!
+//! - `"..."` (string shorthand) → `Formula::Atom(Atom::Pattern(...))`
+//! - `pattern: "..."` → `Formula::Atom(Atom::Pattern(...))`
+//! - `regex: "...` → `Formula::Atom(Atom::Regex(...))`


⚠️ Potential issue | 🟡 Minor

Fix the malformed v2 regex doc bullet.

Correct Line 25 to close the inline literal (\regex: "..."``). Keep module docs syntactically clean for generated documentation.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/sempai/src/normalize.rs` at line 25, Update the malformed module doc bullet that currently reads //! - `regex: "...` → `Formula::Atom(Atom::Regex(...))` by closing the inline code literal (change it to //! - `regex: "..."` → `Formula::Atom(Atom::Regex(...))`) so the backtick pair is balanced; locate the module-level doc comment containing that snippet in normalize.rs and fix the punctuation to keep generated docs syntactically valid.

coderabbitai · 2026-04-23T23:13:48Z

+#[test]
+fn legacy_pattern_normalizes_to_atom() {
+    let result = normalize_legacy(LegacyFormula::Pattern(String::from("foo($X)")));
+    assert_eq!(
+        result,
+        Formula::Atom(Atom::Pattern(PatternAtom {
+            text: String::from("foo($X)")
+        }))
+    );
+}
+
+#[test]
+fn legacy_pattern_regex_normalizes_to_regex_atom() {
+    let result = normalize_legacy(LegacyFormula::PatternRegex(String::from(r"foo_\d+")));
+    assert_eq!(
+        result,
+        Formula::Atom(Atom::Regex(RegexAtom {
+            pattern: String::from(r"foo_\d+")
+        }))
+    );
+}
+
+#[test]
+fn legacy_pattern_not_regex_normalizes_to_not_regex() {
+    let result = normalize_legacy(LegacyFormula::PatternNotRegex(String::from(r"bar.*")));
+    match result {
+        Formula::Not(inner) => match inner.node {
+            Formula::Atom(Atom::Regex(regex)) => {
+                assert_eq!(regex.pattern, r"bar.*");
+            }
+            _ => panic!("expected Regex atom inside Not"),
+        },
+        _ => panic!("expected Not formula"),
+    }
+}
+
+#[test]
+fn legacy_pattern_not_inside_normalizes_to_not_inside() {
+    let result = normalize_legacy(LegacyFormula::PatternNotInside(Box::new(
+        LegacyValue::String(String::from("class Foo:")),
+    )));
+    match result {
+        Formula::Not(not_inner) => match not_inner.node {
+            Formula::Inside(inside_inner) => match inside_inner.node {
+                Formula::Atom(Atom::Pattern(pat)) => {
+                    assert_eq!(pat.text, "class Foo:");
+                }
+                _ => panic!("expected Pattern atom"),
+            },
+            _ => panic!("expected Inside formula inside Not"),
+        },
+        _ => panic!("expected Not formula"),
+    }
+}
+
+#[test]
+fn v2_pattern_shorthand_normalizes_to_atom() {
+    let result = normalize_v2(MatchFormula::Pattern(String::from("bar($Y)")));
+    assert_eq!(
+        result,
+        Formula::Atom(Atom::Pattern(PatternAtom {
+            text: String::from("bar($Y)")
+        }))
+    );
+}
+
+#[test]
+fn v2_regex_normalizes_to_regex_atom() {
+    let result = normalize_v2(MatchFormula::Regex(String::from(r"baz_\w+")));
+    assert_eq!(
+        result,
+        Formula::Atom(Atom::Regex(RegexAtom {
+            pattern: String::from(r"baz_\w+")
+        }))
+    );
+}
+
+#[test]
+fn v2_all_normalizes_to_and() {
+    let result = normalize_v2(MatchFormula::All(vec![
+        MatchFormula::Pattern(String::from("foo")),
+        MatchFormula::Pattern(String::from("bar")),
+    ]));
+    assert!(matches!(result, Formula::And(ref branches) if branches.len() == 2));
+}
+
+#[test]
+fn v2_any_normalizes_to_or() {
+    let result = normalize_v2(MatchFormula::Any(vec![
+        MatchFormula::Pattern(String::from("foo")),
+        MatchFormula::Pattern(String::from("bar")),
+    ]));
+    assert!(matches!(result, Formula::Or(ref branches) if branches.len() == 2));
+}
+
+#[test]
+fn v2_not_normalizes_to_not() {
+    let result = normalize_v2(MatchFormula::Not(Box::new(MatchFormula::Pattern(
+        String::from("baz"),
+    ))));
+    assert!(matches!(result, Formula::Not(_)));
+}
+
+#[test]
+fn v2_inside_normalizes_to_inside() {
+    let result = normalize_v2(MatchFormula::Inside(Box::new(MatchFormula::Pattern(
+        String::from("class X:"),
+    ))));
+    assert!(matches!(result, Formula::Inside(_)));
+}
+
+#[test]
+fn v2_anywhere_normalizes_to_anywhere() {
+    let result = normalize_v2(MatchFormula::Anywhere(Box::new(MatchFormula::Pattern(
+        String::from("unsafe"),
+    ))));
+    assert!(matches!(result, Formula::Anywhere(_)));
+}


🛠️ Refactor suggestion | 🟠 Major

Parameterize duplicated normalisation tests with rstest.

Replace the repeated single-case tests with #[rstest] table cases and keep bespoke tests only for nested-structure assertions.

Refactor pattern to remove duplication

+use rstest::rstest; + #[test] -fn v2_pattern_shorthand_normalizes_to_atom() { - let result = normalize_v2(MatchFormula::Pattern(String::from("bar($Y)"))); - assert_eq!( - result, - Formula::Atom(Atom::Pattern(PatternAtom { - text: String::from("bar($Y)") - })) - ); -} - -#[test] -fn v2_regex_normalizes_to_regex_atom() { - let result = normalize_v2(MatchFormula::Regex(String::from(r"baz_\w+"))); - assert_eq!( - result, - Formula::Atom(Atom::Regex(RegexAtom { - pattern: String::from(r"baz_\w+") - })) - ); +#[rstest] +#[case( + MatchFormula::Pattern(String::from("bar($Y)")), + Formula::Atom(Atom::Pattern(PatternAtom { text: String::from("bar($Y)") })) +)] +#[case( + MatchFormula::Regex(String::from(r"baz_\w+")), + Formula::Atom(Atom::Regex(RegexAtom { pattern: String::from(r"baz_\w+") })) +)] +fn v2_leaf_forms_normalize(#[case] input: MatchFormula, #[case] expected: Formula) { + let result = normalize_v2(input); + assert_eq!(result, expected); }

As per coding guidelines: "Use rstest fixtures for shared setup in Rust tests." and "Replace duplicated tests with #[rstest(...)] parameterised cases in Rust."

Also applies to: 156-193

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/sempai/src/tests/normalization_tests.rs` around lines 37 - 154, Multiple small tests repeat the same normalize_legacy/normalize_v2 assertions; replace these duplicated single-case tests (e.g., legacy_pattern_normalizes_to_atom, legacy_pattern_regex_normalizes_to_regex_atom, v2_pattern_shorthand_normalizes_to_atom, v2_regex_normalizes_to_regex_atom, v2_all_normalizes_to_and, v2_any_normalizes_to_or, v2_not_normalizes_to_not, v2_inside_normalizes_to_inside, v2_anywhere_normalizes_to_anywhere) with parameterized rstest table cases that iterate inputs (LegacyFormula::Pattern/PatternRegex/... and MatchFormula::Pattern/Regex/All/Any/Not/Inside/Anywhere) and expected normalized predicate types from normalize_legacy and normalize_v2; keep the nested-structure tests (like legacy_pattern_not_regex_normalizes_to_not_regex and legacy_pattern_not_inside_normalizes_to_not_inside) as bespoke tests. Use #[rstest] with case(...) entries and assert matches!(result, ...) or equality as appropriate, referencing normalize_legacy and normalize_v2 in the fixture.

coderabbitai · 2026-04-23T23:13:49Z

+#[test]
+fn span_propagates_from_search_principal_to_decorated() {
+    let span = SourceSpan::new(5, 42, Some(String::from("file:///rule.yaml")));
+    let principal = SearchQueryPrincipal::Match(MatchFormula::Pattern(String::from("foo($X)")));
+    let normalized = normalize_search_principal(&principal, Some(&span));
+
+    assert_eq!(normalized.span, Some(span));
+}


⚠️ Potential issue | 🟡 Minor

Add explicit coverage for the ProjectDependsOn normalisation branch.

Add a regression test that constructs SearchQueryPrincipal::ProjectDependsOn(...) and asserts the canonical output shape and metadata propagation. Lock this branch with a dedicated assertion to prevent silent regressions in normalize_search_principal dispatch coverage.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/sempai/src/tests/normalization_tests.rs` around lines 260 - 267, Add a new unit test that constructs a SearchQueryPrincipal::ProjectDependsOn variant and calls normalize_search_principal with a non‑None SourceSpan to ensure the ProjectDependsOn branch is exercised; assert the returned normalized principal has the canonical shape you expect for ProjectDependsOn (match the same structural form used elsewhere in tests) and that normalized.span == Some(span) to lock metadata propagation for this branch. Locate the normalization entrypoint normalize_search_principal and mirror the style of the existing span_propagates_from_search_principal_to_decorated test: create a SourceSpan, build a ProjectDependsOn principal value, call normalize_search_principal(&principal, Some(&span)), then add explicit assertions for both the normalized principal variant/contents and the span propagation.

leynos · 2026-04-24T09:41:41Z

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix. Include the file and symbol names indicated in the issue at the head of your response.

Code Duplication

crates/sempai/src/tests/engine_tests.rs:

What lead to degradation?

The module contains 2 functions with similar structure: compile_yaml_returns_semantic_error_for_invalid_not_in_or,compile_yaml_returns_semantic_error_for_missing_positive_term_in_and

Why does this problem occur?

Duplicated code often leads to code that's harder to change since the same logical change has to be done in multiple functions. More duplication gives lower code health.

How to fix it?

A certain degree of duplicated code might be acceptable. The problems start when it is the same behavior that is duplicated across the functions in the module, ie. a violation of the Don't Repeat Yourself (DRY) principle. DRY violations lead to code that is changed together in predictable patterns, which is both expensive and risky. DRY violations can be identified using CodeScene's X-Ray analysis to detect clusters of change coupled functions with high code similarity. Read More
Once you have identified the similarities across functions, look to extract and encapsulate the concept that varies into its own function(s). These shared abstractions can then be re-used, which minimizes the amount of duplication and simplifies change.

…l success - Semantic validation: make `is_positive_term` recursive so that shapes such as `And[Or[Inside[Atom]]]` are correctly flagged as `MissingPositiveTermInAnd` instead of being accepted on the basis of the outer combinator shape alone. Add a targeted test for the `And[Or[Inside]]` case. - Normalization: replace the dependency-principal placeholder query `(ERROR) @_dependency_check` with `(__NONEXISTENT_NODE__) @_dependency_check` so it cannot match real Tree-sitter ERROR nodes. - BDD: update `sempai_engine.feature` to assert that valid search rules (including `r2c-internal-project-depends-on`) now compile successfully with a single query plan, add the matching `Then` steps, and switch the test world's `compile_result` to `Result<Vec<QueryPlan>, DiagnosticReport>`. - Engine tests: collapse the two `compile_yaml_*success*` cases into a single `#[rstest]` parameterized test covering both the legacy pattern and the project-depends-on search rule. - Docs: correct the ExecPlan to describe `Decorated<T>` with the shipped field names (`where_clauses`, `as_name`, `fix`, `span: Option<SourceSpan>`), and make explicit that normalization lives in `crates/sempai/src/normalize.rs` and semantic validation in `crates/sempai/src/semantic_check.rs`. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (2)

crates/sempai/src/normalize.rs (1)

25-25: ⚠️ Potential issue | 🟡 Minor

Fix the malformed v2 regex doc bullet.

Close the inline literal: change `regex: "...` to `regex: "..."`.

Proposed fix

-//! - `regex: "...` → `Formula::Atom(Atom::Regex(...))`
+//! - `regex: "..."` → `Formula::Atom(Atom::Regex(...))`

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/sempai/src/normalize.rs` at line 25, The doc comment in
crates/sempai/src/normalize.rs contains a malformed inline literal in the v2
regex bullet (`regex: "...`); update that inline backtick sequence to close
properly so it reads `regex: "..."` (i.e., change the broken `` `regex: "..."``
to `` `regex: "..."` ``) to fix the documentation formatting.

crates/sempai/src/semantic_check.rs (1)

48-65: ⚠️ Potential issue | 🟠 Major

Nested negations inside Or branches slip through validation.

The current check at line 51 only detects Formula::Not at the immediate branch level. A structure like Or([And([Atom(..), Not(..)])]) passes because the recursion at line 62 drops the disjunction context before reaching the nested Not.

Add a helper that recursively searches for any Not within a branch before rejecting it.

Proposed fix

+fn branch_contains_not(formula: &Formula) -> bool {
+    match formula {
+        Formula::Not(_) => true,
+        Formula::And(branches) | Formula::Or(branches) => {
+            branches.iter().any(|branch| branch_contains_not(&branch.node))
+        }
+        Formula::Inside(inner) | Formula::Anywhere(inner) => branch_contains_not(&inner.node),
+        Formula::Atom(_) => false,
+    }
+}
+
 fn check_no_not_in_or(
     formula: &Formula,
     span: Option<&sempai_core::SourceSpan>,
 ) -> Result<(), DiagnosticReport> {
     match formula {
         Formula::Or(branches) => {
             for branch in branches {
-                if matches!(branch.node, Formula::Not(_)) {
+                if branch_contains_not(&branch.node) {
                     return Err(DiagnosticReport::validation_error(

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/sempai/src/semantic_check.rs` around lines 48 - 65, The current check
in check_no_not_in_or only rejects immediate Formula::Not children of a
Formula::Or branch but loses the "inside Or" context when it recurses, so nested
Not nodes slip through; modify check_no_not_in_or (or add a small helper like
contains_not_recursive) that, when called on each branch.node of a Formula::Or,
walks the formula tree and returns true if any Formula::Not is found anywhere
inside that branch (without clearing the "in-or" context), and then produce the
existing DiagnosticReport using branch.span.clone().or_else(|| span.cloned()) if
the helper finds a Not; keep using the same
DiagnosticCode::ESempaiInvalidNotInOr and error message text.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/sempai/src/tests/engine_tests.rs`:
- Around line 103-135: Extract a shared test helper or parameterize the two
tests using rstest: move the common setup (default_engine(),
engine.compile_yaml(...), first_diagnostic_of_err(result)) into a helper
function like assert_compile_yaml_semantic_error(yaml: &str) -> DiagnosticCode
and call it from the existing tests, or replace both test functions
(compile_yaml_returns_semantic_error_for_invalid_not_in_or and
compile_yaml_returns_semantic_error_for_missing_positive_term_in_and) with a
single #[rstest] that provides the YAML string and expected DiagnosticCode pairs
and asserts equality using default_engine(), engine.compile_yaml(...),
first_diagnostic_of_err(result), and DiagnosticCode::ESempaiInvalidNotInOr /
DiagnosticCode::ESempaiMissingPositiveTermInAnd; ensure the helper/parametrized
test references the same symbols (default_engine, engine.compile_yaml,
first_diagnostic_of_err, DiagnosticCode) so behavior is unchanged.

In `@crates/sempai/src/tests/semantic_validation_tests.rs`:
- Around line 151-180: Add a new unit test asserting validation fails when a Not
is nested inside an And that is itself a branch of an Or (e.g., Or[ And[ Atom,
Not[...] ], ... ]). Create a test (e.g., nested_not_in_and_inside_or_fails) that
constructs a Decorated node using Formula::Or containing a Decorated
Formula::And which includes make_pattern("atom") and a Decorated Formula::Not,
call validate_formula(&formula), expect_err, and assert the first diagnostic
code equals DiagnosticCode::ESempaiInvalidNotInOr to ensure the recursive
branch_contains_not detection catches deeply nested negations.

---

Duplicate comments:
In `@crates/sempai/src/normalize.rs`:
- Line 25: The doc comment in crates/sempai/src/normalize.rs contains a
malformed inline literal in the v2 regex bullet (`regex: "...`); update that
inline backtick sequence to close properly so it reads `regex: "..."` (i.e.,
change the broken `` `regex: "..."`` to `` `regex: "..."` ``) to fix the
documentation formatting.

In `@crates/sempai/src/semantic_check.rs`:
- Around line 48-65: The current check in check_no_not_in_or only rejects
immediate Formula::Not children of a Formula::Or branch but loses the "inside
Or" context when it recurses, so nested Not nodes slip through; modify
check_no_not_in_or (or add a small helper like contains_not_recursive) that,
when called on each branch.node of a Formula::Or, walks the formula tree and
returns true if any Formula::Not is found anywhere inside that branch (without
clearing the "in-or" context), and then produce the existing DiagnosticReport
using branch.span.clone().or_else(|| span.cloned()) if the helper finds a Not;
keep using the same DiagnosticCode::ESempaiInvalidNotInOr and error message
text.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7ce5cff4-a1da-49ae-ab84-e1a14bd51d32

📥 Commits

Reviewing files that changed from the base of the PR and between bd3f5ee and ed1c94e.

📒 Files selected for processing (7)

crates/sempai/src/normalize.rs
crates/sempai/src/semantic_check.rs
crates/sempai/src/tests/behaviour.rs
crates/sempai/src/tests/engine_tests.rs
crates/sempai/src/tests/semantic_validation_tests.rs
crates/sempai/tests/features/sempai_engine.feature
docs/execplans/4-1-5-normalization-into-canonical-formula-model.md

coderabbitai · 2026-04-24T10:03:35Z

+#[test]
+fn nested_or_in_and_with_not_fails() {
+    let nested_or = Decorated {
+        node: Formula::Or(vec![
+            make_pattern("a"),
+            Decorated {
+                node: Formula::Not(Box::new(make_pattern("b"))),
+                where_clauses: vec![],
+                as_name: None,
+                fix: None,
+                span: None,
+            },
+        ]),
+        where_clauses: vec![],
+        as_name: None,
+        fix: None,
+        span: None,
+    };
+    let formula = Decorated {
+        node: Formula::And(vec![make_pattern("foo"), nested_or]),
+        where_clauses: vec![],
+        as_name: None,
+        fix: None,
+        span: None,
+    };
+    let result = validate_formula(&formula);
+    let err = result.expect_err("should fail validation");
+    let first = err.diagnostics().first().expect("should have diagnostic");
+    assert_eq!(first.code(), DiagnosticCode::ESempaiInvalidNotInOr);
+}


🧹 Nitpick | 🔵 Trivial

Add a test case for deeply nested Not inside Or branch.

This test validates And[foo, Or[a, Not[b]]], but the nested Not is a direct child of the Or. Once the branch_contains_not fix is applied to semantic_check.rs, add a test for Or[And[Atom, Not[...]]] to verify the recursive detection catches negations buried deeper within a branch.

#!/bin/bash # Description: Check if any test exercises Or with And containing Not as a nested child. # Expected: No existing test for Or([And([Atom, Not])]) pattern. rg -n 'Formula::Or.*Formula::And.*Formula::Not' crates/sempai/src/tests/

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/sempai/src/tests/semantic_validation_tests.rs` around lines 151 - 180, Add a new unit test asserting validation fails when a Not is nested inside an And that is itself a branch of an Or (e.g., Or[ And[ Atom, Not[...] ], ... ]). Create a test (e.g., nested_not_in_and_inside_or_fails) that constructs a Decorated node using Formula::Or containing a Decorated Formula::And which includes make_pattern("atom") and a Decorated Formula::Not, call validate_formula(&formula), expect_err, and assert the first diagnostic code equals DiagnosticCode::ESempaiInvalidNotInOr to ensure the recursive branch_contains_not detection catches deeply nested negations.

leynos · 2026-04-24T18:30:20Z

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix. Include the file and symbol names indicated in the issue at the head of your response.

crates/sempai/src/tests/engine_tests.rs

Comment on lines +121 to +134

fn compile_yaml_returns_semantic_error_for_missing_positive_term_in_and() {
    let engine = default_engine();
    let result = engine.compile_yaml(concat!(
        "rules:\n",
        "  - id: demo.missing.positive.term.in.and\n",
        "    message: missing positive term in and\n",
        "    languages: [rust]\n",
        "    severity: ERROR\n",
        "    patterns:\n",
        "      - pattern-not: foo($X)\n",
        "      - pattern-inside: bar($Y)\n",
    ));
    let (code, _diag) = first_diagnostic_of_err(result);
    assert_eq!(code, DiagnosticCode::ESempaiMissingPositiveTermInAnd);

❌ New issue: Code Duplication
The module contains 2 functions with similar structure: compile_yaml_returns_semantic_error_for_invalid_not_in_or,compile_yaml_returns_semantic_error_for_missing_positive_term_in_and

Refactored repeated code in engine_tests.rs by introducing a helper function `assert_compile_yaml_semantic_error`. This DRYs up tests that check for expected semantic error diagnostic codes from YAML compilation. Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>

codescene-delta-analysis

Gates Failed
Enforce advisory code health rules (1 file with Code Duplication)

Gates Passed
5 Quality Gates Passed

See analysis details in CodeScene

Reason for failure

Enforce advisory code health rules	Violations	Code Health Impact
engine_tests.rs	1 advisory rule	10.00 → 9.39	Suppress

Quality Gate Profile: Pay Down Tech Debt
Install CodeScene MCP: safeguard and uplift AI-generated code. Catch issues early with our IDE extension and CLI tool.

codescene-delta-analysis · 2026-04-24T18:35:14Z

+fn compile_yaml_returns_semantic_error_for_missing_positive_term_in_and() {
+    assert_compile_yaml_semantic_error(
+        concat!(
+            "rules:\n",
+            "  - id: demo.missing.positive.term.in.and\n",
+            "    message: missing positive term in and\n",
+            "    languages: [rust]\n",
+            "    severity: ERROR\n",
+            "    patterns:\n",
+            "      - pattern-not: foo($X)\n",
+            "      - pattern-inside: bar($Y)\n",
+        ),
+        DiagnosticCode::ESempaiMissingPositiveTermInAnd,
+    );


❌ New issue: Code Duplication
The module contains 2 functions with similar structure: compile_yaml_returns_semantic_error_for_invalid_not_in_or,compile_yaml_returns_semantic_error_for_missing_positive_term_in_and

_Suppress