Skip to content

Implement normalization into canonical Formula with semantic checks#104

Open
leynos wants to merge 8 commits intomainfrom
implement-normalization-canonical-formula-6op9mx
Open

Implement normalization into canonical Formula with semantic checks#104
leynos wants to merge 8 commits intomainfrom
implement-normalization-canonical-formula-6op9mx

Conversation

@leynos
Copy link
Copy Markdown
Owner

@leynos leynos commented Apr 11, 2026

Summary

  • Introduces a canonical Formula model in sempai-core and a normalization pipeline to unify legacy pattern* syntax and v2 match syntax into a single representation in sempai_core.
  • Adds semantic validation for invalid constructs (e.g. InvalidNotInOr, MissingPositiveTermInAnd).
  • Replaces the NOT_IMPLEMENTED placeholder in Engine::compile_yaml with real normalization, validation, and QueryPlan construction.
  • Aligns tooling, tests, and docs toward a live normalization path across core, YAML parsing, and engine layers.

Changes

Core modelling

  • Add canonical Formula types in sempai-core:
    • Formula enum with variants: Atom, Not, Inside, Anywhere, And, Or.
    • Atom enum: Pattern, Regex, TreeSitterQuery.
    • Lightweight leaf wrappers: PatternAtom, RegexAtom, TreeSitterQueryAtom.
    • Decorated<T> wrapper carrying where_clauses, as_name, fix, and span.
    • WhereClause as an opaque container (initially storing a raw JSON value).
  • Expose and re-export new types from public API (e.g. pub mod formula; and re-export from sempai-core lib).

Normalization layer

  • Introduce normalization entry points to convert both legacy formulas and v2 match formulas into Decorated<Formula>:
    • Legacy normalization maps pattern, pattern-regex, patterns, pattern-either, pattern-inside, etc. to canonical Formula structures.
    • v2 normalization maps match forms (all, any, not, inside, anywhere, etc.) to canonical shapes.
    • Attach WhereClauses to the enclosing Decorated<Formula> for constraints carried over from legacy forms.
    • Propagate Decorated metadata (where, as, fix) during normalization.

Semantic validation

  • Add semantic checks for canonical formulas:
    • InvalidNotInOr: reject negated branches inside disjunctions.
    • MissingPositiveTermInAnd: require at least one positive term in an And.
  • Implement recursive validation across nested formula trees and emit deterministic diagnostics (codes like E_SEMPAI_INVALID_NOT_IN_OR, E_SEMPAI_MISSING_POSITIVE_TERM_IN_AND).

Engine integration

  • Wire normalization into Engine::compile_yaml:
    1. Parse rules as before.
    2. For each rule, normalize its principal into Decorated<Formula>.
    3. Run semantic validation on the normalized formula.
    4. Build and return a QueryPlan containing the normalized formula and rule metadata.
  • Update QueryPlan to store the Decorated<Formula> instead of the prior placeholder.

Testing

  • Add unit tests in:
    • sempai-core for construction and equality of Formula variants.
    • sempai for normalization and semantic validation scenarios.
  • Extend/align BDD tests to cover happy paths, negative paths (invalid-not-in-or, missing-positive-term-in-and), and parity between legacy and v2 fixtures.

Documentation

  • Documentation updates to reflect normalization as the live path:
    • docs/sempai-query-language-design.md (note canonical Formula model and WhereClause handling).
    • docs/users-guide.md (update to reflect compile_yaml returning compiled plans for valid rules).
    • docs/roadmap.md (mark 4.1.5 as done).

Validation and acceptance criteria

  • Engine::compile_yaml returns QueryPlans with normalized Decorated<Formula> for valid rules (no more NOT_IMPLEMENTED).
  • Legacy and v2 fixtures normalize to structurally equivalent Formula values.
  • Semantic diagnostics are emitted for invalid forms (e.g. invalid not-in-or, missing positive term).
  • Tests pass for new normalization and semantic checks; existing tests adapt to new behavior where necessary.
  • Documentation and roadmap reflect the completed milestone.

How to test locally

  • Build docs and run checks:
    • cargo doc -p sempai_core
    • make lint
    • make check-fmt
    • make markdownlint
  • Run tests:
    • cargo test -p sempai_core
    • cargo test -p sempai
    • cargo test -p sempai_yaml
  • Exercise engine path:
    • cargo test -p sempai

Notes on compatibility and scope

  • Public API changes are limited to internal canonical-form representations; the existing sempai_yaml::parse_rule_file signature remains unchanged.
  • The new modules are designed to be incremental (Stage B–D) with Stage E–H focused on wiring, testing, and documentation.
  • New functionality should be covered by targeted tests and BDD scenarios before release.

📎 Task: https://www.devboxer.com/task/795de59f-09d8-463e-9ef9-350b92fa57be

This adds a new normalization step that converts legacy 'pattern*' and v2 'match' syntax into a unified canonical 'Formula' model defined in sempai_core. It includes semantic validation for structural constraints like InvalidNotInOr and MissingPositiveTermInAnd, replacing the previous NOT_IMPLEMENTED stub in Engine::compile_yaml with real normalized query plans. Legacy and v2 rules now produce structurally equivalent formulas, and invalid formulas emit deterministic diagnostics.

Additional changes include new formula-related types and validation logic in sempai_core and wiring normalization into the compile pipeline. Extensive tests and documentation updates accompany this change to ensure correctness and guide users.

No breaking API changes or new external dependencies introduced.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7eb5caf1-b51b-4f68-a175-4300b18f0c6c

📥 Commits

Reviewing files that changed from the base of the PR and between ed1c94e and 5f21759.

📒 Files selected for processing (1)
  • crates/sempai/src/tests/engine_tests.rs

Canonical Formula Model with Normalization and Semantic Validation

This PR implements a canonical Formula model, an infallible normalization pipeline that unifies legacy pattern* and v2 match syntaxes into a single Decorated representation, and deterministic semantic checks. See design/execplan: docs/execplans/4-1-5-normalization-into-canonical-formula-model.md (new).

Core Changes

  • crates/sempai-core

    • New module crates/sempai-core/src/formula.rs exporting the canonical model:
      • Formula enum: Atom, Not, Inside, Anywhere, And, Or.
      • Atom enum: Pattern, Regex, TreeSitterQuery with leaf structs PatternAtom, RegexAtom, TreeSitterQueryAtom.
      • Generic Decorated wrapper (where_clauses, as_name, fix, span).
      • WhereClause as an opaque container holding raw serde_json::Value.
    • crates/sempai-core/src/lib.rs: exposes pub mod formula.
    • crates/sempai-core/Cargo.toml: added serde_json as a regular dependency (it remains in dev-dependencies too).
  • Normalization (crates/sempai/src/normalize.rs)

    • Added pub(crate) fn normalize_search_principal(principal: &SearchQueryPrincipal, rule_span: Option<&SourceSpan>) -> Decorated.
    • Lowers legacy pattern/regex/constraint/negation constructs and v2 match forms into Decorated, attaches legacy patterns constraints as WhereClause on enclosing Decorated nodes, and propagates v2 decoration metadata (where_clauses, as_name, fix, span).
    • Includes helper bare(...) to reduce duplication and a compatibility path mapping r2c-internal-project-depends-on to a degenerate Tree-sitter query atom.
  • Semantic validation (crates/sempai/src/semantic_check.rs)

    • Added pub(crate) fn validate_formula(formula: &Decorated) -> Result<(), DiagnosticReport>.
    • Enforces:
      • E_SEMPAI_INVALID_NOT_IN_OR: disallow Not directly under Or (anchored to offending span when available).
      • E_SEMPAI_MISSING_POSITIVE_TERM_IN_AND: require at least one positive term in And (anchored to branch or And span).
    • Validation is recursive and deterministic.
  • Engine integration (crates/sempai/src/engine.rs)

    • QueryPlan now stores formula: Decorated (replacing prior placeholder).
    • QueryPlan::new updated (crate-visible) to accept the formula; added accessor QueryPlan::formula().
    • Engine::compile_yaml now parses → normalizes → validates → constructs per-language QueryPlan(s) for search principals; unsupported languages produce schema diagnostics. Compatibility-only principals are normalized rather than short-circuited to NOT_IMPLEMENTED.

Tests and Behaviour

  • Added/updated tests:
    • Normalization tests: crates/sempai/src/tests/normalization_tests.rs covering legacy ↔ v2 parity, unary/branch forms, constraint propagation, decoration preservation and span retention.
    • Semantic validation tests: crates/sempai/src/tests/semantic_validation_tests.rs covering acceptance and the two failure modes.
    • Engine and behaviour tests updated to use new QueryPlan constructor/signature and to assert compiled plans for valid search rules.
    • BDD feature updated to assert successful compilation and plan counts for relevant fixtures.
  • Review feedback identified test and implementation duplication; small refactors and helpers (normalize_legacy/normalize_v2, assert_compile_yaml_semantic_error, bare()) were proposed and a test refactor commit was recorded to DRY engine tests.

Documentation

  • New execplan: docs/execplans/4-1-5-normalization-into-canonical-formula-model.md (added).
  • Updated docs/sempai-query-language-design.md and docs/users-guide.md to describe normalization location (crates/sempai/src/normalize.rs), semantic checks (crates/sempai/src/semantic_check.rs), diagnostic codes, and updated compile_yaml behaviour.
  • docs/roadmap.md: milestone 4.1.5 marked done.

Public API Surface and Compatibility

  • Public additions: sempai_core::formula module and types (Formula, Decorated, Atom variants) exported from sempai-core.
  • QueryPlan constructor signature change is crate-visible (no public API break for external crates); QueryPlan::formula() accessor added.
  • Normalization and validation functions remain pub(crate); parse_rule_file signature unchanged.
  • Behavioural change visible to users: Engine::compile_yaml returns Ok(Vec) for valid search rules (previously held a NOT_IMPLEMENTED placeholder path for some compatibility-only rules).

Diagnostics

  • Deterministic diagnostic codes and semantics:
    • E_SEMPAI_INVALID_NOT_IN_OR — Not directly under Or (primary span when available).
    • E_SEMPAI_MISSING_POSITIVE_TERM_IN_AND — And without any positive term (primary span when available).

For Reviewers

Focus review on canonical model correctness, normalization semantics (legacy → canonical and v2 → canonical), semantic-check recursion and positive-term logic and span anchoring, engine wiring in compile_yaml, and test coverage/parity with legacy fixtures. Relevant design/execplan: docs/execplans/4-1-5-normalization-into-canonical-formula-model.md.

Walkthrough

Introduce a public canonical Formula AST and Decorated wrapper; add normalization from legacy pattern* and v2 match principals into Decorated<Formula>; add semantic validation; and wire normalization/validation into Engine::compile_yaml so it returns per-language QueryPlan(s) instead of the previous NOT_IMPLEMENTED path. (50 words)

Changes

Cohort / File(s) Summary
Core Formula Model
crates/sempai-core/Cargo.toml, crates/sempai-core/src/formula.rs, crates/sempai-core/src/lib.rs
Add serde_json dependency; introduce public canonical Formula AST, Atom variants (Pattern/Regex/TreeSitterQuery), leaf atom structs, Decorated<T> wrapper with metadata, and WhereClause; expose formula module publicly.
Normalisation Module
crates/sempai/src/normalize.rs
Add normalize_search_principal to lower legacy pattern* and v2 match principals into Decorated<Formula>, map combinators/unaries, propagate where-clauses and metadata, and represent ProjectDependsOn as a degenerate Tree-sitter atom.
Semantic Validation
crates/sempai/src/semantic_check.rs
Add validate_formula enforcing: disallow Not directly in an Or (ESempaiInvalidNotInOr) and require at least one positive term per And (ESempaiMissingPositiveTermInAnd); produce diagnostics anchored to spans.
Engine Integration
crates/sempai/src/engine.rs, crates/sempai/src/lib.rs
Change QueryPlan to store Decorated<Formula>; update constructor and add formula() accessor; implement Engine::compile_yaml to parse YAML, normalise search principals, validate formulas and build per-language QueryPlans; unsupported languages produce schema diagnostics instead of NOT_IMPLEMENTED.
Tests
crates/sempai/src/tests/...
crates/sempai/src/tests/normalization_tests.rs, crates/sempai/src/tests/semantic_validation_tests.rs, crates/sempai/src/tests/engine_tests.rs, crates/sempai/src/tests/behaviour.rs, crates/sempai/src/tests/mod.rs
Add comprehensive normalisation and semantic-validation tests; update behaviour and engine tests to expect successful compilation and to construct QueryPlan::new with a Decorated<Formula> dummy where needed; assert new diagnostic codes and plan counts.
Documentation & Roadmap
docs/execplans/4-1-5-normalization-into-canonical-formula-model.md, docs/roadmap.md, docs/sempai-query-language-design.md, docs/users-guide.md
Add execplan for milestone 4.1.5 and mark it complete; document where normalisation and validation live, list validation error codes, describe r2c-internal-project-depends-on normalisation, and update user guide to reflect compile_yaml returning QueryPlan(s).

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant Engine as Engine
    participant Parser as Parser
    participant Normaliser as Normaliser
    participant Validator as Validator
    participant Compiler as Compiler
    participant QueryPlan as QueryPlan

    Client->>Engine: compile_yaml(yaml)
    Engine->>Parser: parse YAML -> Vec<Rule>
    Parser-->>Engine: rules with SearchQueryPrincipal

    loop For each search-mode Rule
        Engine->>Normaliser: normalize_search_principal(principal, span)
        Normaliser-->>Engine: Decorated<Formula>

        Engine->>Validator: validate_formula(Decorated<Formula>)
        alt Validation passes
            Validator-->>Engine: Ok(())
            Engine->>Compiler: compile_rule_plans(rule, formula)
            Compiler->>QueryPlan: new(rule_id, language, formula)
            QueryPlan-->>Compiler: QueryPlan instance
            Compiler-->>Engine: Vec(QueryPlan)
        else Validation fails
            Validator-->>Engine: Err(DiagnosticReport)
            Engine-->>Client: Err(DiagnosticReport)
        end
    end

    Engine-->>Client: Ok(Vec(QueryPlan))
Loading

Possibly related PRs

Poem

🧩 Legacy patterns and v2 match weave tight,
Normalise and validate, send plans into flight,
Ban NOT inside OR, and demand a positive spark,
Engine compiles rules and lights up the dark,
Pipelines hum steady — the formula's alight! ✨

🚥 Pre-merge checks | ✅ 7 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Title check ⚠️ Warning Title aligns with the main change (normalization into canonical Formula with semantic validation) but omits the roadmap reference (4.1.5) required by the title requirements. Revise title to include roadmap reference: 'Implement normalization into canonical Formula with semantic checks (4.1.5)'
Developer Documentation ⚠️ Warning No developers-guide.md exists to document new internal APIs (Formula, normalize.rs, semantic_check.rs), architectural boundaries, or the normalisation pipeline. Create docs/developers-guide.md documenting the canonical Formula model, normalisation pipeline, semantic validation logic, module responsibilities, public API boundaries, and integration within the compilation flow.
✅ Passed checks (7 passed)
Check name Status Explanation
Description check ✅ Passed Description thoroughly documents the summary, changes, validation criteria, and testing instructions, directly relating to the canonical Formula model and normalization pipeline introduced.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Testing ✅ Passed All new functionality is comprehensively covered by unit tests (16 normalisation tests, 7 validation tests), integration tests, and 9 BDD scenarios exercising legacy/v2 syntax variants, constraint handling, and end-to-end compilation workflows.
User-Facing Documentation ✅ Passed The docs/users-guide.md file comprehensively documents all new and changed user-facing functionality including the updated compile_yaml() behaviour, semantic validation, and QueryPlan returns.
Module-Level Documentation ✅ Passed All three new modules contain comprehensive module-level docstrings with overviews and examples. All publicly exported types and functions are fully documented with descriptions and error details.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch implement-normalization-canonical-formula-6op9mx

Comment @coderabbitai help to get the list of available commands and usage tips.

codescene-delta-analysis[bot]

This comment was marked as outdated.

@sourcery-ai
Copy link
Copy Markdown

sourcery-ai Bot commented Apr 11, 2026

Reviewer's Guide

Implements a canonical Formula model in sempai-core and wires a full normalization + semantic validation pipeline from YAML search rules through to Engine::compile_yaml, replacing the NOT_IMPLEMENTED placeholder with real QueryPlan construction and updating tests and docs accordingly.

Sequence diagram for Engine::compile_yaml normalization and planning

sequenceDiagram
    actor User
    participant Engine
    participant YamlParser as sempai_yaml_parse_rule_file
    participant ModeValidator as validate_supported_modes
    participant Normalizer as normalize_search_principal
    participant Semantic as validate_formula
    participant Planner as compile_rule_plans

    User->>Engine: compile_yaml(yaml)
    Engine->>YamlParser: parse_rule_file(yaml, None)
    YamlParser-->>Engine: RuleFile or DiagnosticReport
    alt parse_error
        Engine-->>User: Err(DiagnosticReport)
    else parse_ok
        Engine->>ModeValidator: validate_supported_modes(file)
        ModeValidator-->>Engine: Ok or DiagnosticReport
        alt mode_error
            Engine-->>User: Err(DiagnosticReport)
        else modes_ok
            loop for each Rule in file.rules()
                Engine->>Engine: principal = rule.principal()
                alt principal_is_search
                    Engine->>Normalizer: normalize_search_principal(principal, rule.rule_span())
                    Normalizer-->>Engine: Decorated_Formula or DiagnosticReport
                    alt normalization_error
                        Engine-->>User: Err(DiagnosticReport)
                    else normalized
                        Engine->>Semantic: validate_formula(&formula)
                        Semantic-->>Engine: Ok or DiagnosticReport
                        alt semantic_error
                            Engine-->>User: Err(DiagnosticReport)
                        else validated
                            Engine->>Planner: compile_rule_plans(rule, &formula)
                            Planner-->>Engine: List~QueryPlan~ or DiagnosticReport
                            alt plan_error
                                Engine-->>User: Err(DiagnosticReport)
                            else plans_ok
                                Engine->>Engine: append plans to result
                            end
                        end
                    end
                else non_search_principal
                    Engine->>Engine: skip or unsupported mode handled earlier
                end
            end
            Engine-->>User: Ok(List~QueryPlan~)
        end
    end
Loading

Class diagram for canonical Formula model and decorators

classDiagram
    class Formula {
        +Atom atom
        +Not not
        +Inside inside
        +Anywhere anywhere
        +And and
        +Or or
    }

    class Atom {
        +Pattern pattern
        +Regex regex
        +TreeSitterQuery tree_sitter_query
    }

    class PatternAtom {
        +String text
    }

    class RegexAtom {
        +String pattern
    }

    class TreeSitterQueryAtom {
        +String query
    }

    class Decorated_Formula {
        +Formula node
        +List~WhereClause~ where_clauses
        +Option~String~ as_name
        +Option~String~ fix
        +Option~SourceSpan~ span
    }

    class WhereClause {
        +Value raw
    }

    class SourceSpan

    Formula --> Atom : uses
    Atom --> PatternAtom : has
    Atom --> RegexAtom : has
    Atom --> TreeSitterQueryAtom : has
    Decorated_Formula --> Formula : wraps
    Decorated_Formula --> WhereClause : contains
    Decorated_Formula --> SourceSpan : optional
Loading

Class diagram for Engine normalization and QueryPlan

classDiagram
    class Engine {
        -EngineConfig config
        +compile_yaml(yaml : &str) Result~List~QueryPlan~~, DiagnosticReport~
    }

    class QueryPlan {
        -String rule_id
        -Language language
        -Decorated_Formula formula
        +new(rule_id : String, language : Language, formula : Decorated_Formula) QueryPlan
        +rule_id() String
        +language() Language
        +formula() &Decorated_Formula
    }

    class Decorated_Formula {
        +Formula node
        +List~WhereClause~ where_clauses
        +Option~String~ as_name
        +Option~String~ fix
        +Option~SourceSpan~ span
    }

    class Formula
    class WhereClause
    class SourceSpan
    class EngineConfig
    class DiagnosticReport
    class Language
    class Rule {
        +id() &str
        +languages() &List~String~
        +rule_span() Option~SourceSpan~
        +principal() &RulePrincipal
    }

    class RulePrincipal {
        +Search search
        +Other other
    }

    class SearchQueryPrincipal

    class NormalizeModule {
        +normalize_search_principal(principal : &SearchQueryPrincipal, rule_span : Option~&SourceSpan~) Result~Decorated_Formula, DiagnosticReport~
    }

    class SemanticCheckModule {
        +validate_formula(formula : &Decorated_Formula) Result~(), DiagnosticReport~
    }

    Engine --> EngineConfig : has
    Engine ..> QueryPlan : creates
    Engine ..> Rule : uses
    Engine ..> NormalizeModule : calls
    Engine ..> SemanticCheckModule : calls
    Engine ..> DiagnosticReport : returns

    QueryPlan --> Language : has
    QueryPlan --> Decorated_Formula : has

    Rule --> RulePrincipal : has
    RulePrincipal --> SearchQueryPrincipal : may_wrap

    NormalizeModule ..> Decorated_Formula : returns
    NormalizeModule ..> DiagnosticReport : error

    SemanticCheckModule ..> DiagnosticReport : error
Loading

File-Level Changes

Change Details Files
Introduce canonical Formula AST and decoration model in sempai-core for use by normalization and engine layers.
  • Add formula.rs defining Formula, Atom, PatternAtom, RegexAtom, TreeSitterQueryAtom, Decorated, and WhereClause types with serde_json-backed where clauses and optional spans.
  • Expose the new formula module from sempai-core by adding a public mod and re-exporting it.
  • Add serde_json as a dependency of sempai-core for WhereClause raw JSON storage.
crates/sempai-core/src/formula.rs
crates/sempai-core/src/lib.rs
crates/sempai-core/Cargo.toml
Implement normalization from legacy and v2 search principals into Decorated and semantic validation over the canonical formula tree, then wire this into Engine::compile_yaml and QueryPlan.
  • Add normalize.rs implementing normalize_search_principal plus helpers to map LegacyFormula/MatchFormula/ProjectDependsOn into Decorated, including handling of patterns, regexes, boolean combinators, context operators, constraints as WhereClause, and a degenerate dependency formula.
  • Add semantic_check.rs with validate_formula enforcing InvalidNotInOr and MissingPositiveTermInAnd via recursive checks that emit structured DiagnosticReports with the appropriate DiagnosticCode and spans.
  • Refactor Engine::compile_yaml to parse rules, validate modes, normalize each Search principal to Decorated, run validate_formula, and then build per-language QueryPlans via a new compile_rule_plans helper; replace the previous NOT_IMPLEMENTED diagnostic return path.
  • Change QueryPlan to hold a Decorated instead of the unit _plan placeholder, add a formula() accessor, and update its constructor signature and all call sites (including tests and BDD world) to pass a dummy formula where needed.
  • Import the new normalization and semantic validation modules from lib.rs so they are available within the sempai crate.
crates/sempai/src/normalize.rs
crates/sempai/src/semantic_check.rs
crates/sempai/src/engine.rs
crates/sempai/src/lib.rs
crates/sempai/src/tests/behaviour.rs
crates/sempai/src/tests/engine_tests.rs
Add focused unit tests for normalization and semantic validation behaviour over the canonical Formula model.
  • Create normalization_tests.rs to assert that various LegacyFormula and MatchFormula inputs normalize into the expected Formula shapes (Atoms, And/Or, Not/Inside/Anywhere, compound legacy operators like pattern-not-inside and pattern-not-regex).
  • Create semantic_validation_tests.rs to exercise validate_formula success and failure paths, including Or with Not branches, And without positive terms, and nested Or-in-And cases.
  • Wire the new test modules into the sempai test suite via mod declarations in tests/mod.rs and adjust existing engine tests to use dummy Decorated instances when constructing QueryPlan.
  • Add helper functions in tests (e.g., dummy_formula) to construct minimal Decorated required by the new QueryPlan API.
crates/sempai/src/tests/normalization_tests.rs
crates/sempai/src/tests/semantic_validation_tests.rs
crates/sempai/src/tests/mod.rs
crates/sempai/src/tests/engine_tests.rs
Update documentation and roadmap to describe the live normalization path, semantic checks, and completed milestone, and add an execution plan document for this work.
  • Revise the users guide to describe compile_yaml as producing canonical Formula-based QueryPlans for valid search rules, detailing semantic diagnostics and behaviour for unsupported modes and project-depends-on rules.
  • Add an implementation note to sempai-query-language-design.md explaining where normalization and semantic validation live in the codebase, how project-depends-on is represented, and how WhereClause is currently handled as opaque JSON.
  • Mark roadmap milestone 4.1.5 as completed and add a new execplans/4-1-5-normalization-into-canonical-formula-model.md document laying out constraints, risks, stages, interfaces, and validation criteria for this normalization work.
  • Ensure references in docs to NOT_IMPLEMENTED placeholders are updated to reflect the new behaviour and error modes.
docs/users-guide.md
docs/sempai-query-language-design.md
docs/roadmap.md
docs/execplans/4-1-5-normalization-into-canonical-formula-model.md

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

…model

- Introduce a canonical `Formula` enum in sempai-core for normalized queries.
- Implement normalization from legacy (`pattern*`) and v2 (`match`) syntaxes into the shared formula model.
- Add semantic validation for formulas enforcing constraints on `Or` and `And` compositions.
- Update engine to compile normalized formulas into query plans per language.
- Normalize compatibility-only `r2c-internal-project-depends-on` rules into a degenerate formula.
- Enhance diagnostics with semantic validation errors and maintain accurate error spans.
- Add extensive tests covering normalization and semantic validation.
- Update documentation to reflect the new normalization and validation implementation.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
@leynos leynos changed the title Normalize legacy and v2 rules into canonical Formula model Plan normalization into canonical Formula model Apr 11, 2026
codescene-delta-analysis[bot]

This comment was marked as outdated.

@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Apr 13, 2026

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix:

crates/sempai/src/tests/normalization_tests.rs

Comment on file

fn legacy_pattern_normalizes_to_atom() {
    let legacy = LegacyFormula::Pattern(String::from("foo($X)"));
    let principal = SearchQueryPrincipal::Legacy(legacy);
    let result = normalize_search_principal(&principal, None).expect("should normalize");
    assert_eq!(
        result.node,
        Formula::Atom(Atom::Pattern(PatternAtom {
            text: String::from("foo($X)")
        }))
    );
}

❌ New issue: Code Duplication
The module contains 6 functions with similar structure: legacy_pattern_normalizes_to_atom,legacy_pattern_regex_normalizes_to_regex_atom,v2_all_normalizes_to_and,v2_any_normalizes_to_or and 2 more functions

@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Apr 13, 2026

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix:

crates/sempai/src/normalize.rs

Comment on file

fn normalize_match(
    formula: &MatchFormula,
    fallback_span: Option<SourceSpan>,
) -> Decorated<Formula> {
    match formula {
        MatchFormula::Pattern(text) | MatchFormula::PatternObject(text) => Decorated {
            node: Formula::Atom(Atom::Pattern(PatternAtom { text: text.clone() })),
            where_clauses: vec![],
            as_name: None,
            fix: None,
            span: fallback_span,
        },
        MatchFormula::Regex(pattern) => Decorated {
            node: Formula::Atom(Atom::Regex(RegexAtom {
                pattern: pattern.clone(),
            })),
            where_clauses: vec![],
            as_name: None,
            fix: None,
            span: fallback_span,
        },
        MatchFormula::All(branches) => {
            let normalized_branches = branches
                .iter()
                .map(|branch| normalize_match(branch, fallback_span.clone()))
                .collect();
            Decorated {
                node: Formula::And(normalized_branches),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        MatchFormula::Any(branches) => {
            let normalized_branches = branches
                .iter()
                .map(|branch| normalize_match(branch, fallback_span.clone()))
                .collect();
            Decorated {
                node: Formula::Or(normalized_branches),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        MatchFormula::Not(inner) => {
            let normalized_inner = normalize_match(inner, fallback_span.clone());
            Decorated {
                node: Formula::Not(Box::new(normalized_inner)),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        MatchFormula::Inside(inner) => {
            let normalized_inner = normalize_match(inner, fallback_span.clone());
            Decorated {
                node: Formula::Inside(Box::new(normalized_inner)),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        MatchFormula::Anywhere(inner) => {
            let normalized_inner = normalize_match(inner, fallback_span.clone());
            Decorated {
                node: Formula::Anywhere(Box::new(normalized_inner)),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        MatchFormula::Decorated {
            formula: inner_formula,
            where_clauses: raw_where,
            as_name,
            fix,
        } => {
            let mut normalized = normalize_match(inner_formula, fallback_span);
            normalized.where_clauses = raw_where
                .iter()
                .map(|raw| WhereClause { raw: raw.clone() })
                .collect();
            normalized.as_name.clone_from(as_name);
            normalized.fix.clone_from(fix);
            normalized
        }
    }
}

❌ New issue: Large Method
normalize_match has 94 lines, threshold = 70

@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Apr 13, 2026

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix:

crates/sempai/src/normalize.rs

Comment on file

fn normalize_legacy(
    formula: &LegacyFormula,
    fallback_span: Option<SourceSpan>,
) -> Decorated<Formula> {
    match formula {
        LegacyFormula::Pattern(text) => Decorated {
            node: Formula::Atom(Atom::Pattern(PatternAtom { text: text.clone() })),
            where_clauses: vec![],
            as_name: None,
            fix: None,
            span: fallback_span,
        },
        LegacyFormula::PatternRegex(pattern) => Decorated {
            node: Formula::Atom(Atom::Regex(RegexAtom {
                pattern: pattern.clone(),
            })),
            where_clauses: vec![],
            as_name: None,
            fix: None,
            span: fallback_span,
        },
        LegacyFormula::Patterns(clauses) => normalize_legacy_patterns(clauses, fallback_span),
        LegacyFormula::PatternEither(branches) => {
            let normalized_branches = branches
                .iter()
                .map(|branch| normalize_legacy(branch, fallback_span.clone()))
                .collect();
            Decorated {
                node: Formula::Or(normalized_branches),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        LegacyFormula::PatternNot(value) => {
            let inner = normalize_legacy_value(value, fallback_span.clone());
            Decorated {
                node: Formula::Not(Box::new(inner)),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        LegacyFormula::PatternInside(value) => {
            let inner = normalize_legacy_value(value, fallback_span.clone());
            Decorated {
                node: Formula::Inside(Box::new(inner)),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        LegacyFormula::PatternNotInside(value) => {
            let inner = normalize_legacy_value(value, fallback_span.clone());
            let inside = Decorated {
                node: Formula::Inside(Box::new(inner)),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span.clone(),
            };
            Decorated {
                node: Formula::Not(Box::new(inside)),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        LegacyFormula::PatternNotRegex(pattern) => {
            let regex_atom = Decorated {
                node: Formula::Atom(Atom::Regex(RegexAtom {
                    pattern: pattern.clone(),
                })),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span.clone(),
            };
            Decorated {
                node: Formula::Not(Box::new(regex_atom)),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
        LegacyFormula::Anywhere(value) => {
            let inner = normalize_legacy_value(value, fallback_span.clone());
            Decorated {
                node: Formula::Anywhere(Box::new(inner)),
                where_clauses: vec![],
                as_name: None,
                fix: None,
                span: fallback_span,
            }
        }
    }
}

❌ New issue: Large Method
normalize_legacy has 102 lines, threshold = 70

@coderabbitai

This comment was marked as resolved.

@coderabbitai

This comment was marked as resolved.

@coderabbitai

This comment was marked as resolved.

leynos and others added 2 commits April 17, 2026 21:34
…n length

Add a private `bare()` helper function to construct Decorated nodes without
metadata, replacing repeated struct literals throughout normalize.rs. This
reduces `normalize_legacy` from 102 to 51 lines and `normalize_match` from
94 to 58 lines, bringing both under the 70-line clippy threshold.

Changes:
- Add `bare()` helper function with clear documentation
- Replace all inline Decorated struct literals in normalize_legacy
- Replace all inline Decorated struct literals in normalize_match
- Apply bare() to normalize_legacy_value and normalize_dependency_principal
- Remove #[expect(clippy::too_many_lines)] attributes no longer needed
- Add #[expect(clippy::missing_const_for_fn)] to bare() (Vec requires runtime)
- Fix import ordering in engine.rs per cargo fmt

No logic changes. All tests pass.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Introduce helper functions to reduce repetition and improve code quality:

normalize.rs changes:
- Add normalize_unary() helper for Not/Inside/Anywhere arms
- Add normalize_branches() helper for All/Any arms
- Refactor normalize_match to use helpers, reducing from 58 to 43 lines
- All helper functions properly handle Option<SourceSpan> ergonomics

normalization_tests.rs changes:
- Add normalize_legacy() and normalize_v2() test helpers
- Eliminate repeated 3-line boilerplate from all 11 tests
- Replace verbose match/panic! assertions with concise matches! macro
- Reduce file from 160 to 141 lines

No logic changes. All 33 tests pass. Clippy and formatting checks pass.
Addresses CodeScene Large Method and code duplication violations.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@leynos leynos changed the title Plan normalization into canonical Formula model Normalize legacy and v2 into canonical Formula with semantic checks Apr 17, 2026
codescene-delta-analysis[bot]

This comment was marked as outdated.

@leynos leynos marked this pull request as ready for review April 22, 2026 08:55
Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 9 issues, and left some high level feedback:

  • In semantic_check.rs, the MissingPositiveTermInAnd diagnostic always uses the ambient span rather than the most specific available branch span, which makes error locations less precise than they could be; consider preferring the And node’s own span or one of its children’s spans when available.
  • In normalize_search_principal and the normalization helpers you currently never return an error but still wrap results in Result with #[expect(clippy::unnecessary_wraps)]; it would be cleaner either to introduce at least one real normalization error path now (e.g., for obviously malformed payloads) or to keep the API simple and return the bare Decorated<Formula> until you actually need fallible normalization.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `semantic_check.rs`, the `MissingPositiveTermInAnd` diagnostic always uses the ambient `span` rather than the most specific available branch span, which makes error locations less precise than they could be; consider preferring the `And` node’s own span or one of its children’s spans when available.
- In `normalize_search_principal` and the normalization helpers you currently never return an error but still wrap results in `Result` with `#[expect(clippy::unnecessary_wraps)]`; it would be cleaner either to introduce at least one real normalization error path now (e.g., for obviously malformed payloads) or to keep the API simple and return the bare `Decorated<Formula>` until you actually need fallible normalization.

## Individual Comments

### Comment 1
<location path="crates/sempai-core/src/formula.rs" line_range="40-49" />
<code_context>
+///     raw: json!({"metavariable-regex": {"metavariable": "$X", "regex": "foo.*"}}),
+/// };
+/// ```
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct WhereClause {
+    /// The raw JSON value of the constraint.
+    pub raw: Value,
</code_context>
<issue_to_address>
**issue (bug_risk):** Deriving `Eq` for `WhereClause` will fail because `serde_json::Value` does not implement `Eq`.

Because `serde_json::Value` only implements `PartialEq`, `#[derive(Eq)]` on `WhereClause` will not compile. Please either remove `Eq` (and any `WhereClause: Eq` bounds such as on `Decorated<T>`) or switch `raw` to a type that can soundly implement `Eq`.
</issue_to_address>

### Comment 2
<location path="crates/sempai/src/tests/engine_tests.rs" line_range="68" />
<code_context>

 #[test]
-fn compile_yaml_returns_not_implemented_after_successful_parse() {
+fn compile_yaml_normalizes_and_returns_query_plans() {
     let engine = default_engine();
     let result = engine.compile_yaml(
</code_context>
<issue_to_address>
**suggestion (testing):** Add tests that cover semantic validation errors surfaced through `compile_yaml` (e.g. invalid-not-in-or, missing-positive-term-in-and).

Current `compile_yaml_*` tests only cover successful cases. Since `compile_yaml` now calls `validate_formula`, please add tests that build YAML rules which normalize to:

- an `Or` with a negated branch, and
- an `And` with only constraint-style terms (`not`, `inside`, `anywhere`) and no positive term.

These should assert that `compile_yaml` returns an `Err`, and that the first diagnostic codes are `ESempaiInvalidNotInOr` and `ESempaiMissingPositiveTermInAnd`, to confirm semantic errors are exposed via the public API.

Suggested implementation:

```rust
#[test]
fn compile_yaml_normalizes_and_returns_query_plans() {
    let engine = default_engine();
    let result = engine.compile_yaml(
        "rules:\n  - id: demo.rule\n    message: oops\n    languages: [rust]\n    severity: ERROR\n    pattern: foo($X)\n",
    );
    let plans = result.expect("should successfully compile");
    assert_eq!(plans.len(), 1);
    let plan = plans.first().expect("should have first plan");
    assert_eq!(plan.rule_id(), "demo.rule");
    assert_eq!(plan.language(), Language::Rust);
}

#[test]
fn compile_yaml_returns_semantic_error_for_invalid_not_in_or() {
    let engine = default_engine();
    let result = engine.compile_yaml(
        "\
rules:
  - id: demo.invalid.not.in.or
    message: invalid not in or
    languages: [rust]
    severity: ERROR
    pattern: not foo($X) or bar($Y)
",
    );

    let error = result.expect_err("compile_yaml should surface semantic validation errors");
    let diagnostics = error.diagnostics();
    let first = diagnostics
        .first()
        .expect("semantic error should contain at least one diagnostic");
    assert_eq!(first.code(), "ESempaiInvalidNotInOr");
}

#[test]
fn compile_yaml_returns_semantic_error_for_missing_positive_term_in_and() {
    let engine = default_engine();
    let result = engine.compile_yaml(
        "\
rules:
  - id: demo.missing.positive.term.in.and
    message: missing positive term in and
    languages: [rust]
    severity: ERROR
    pattern: not foo($X) and inside(bar($Y))
",
    );

    let error = result.expect_err("compile_yaml should surface semantic validation errors");
    let diagnostics = error.diagnostics();
    let first = diagnostics
        .first()
        .expect("semantic error should contain at least one diagnostic");
    assert_eq!(first.code(), "ESempaiMissingPositiveTermInAnd");
}

```

To integrate this with the existing codebase you may need to:
1. Adjust how diagnostics are accessed from the error returned by `compile_yaml`. For example, replace `error.diagnostics()` and `first.code()` with the actual methods/fields used in your error/diagnostic types (e.g. `error.diagnostics`, `error.diagnostics()[0].code`, or `first.code.to_string()`).
2. Confirm the exact string representation or enum formatting of the diagnostic codes. If they are enums instead of strings, change the assertions to match the enum variant (e.g. `assert_eq!(first.code, DiagnosticCode::ESempaiInvalidNotInOr)`).
3. If your pattern language requires a different syntax to produce an `Or` with a negated branch or an `And` with only constraint-style terms, tweak the `pattern:` strings in the YAML to align with how `validate_formula` actually interprets formulas in your project.
</issue_to_address>

### Comment 3
<location path="crates/sempai/src/tests/normalization_tests.rs" line_range="9" />
<code_context>
+use crate::normalize::normalize_search_principal;
+
+/// Helper to normalize a legacy formula and extract the node.
+fn normalize_legacy(formula: LegacyFormula) -> Formula {
+    let principal = SearchQueryPrincipal::Legacy(formula);
+    normalize_search_principal(&principal, None)
</code_context>
<issue_to_address>
**suggestion (testing):** Add normalization tests for `patterns` (legacy `LegacyFormula::Patterns`) including constraint mapping into `where_clauses`.

Existing legacy normalization tests only cover simple `pattern` / `pattern-regex` and some negated/inside variants; they don’t exercise `LegacyFormula::Patterns`, where constraints are attached as `where_clauses` on the enclosing `And`. Please add a test that builds a `LegacyFormula::Patterns` with both formula clauses and one or more constraint clauses, and asserts that:

- the normalized node is a `Formula::And` with the expected child formulas, and
- the `Decorated` wrapper holds the expected `where_clauses` while all child formulas have empty `where_clauses`.

This will verify the constraint propagation logic in `normalize_legacy_patterns`.

Suggested implementation:

```rust
use sempai_core::formula::{Atom, Decorated, Formula, PatternAtom, RegexAtom};

```

```rust
/// Helper to normalize a legacy formula and extract the node.
fn normalize_legacy(formula: LegacyFormula) -> Formula {
    let principal = SearchQueryPrincipal::Legacy(formula);
    normalize_search_principal(&principal, None)
        .expect("should normalize")
        .node
}

/// Helper to normalize a legacy formula and keep the decorated wrapper.
///
/// This is useful for tests that need to inspect `where_clauses` or other
/// metadata attached to the normalized formula.
fn normalize_legacy_decorated(formula: LegacyFormula) -> Decorated<Formula> {
    let principal = SearchQueryPrincipal::Legacy(formula);
    normalize_search_principal(&principal, None).expect("should normalize")
}

/// Helper to normalize a v2 match formula and extract the node.
fn normalize_v2(formula: MatchFormula) -> Formula {
    let principal = SearchQueryPrincipal::Match(formula);
    normalize_search_principal(&principal, None)
        .expect("should normalize")
        .node
}

/// Normalization of `LegacyFormula::Patterns` should:
/// - produce an outer `Formula::And` node combining the pattern formulas, and
/// - attach any constraints as `where_clauses` on the outer `Decorated` node,
///   while all child formulas have empty `where_clauses`.
#[test]
fn normalize_legacy_patterns_propagates_constraints_to_where_clauses() {
    // NOTE: The exact constructors for `LegacyFormula::Pattern` / constraint
    // variants should match the rest of the tests in this module. If the
    // field names differ, adjust accordingly.
    //
    // Two simple pattern formulas:
    let pattern_formula_1 = LegacyFormula::Pattern {
        // Example pattern string; align with existing tests if they use
        // different patterns or additional fields.
        pattern: "foo($x)".to_string(),
        // Fill in any additional fields that the real `Pattern` variant
        // expects (e.g. language, is_regexp, etc.).
        ..Default::default()
    };

    let pattern_formula_2 = LegacyFormula::Pattern {
        pattern: "bar($x)".to_string(),
        ..Default::default()
    };

    // One or more constraint clauses that should end up as `where_clauses`
    // on the enclosing `And` in the normalized representation.
    //
    // This assumes there is some constraint-like legacy variant; adjust the
    // variant / fields to whatever is used elsewhere in the codebase for
    // capture constraints.
    let constraint_formula = LegacyFormula::Constraint {
        // Example: `$x` is constrained to some value.
        // Replace these fields with the actual ones from the enum.
        variable: "x".to_string(),
        value: LegacyValue::String("constrained".to_string()),
    };

    let legacy = LegacyFormula::Patterns {
        // These are the pattern formulas that should become the children
        // of the normalized `Formula::And`.
        patterns: vec![pattern_formula_1, pattern_formula_2],
        // These are the constraints that `normalize_legacy_patterns` is
        // expected to map into `where_clauses` on the enclosing `And`.
        constraints: vec![constraint_formula],
    };

    let decorated = normalize_legacy_decorated(legacy);

    // Outer node must be an `And` combining the pattern formulas.
    let children = match &decorated.node {
        Formula::And(children) => children,
        other => panic!("expected normalized legacy Patterns to be And, got {:?}", other),
    };

    // We expect exactly the two pattern children we provided above.
    assert_eq!(
        children.len(),
        2,
        "expected two child formulas in normalized And for LegacyFormula::Patterns"
    );

    // All constraints should be attached as `where_clauses` on the outer
    // decorated node.
    assert!(
        !decorated.where_clauses.is_empty(),
        "expected where_clauses on outer Decorated<Formula> for LegacyFormula::Patterns constraints"
    );

    // Child formulas themselves must not carry any where_clauses; they should
    // all be attached to the enclosing `And`.
    for (idx, child) in children.iter().enumerate() {
        assert!(
            child.where_clauses.is_empty(),
            "expected child formula {} of Patterns And to have empty where_clauses",
            idx
        );
    }
}

```

The above test assumes specific shapes for the `LegacyFormula` variants and `Decorated` type that may differ slightly from your actual implementation. To integrate cleanly you should:

1. **Adjust the `LegacyFormula` constructors**:
   - Replace `LegacyFormula::Pattern { pattern, ..Default::default() }` with the exact syntax used in your existing legacy normalization tests (same fields, same variant name).
   - Replace `LegacyFormula::Constraint { variable, value }` with the actual constraint variant used for legacy pattern constraints (for example it might be something like `LegacyFormula::Where`, `LegacyFormula::CaptureConstraint`, or similar).
   - Update the `LegacyFormula::Patterns { patterns, constraints }` fields (`patterns`, `constraints`) to match the real field names for the `Patterns` variant.

2. **Update the `Decorated` type usage**:
   - If `Decorated` lives in a different module or is re-exported under a different path, update the import (`use sempai_core::formula::{Atom, Decorated, Formula, PatternAtom, RegexAtom};`) accordingly.
   - If the fields on `Decorated` are named differently (e.g. `node` vs `formula`, `where_clauses` vs `constraints`), adjust the helper `normalize_legacy_decorated` return type and the test field accesses to match.

3. **Ensure `Default::default()` is valid**:
   - If the `LegacyFormula::Pattern` variant does not support the struct-update `..Default::default()` (e.g. it’s not `#[derive(Default)]` or doesn’t use struct syntax), replace those with complete explicit construction as used in the existing tests in this file.

4. **Align where-clause type checks**:
   - If `where_clauses` is not a `Vec<_>` or has a different accessor, update the `assert!(!decorated.where_clauses.is_empty())` and child `where_clauses` assertions to match the real API.
</issue_to_address>

### Comment 4
<location path="crates/sempai/src/tests/normalization_tests.rs" line_range="80" />
<code_context>
+}
+
+#[test]
+fn v2_pattern_shorthand_normalizes_to_atom() {
+    let result = normalize_v2(MatchFormula::Pattern(String::from("bar($Y)")));
+    assert_eq!(
</code_context>
<issue_to_address>
**suggestion (testing):** Extend v2 normalization tests to cover `MatchFormula::Decorated` metadata (where/as/fix) and span propagation.

Current v2 tests only cover the structural mapping for `pattern`/`regex`/`all`/`any`/`not`/`inside`/`anywhere`, not the `MatchFormula::Decorated { .. }` variant. Since `normalize_match` maps `where` into `where_clauses` and copies `as_name`/`fix` onto `Decorated`, please add a test that builds a decorated formula with non-empty `where`/`as`/`fix` and asserts those fields are preserved. It would also be helpful to add a test that calls `normalize_search_principal` with a non-`None` `rule_span` and verifies the resulting `span` is set, to cover span propagation through normalization.

Suggested implementation:

```rust
#[test]
fn v2_decorated_metadata_is_preserved() {
    // Build a decorated match formula with non-empty where/as/fix metadata
    let inner = MatchFormula::Pattern(String::from("foo($X)"));
    let mf = MatchFormula::Decorated {
        formula: Box::new(inner),
        // Use a minimal where-clause that your existing types support
        // (e.g. WhereClause::Raw, WhereClause::Expr, etc.)
        where_clauses: vec![WhereClause::Raw(String::from("foo == bar"))],
        as_name: Some(String::from("my_capture")),
        fix: Some(String::from("apply_fix")),
        span: None,
    };

    let result = normalize_v2(mf);

    match result {
        Formula::Decorated(decorated) => {
            assert_eq!(decorated.as_name.as_deref(), Some("my_capture"));
            assert_eq!(decorated.fix.as_deref(), Some("apply_fix"));
            assert_eq!(decorated.where_clauses.len(), 1);
            // Adjust this assertion if WhereClause doesn't implement Display/to_string
            assert_eq!(decorated.where_clauses[0].to_string(), "foo == bar");
        }
        _ => panic!("expected Decorated formula"),
    }
}

#[test]
fn v2_span_is_propagated_from_search_principal() {
    // Build a simple principal with a pattern formula
    let principal = SearchPrincipal {
        formula: MatchFormula::Pattern(String::from("foo($X)")),
        span: None,
        // Add any additional required fields here to satisfy the struct definition.
    };

    // Use whatever span type/constructor your codebase exposes for rule spans
    let rule_span = Some(Span::from_offset(0, 10));
    let normalized = normalize_search_principal(rule_span.clone(), principal);

    assert_eq!(normalized.span, rule_span);
}

#[test]
fn legacy_pattern_regex_normalizes_to_regex_atom() {

```

Because only a fragment of the file and type definitions is visible, you will need to align the new tests with your existing types:

1. **WhereClause construction**
   - Replace `WhereClause::Raw(String::from("foo == bar"))` with the concrete variant/constructor your codebase uses to represent a simple `where` clause.  
   - If `WhereClause` does not implement `Display`, replace `decorated.where_clauses[0].to_string()` with a field-level assertion that matches how `normalize_match` stores the textual predicate (e.g. `decorated.where_clauses[0].expr.source` or similar).

2. **Formula / Decorated shape**
   - If your normalized type is something like `Formula::Decorated(DecoratedFormula { formula, where_clauses, as_name, fix, .. })`, update the pattern in the `match` arm to match the real struct/field names.  
   - If `normalize_v2` returns a wrapper (e.g. `NormalizedMatch`) instead of `Formula` directly, destructure that wrapper first and then match on its `formula` field.

3. **SearchPrincipal and span type**
   - Replace the `SearchPrincipal` literal with the real struct fields required in your codebase (e.g. `rule`, `language`, etc.).  
   - Replace `Span::from_offset(0, 10)` with the constructor or literal you normally use for rule spans (e.g. `Span { start: 0, end: 10 }`, `RuleSpan::new(0, 10)`, or `TextRange::new(...)`).  
   - Adjust the field name `span` on `SearchPrincipal` and `normalized` if your struct uses a different name (e.g. `rule_span` or `range`).

4. **Import paths**
   - Ensure the test module has `use` statements for `MatchFormula`, `Formula`, `WhereClause`, `SearchPrincipal`, `Span` (or your span type), `normalize_v2`, and `normalize_search_principal` from the correct modules.
</issue_to_address>

### Comment 5
<location path="crates/sempai/src/tests/normalization_tests.rs" line_range="61" />
<code_context>
+}
+
+#[test]
+fn legacy_pattern_not_inside_normalizes_to_not_inside() {
+    let result = normalize_legacy(LegacyFormula::PatternNotInside(Box::new(
+        LegacyValue::String(String::from("class Foo:")),
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding tests for legacy `pattern-either` and `semgrep-internal-pattern-anywhere` normalization.

Current legacy tests don’t cover `LegacyFormula::PatternEither` or `LegacyFormula::Anywhere` (for `semgrep-internal-pattern-anywhere`). Please add tests that
- verify `pattern-either: [...]` normalizes to `Formula::Or` with the correct children, and
- verify `semgrep-internal-pattern-anywhere` normalizes to `Formula::Anywhere` wrapping the child formula,
so the legacy-to-canonical mapping is fully exercised.

Suggested implementation:

```rust
#[test]
fn legacy_pattern_not_inside_normalizes_to_not_inside() {
    let result = normalize_legacy(LegacyFormula::PatternNotInside(Box::new(
        LegacyValue::String(String::from("class Foo:")),
    )));
    match result {
        Formula::Not(not_inner) => match not_inner.node {
            Formula::Inside(inside_inner) => match inside_inner.node {
                Formula::Atom(Atom::Pattern(pat)) => {
                    assert_eq!(pat.text, "class Foo:");
                }
                _ => panic!("expected Pattern atom"),
            },
            _ => panic!("expected Inside formula inside Not"),
        },
        _ => panic!("expected Not formula"),
    }
}

#[test]
fn legacy_pattern_either_normalizes_to_or() {
    let result = normalize_legacy(LegacyFormula::PatternEither(vec![
        LegacyFormula::Pattern(Box::new(LegacyValue::String(String::from("first pattern")))),
        LegacyFormula::Pattern(Box::new(LegacyValue::String(String::from("second pattern")))),
    ]));

    match result {
        Formula::Or(children) => {
            assert_eq!(children.len(), 2, "expected two children in Or formula");

            match &children[0].node {
                Formula::Atom(Atom::Pattern(pat)) => {
                    assert_eq!(pat.text, "first pattern");
                }
                _ => panic!("expected first child of Or to be a Pattern atom"),
            }

            match &children[1].node {
                Formula::Atom(Atom::Pattern(pat)) => {
                    assert_eq!(pat.text, "second pattern");
                }
                _ => panic!("expected second child of Or to be a Pattern atom"),
            }
        }
        _ => panic!("expected Or formula from legacy pattern-either"),
    }
}

#[test]
fn legacy_anywhere_normalizes_to_anywhere() {
    // This corresponds to legacy `semgrep-internal-pattern-anywhere` with a single pattern child.
    let result = normalize_legacy(LegacyFormula::Anywhere(Box::new(
        LegacyFormula::Pattern(Box::new(LegacyValue::String(String::from(
            "pattern anywhere",
        )))),
    )));

    match result {
        Formula::Anywhere(inner) => match inner.node {
            Formula::Atom(Atom::Pattern(pat)) => {
                assert_eq!(pat.text, "pattern anywhere");
            }
            _ => panic!("expected Pattern atom inside Anywhere"),
        },
        _ => panic!("expected Anywhere formula from semgrep-internal-pattern-anywhere"),

```

The above edits assume the following shapes, which you should confirm in your codebase and adjust if necessary:
1. `LegacyFormula::PatternEither` takes a `Vec<LegacyFormula>`. If it instead takes a `Vec<LegacyValue>` or a different wrapper type, wrap the elements accordingly in the test.
2. `LegacyFormula::Pattern` is constructed as `LegacyFormula::Pattern(Box<LegacyValue::String(...)>)`. If the constructor signature differs, update the test calls.
3. `LegacyFormula::Anywhere` is defined as wrapping a `Box<LegacyFormula>`. If it instead wraps a `LegacyValue` directly, change the argument to `Anywhere` to match.
4. `Formula::Or` is assumed to contain a `Vec<Spanned<Formula>>` (or similar) where you access the formula via `.node`. If its inner type is different, update the pattern match on `Formula::Or(children)` and indexing of `children[0]` / `children[1]` accordingly.
Once these shapes are confirmed and aligned, the tests will fully cover normalization from legacy `pattern-either` and `semgrep-internal-pattern-anywhere` to `Formula::Or` and `Formula::Anywhere`, respectively.
</issue_to_address>

### Comment 6
<location path="crates/sempai/src/tests/semantic_validation_tests.rs" line_range="21" />
<code_context>
+}
+
+#[test]
+fn valid_or_with_positive_branches_passes() {
+    let formula = Decorated {
+        node: Formula::Or(vec![make_pattern("foo"), make_pattern("bar")]),
</code_context>
<issue_to_address>
**suggestion (testing):** Add a sanity test that a single positive atom formula passes validation untouched.

Current tests cover `And`/`Or` combinations, but not the simplest case. Please add a test where the root is a single `Atom` (e.g. `make_pattern("foo")`) and assert that `validate_formula` returns `Ok(())` to guard against future regressions on trivially valid formulas.
</issue_to_address>

### Comment 7
<location path="docs/execplans/4-1-5-normalization-into-canonical-formula-model.md" line_range="168" />
<code_context>
+produces `LegacyFormula` and `MatchFormula` ASTs in `sempai_yaml`, which are
+structurally distinct but semantically equivalent for the subset they share.
+
+The normalisation step must lower both representations into the design
+document's canonical `Formula` enum, then run semantic constraint checks before
+constructing a `QueryPlan`.
</code_context>
<issue_to_address>
**nitpick (typo):** Align spelling of "normalize/normalisation" within this document.

This section uses "normalisation" while earlier parts (including the title) use "normalize/normalization". Please choose one variant (likely the one from the title) and use it consistently in this file.

Suggested implementation:

```
The normalization step must lower both representations into the design

```

Search the rest of `docs/execplans/4-1-5-normalization-into-canonical-formula-model.md` for other variants like `normalisation`, `normalise`, and replace them with `normalization`, `normalize` respectively to keep spelling consistent with the title and earlier sections.
</issue_to_address>

### Comment 8
<location path="crates/sempai/src/normalize.rs" line_range="1" />
<code_context>
+//! Normalization of parsed query syntax into canonical formula model.
+//!
+//! This module provides the transformation from parsed
</code_context>
<issue_to_address>
**issue (review_instructions):** Add unit and behavioural tests for the normalization pipeline; the new normalization logic is currently untested.

This module implements the core normalization logic from `SearchQueryPrincipal` to the canonical `Formula` model, including legacy/v2 mappings and special handling for `r2c-internal-project-depends-on`. There are no unit tests in `crates/sempai/src/tests/normalization_tests.rs` (the file is empty in this diff) and no visible BDD coverage for the happy, unhappy, and edge paths you describe in the ExecPlan. Add unit tests that cover all branches of `normalize_legacy`, `normalize_legacy_patterns`, `normalize_match`, and `normalize_dependency_principal`, and add `rstest-bdd` feature scenarios that exercise normalization end-to-end via the public engine API.

<details>
<summary>Review instructions:</summary>

**Path patterns:** `**/*`

**Instructions:**
- For any new feature or change to an existing feature, both behavioural *and* unit tests are required.
- Tests must use `rstest-bdd` v0.5.0 for BDD scenarios covering happy, unhappy, and edge paths.

</details>
</issue_to_address>

### Comment 9
<location path="docs/execplans/4-1-5-normalization-into-canonical-formula-model.md" line_range="163" />
<code_context>
+
+## Context and orientation
+
+The Sempai query pipeline currently stops at a `NOT_IMPLEMENTED` placeholder
+after successfully parsing and mode-validating a YAML rule file.  The parser
+produces `LegacyFormula` and `MatchFormula` ASTs in `sempai_yaml`, which are
</code_context>
<issue_to_address>
**suggestion (review_instructions):** ASTs are referenced later in this section without defining the acronym.

In the paragraph beginning "The parser produces `LegacyFormula` and `MatchFormula` ASTs", please expand AST on first use, for example "abstract syntax trees (ASTs)".

<details>
<summary>Review instructions:</summary>

**Path patterns:** `**/*.md`

**Instructions:**
Define uncommon acronyms on first use.

</details>
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +40 to +49
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Formula {
/// A leaf pattern or regex atom.
Atom(Atom),
/// Negation: the inner formula must not match.
Not(Box<Decorated<Formula>>),
/// Context constraint: the anchor must be inside a match of the inner.
Inside(Box<Decorated<Formula>>),
/// Context constraint: the inner must match somewhere in scope.
Anywhere(Box<Decorated<Formula>>),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Deriving Eq for WhereClause will fail because serde_json::Value does not implement Eq.

Because serde_json::Value only implements PartialEq, #[derive(Eq)] on WhereClause will not compile. Please either remove Eq (and any WhereClause: Eq bounds such as on Decorated<T>) or switch raw to a type that can soundly implement Eq.

Comment thread crates/sempai/src/tests/engine_tests.rs Outdated

#[test]
fn compile_yaml_returns_not_implemented_after_successful_parse() {
fn compile_yaml_normalizes_and_returns_query_plans() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Add tests that cover semantic validation errors surfaced through compile_yaml (e.g. invalid-not-in-or, missing-positive-term-in-and).

Current compile_yaml_* tests only cover successful cases. Since compile_yaml now calls validate_formula, please add tests that build YAML rules which normalize to:

  • an Or with a negated branch, and
  • an And with only constraint-style terms (not, inside, anywhere) and no positive term.

These should assert that compile_yaml returns an Err, and that the first diagnostic codes are ESempaiInvalidNotInOr and ESempaiMissingPositiveTermInAnd, to confirm semantic errors are exposed via the public API.

Suggested implementation:

#[test]
fn compile_yaml_normalizes_and_returns_query_plans() {
    let engine = default_engine();
    let result = engine.compile_yaml(
        "rules:\n  - id: demo.rule\n    message: oops\n    languages: [rust]\n    severity: ERROR\n    pattern: foo($X)\n",
    );
    let plans = result.expect("should successfully compile");
    assert_eq!(plans.len(), 1);
    let plan = plans.first().expect("should have first plan");
    assert_eq!(plan.rule_id(), "demo.rule");
    assert_eq!(plan.language(), Language::Rust);
}

#[test]
fn compile_yaml_returns_semantic_error_for_invalid_not_in_or() {
    let engine = default_engine();
    let result = engine.compile_yaml(
        "\
rules:
  - id: demo.invalid.not.in.or
    message: invalid not in or
    languages: [rust]
    severity: ERROR
    pattern: not foo($X) or bar($Y)
",
    );

    let error = result.expect_err("compile_yaml should surface semantic validation errors");
    let diagnostics = error.diagnostics();
    let first = diagnostics
        .first()
        .expect("semantic error should contain at least one diagnostic");
    assert_eq!(first.code(), "ESempaiInvalidNotInOr");
}

#[test]
fn compile_yaml_returns_semantic_error_for_missing_positive_term_in_and() {
    let engine = default_engine();
    let result = engine.compile_yaml(
        "\
rules:
  - id: demo.missing.positive.term.in.and
    message: missing positive term in and
    languages: [rust]
    severity: ERROR
    pattern: not foo($X) and inside(bar($Y))
",
    );

    let error = result.expect_err("compile_yaml should surface semantic validation errors");
    let diagnostics = error.diagnostics();
    let first = diagnostics
        .first()
        .expect("semantic error should contain at least one diagnostic");
    assert_eq!(first.code(), "ESempaiMissingPositiveTermInAnd");
}

To integrate this with the existing codebase you may need to:

  1. Adjust how diagnostics are accessed from the error returned by compile_yaml. For example, replace error.diagnostics() and first.code() with the actual methods/fields used in your error/diagnostic types (e.g. error.diagnostics, error.diagnostics()[0].code, or first.code.to_string()).
  2. Confirm the exact string representation or enum formatting of the diagnostic codes. If they are enums instead of strings, change the assertions to match the enum variant (e.g. assert_eq!(first.code, DiagnosticCode::ESempaiInvalidNotInOr)).
  3. If your pattern language requires a different syntax to produce an Or with a negated branch or an And with only constraint-style terms, tweak the pattern: strings in the YAML to align with how validate_formula actually interprets formulas in your project.

use crate::normalize::normalize_search_principal;

/// Helper to normalize a legacy formula and extract the node.
fn normalize_legacy(formula: LegacyFormula) -> Formula {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Add normalization tests for patterns (legacy LegacyFormula::Patterns) including constraint mapping into where_clauses.

Existing legacy normalization tests only cover simple pattern / pattern-regex and some negated/inside variants; they don’t exercise LegacyFormula::Patterns, where constraints are attached as where_clauses on the enclosing And. Please add a test that builds a LegacyFormula::Patterns with both formula clauses and one or more constraint clauses, and asserts that:

  • the normalized node is a Formula::And with the expected child formulas, and
  • the Decorated wrapper holds the expected where_clauses while all child formulas have empty where_clauses.

This will verify the constraint propagation logic in normalize_legacy_patterns.

Suggested implementation:

use sempai_core::formula::{Atom, Decorated, Formula, PatternAtom, RegexAtom};
/// Helper to normalize a legacy formula and extract the node.
fn normalize_legacy(formula: LegacyFormula) -> Formula {
    let principal = SearchQueryPrincipal::Legacy(formula);
    normalize_search_principal(&principal, None)
        .expect("should normalize")
        .node
}

/// Helper to normalize a legacy formula and keep the decorated wrapper.
///
/// This is useful for tests that need to inspect `where_clauses` or other
/// metadata attached to the normalized formula.
fn normalize_legacy_decorated(formula: LegacyFormula) -> Decorated<Formula> {
    let principal = SearchQueryPrincipal::Legacy(formula);
    normalize_search_principal(&principal, None).expect("should normalize")
}

/// Helper to normalize a v2 match formula and extract the node.
fn normalize_v2(formula: MatchFormula) -> Formula {
    let principal = SearchQueryPrincipal::Match(formula);
    normalize_search_principal(&principal, None)
        .expect("should normalize")
        .node
}

/// Normalization of `LegacyFormula::Patterns` should:
/// - produce an outer `Formula::And` node combining the pattern formulas, and
/// - attach any constraints as `where_clauses` on the outer `Decorated` node,
///   while all child formulas have empty `where_clauses`.
#[test]
fn normalize_legacy_patterns_propagates_constraints_to_where_clauses() {
    // NOTE: The exact constructors for `LegacyFormula::Pattern` / constraint
    // variants should match the rest of the tests in this module. If the
    // field names differ, adjust accordingly.
    //
    // Two simple pattern formulas:
    let pattern_formula_1 = LegacyFormula::Pattern {
        // Example pattern string; align with existing tests if they use
        // different patterns or additional fields.
        pattern: "foo($x)".to_string(),
        // Fill in any additional fields that the real `Pattern` variant
        // expects (e.g. language, is_regexp, etc.).
        ..Default::default()
    };

    let pattern_formula_2 = LegacyFormula::Pattern {
        pattern: "bar($x)".to_string(),
        ..Default::default()
    };

    // One or more constraint clauses that should end up as `where_clauses`
    // on the enclosing `And` in the normalized representation.
    //
    // This assumes there is some constraint-like legacy variant; adjust the
    // variant / fields to whatever is used elsewhere in the codebase for
    // capture constraints.
    let constraint_formula = LegacyFormula::Constraint {
        // Example: `$x` is constrained to some value.
        // Replace these fields with the actual ones from the enum.
        variable: "x".to_string(),
        value: LegacyValue::String("constrained".to_string()),
    };

    let legacy = LegacyFormula::Patterns {
        // These are the pattern formulas that should become the children
        // of the normalized `Formula::And`.
        patterns: vec![pattern_formula_1, pattern_formula_2],
        // These are the constraints that `normalize_legacy_patterns` is
        // expected to map into `where_clauses` on the enclosing `And`.
        constraints: vec![constraint_formula],
    };

    let decorated = normalize_legacy_decorated(legacy);

    // Outer node must be an `And` combining the pattern formulas.
    let children = match &decorated.node {
        Formula::And(children) => children,
        other => panic!("expected normalized legacy Patterns to be And, got {:?}", other),
    };

    // We expect exactly the two pattern children we provided above.
    assert_eq!(
        children.len(),
        2,
        "expected two child formulas in normalized And for LegacyFormula::Patterns"
    );

    // All constraints should be attached as `where_clauses` on the outer
    // decorated node.
    assert!(
        !decorated.where_clauses.is_empty(),
        "expected where_clauses on outer Decorated<Formula> for LegacyFormula::Patterns constraints"
    );

    // Child formulas themselves must not carry any where_clauses; they should
    // all be attached to the enclosing `And`.
    for (idx, child) in children.iter().enumerate() {
        assert!(
            child.where_clauses.is_empty(),
            "expected child formula {} of Patterns And to have empty where_clauses",
            idx
        );
    }
}

The above test assumes specific shapes for the LegacyFormula variants and Decorated type that may differ slightly from your actual implementation. To integrate cleanly you should:

  1. Adjust the LegacyFormula constructors:

    • Replace LegacyFormula::Pattern { pattern, ..Default::default() } with the exact syntax used in your existing legacy normalization tests (same fields, same variant name).
    • Replace LegacyFormula::Constraint { variable, value } with the actual constraint variant used for legacy pattern constraints (for example it might be something like LegacyFormula::Where, LegacyFormula::CaptureConstraint, or similar).
    • Update the LegacyFormula::Patterns { patterns, constraints } fields (patterns, constraints) to match the real field names for the Patterns variant.
  2. Update the Decorated type usage:

    • If Decorated lives in a different module or is re-exported under a different path, update the import (use sempai_core::formula::{Atom, Decorated, Formula, PatternAtom, RegexAtom};) accordingly.
    • If the fields on Decorated are named differently (e.g. node vs formula, where_clauses vs constraints), adjust the helper normalize_legacy_decorated return type and the test field accesses to match.
  3. Ensure Default::default() is valid:

    • If the LegacyFormula::Pattern variant does not support the struct-update ..Default::default() (e.g. it’s not #[derive(Default)] or doesn’t use struct syntax), replace those with complete explicit construction as used in the existing tests in this file.
  4. Align where-clause type checks:

    • If where_clauses is not a Vec<_> or has a different accessor, update the assert!(!decorated.where_clauses.is_empty()) and child where_clauses assertions to match the real API.

}

#[test]
fn v2_pattern_shorthand_normalizes_to_atom() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Extend v2 normalization tests to cover MatchFormula::Decorated metadata (where/as/fix) and span propagation.

Current v2 tests only cover the structural mapping for pattern/regex/all/any/not/inside/anywhere, not the MatchFormula::Decorated { .. } variant. Since normalize_match maps where into where_clauses and copies as_name/fix onto Decorated, please add a test that builds a decorated formula with non-empty where/as/fix and asserts those fields are preserved. It would also be helpful to add a test that calls normalize_search_principal with a non-None rule_span and verifies the resulting span is set, to cover span propagation through normalization.

Suggested implementation:

#[test]
fn v2_decorated_metadata_is_preserved() {
    // Build a decorated match formula with non-empty where/as/fix metadata
    let inner = MatchFormula::Pattern(String::from("foo($X)"));
    let mf = MatchFormula::Decorated {
        formula: Box::new(inner),
        // Use a minimal where-clause that your existing types support
        // (e.g. WhereClause::Raw, WhereClause::Expr, etc.)
        where_clauses: vec![WhereClause::Raw(String::from("foo == bar"))],
        as_name: Some(String::from("my_capture")),
        fix: Some(String::from("apply_fix")),
        span: None,
    };

    let result = normalize_v2(mf);

    match result {
        Formula::Decorated(decorated) => {
            assert_eq!(decorated.as_name.as_deref(), Some("my_capture"));
            assert_eq!(decorated.fix.as_deref(), Some("apply_fix"));
            assert_eq!(decorated.where_clauses.len(), 1);
            // Adjust this assertion if WhereClause doesn't implement Display/to_string
            assert_eq!(decorated.where_clauses[0].to_string(), "foo == bar");
        }
        _ => panic!("expected Decorated formula"),
    }
}

#[test]
fn v2_span_is_propagated_from_search_principal() {
    // Build a simple principal with a pattern formula
    let principal = SearchPrincipal {
        formula: MatchFormula::Pattern(String::from("foo($X)")),
        span: None,
        // Add any additional required fields here to satisfy the struct definition.
    };

    // Use whatever span type/constructor your codebase exposes for rule spans
    let rule_span = Some(Span::from_offset(0, 10));
    let normalized = normalize_search_principal(rule_span.clone(), principal);

    assert_eq!(normalized.span, rule_span);
}

#[test]
fn legacy_pattern_regex_normalizes_to_regex_atom() {

Because only a fragment of the file and type definitions is visible, you will need to align the new tests with your existing types:

  1. WhereClause construction

    • Replace WhereClause::Raw(String::from("foo == bar")) with the concrete variant/constructor your codebase uses to represent a simple where clause.
    • If WhereClause does not implement Display, replace decorated.where_clauses[0].to_string() with a field-level assertion that matches how normalize_match stores the textual predicate (e.g. decorated.where_clauses[0].expr.source or similar).
  2. Formula / Decorated shape

    • If your normalized type is something like Formula::Decorated(DecoratedFormula { formula, where_clauses, as_name, fix, .. }), update the pattern in the match arm to match the real struct/field names.
    • If normalize_v2 returns a wrapper (e.g. NormalizedMatch) instead of Formula directly, destructure that wrapper first and then match on its formula field.
  3. SearchPrincipal and span type

    • Replace the SearchPrincipal literal with the real struct fields required in your codebase (e.g. rule, language, etc.).
    • Replace Span::from_offset(0, 10) with the constructor or literal you normally use for rule spans (e.g. Span { start: 0, end: 10 }, RuleSpan::new(0, 10), or TextRange::new(...)).
    • Adjust the field name span on SearchPrincipal and normalized if your struct uses a different name (e.g. rule_span or range).
  4. Import paths

    • Ensure the test module has use statements for MatchFormula, Formula, WhereClause, SearchPrincipal, Span (or your span type), normalize_v2, and normalize_search_principal from the correct modules.

}

#[test]
fn legacy_pattern_not_inside_normalizes_to_not_inside() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Consider adding tests for legacy pattern-either and semgrep-internal-pattern-anywhere normalization.

Current legacy tests don’t cover LegacyFormula::PatternEither or LegacyFormula::Anywhere (for semgrep-internal-pattern-anywhere). Please add tests that

  • verify pattern-either: [...] normalizes to Formula::Or with the correct children, and
  • verify semgrep-internal-pattern-anywhere normalizes to Formula::Anywhere wrapping the child formula,
    so the legacy-to-canonical mapping is fully exercised.

Suggested implementation:

#[test]
fn legacy_pattern_not_inside_normalizes_to_not_inside() {
    let result = normalize_legacy(LegacyFormula::PatternNotInside(Box::new(
        LegacyValue::String(String::from("class Foo:")),
    )));
    match result {
        Formula::Not(not_inner) => match not_inner.node {
            Formula::Inside(inside_inner) => match inside_inner.node {
                Formula::Atom(Atom::Pattern(pat)) => {
                    assert_eq!(pat.text, "class Foo:");
                }
                _ => panic!("expected Pattern atom"),
            },
            _ => panic!("expected Inside formula inside Not"),
        },
        _ => panic!("expected Not formula"),
    }
}

#[test]
fn legacy_pattern_either_normalizes_to_or() {
    let result = normalize_legacy(LegacyFormula::PatternEither(vec![
        LegacyFormula::Pattern(Box::new(LegacyValue::String(String::from("first pattern")))),
        LegacyFormula::Pattern(Box::new(LegacyValue::String(String::from("second pattern")))),
    ]));

    match result {
        Formula::Or(children) => {
            assert_eq!(children.len(), 2, "expected two children in Or formula");

            match &children[0].node {
                Formula::Atom(Atom::Pattern(pat)) => {
                    assert_eq!(pat.text, "first pattern");
                }
                _ => panic!("expected first child of Or to be a Pattern atom"),
            }

            match &children[1].node {
                Formula::Atom(Atom::Pattern(pat)) => {
                    assert_eq!(pat.text, "second pattern");
                }
                _ => panic!("expected second child of Or to be a Pattern atom"),
            }
        }
        _ => panic!("expected Or formula from legacy pattern-either"),
    }
}

#[test]
fn legacy_anywhere_normalizes_to_anywhere() {
    // This corresponds to legacy `semgrep-internal-pattern-anywhere` with a single pattern child.
    let result = normalize_legacy(LegacyFormula::Anywhere(Box::new(
        LegacyFormula::Pattern(Box::new(LegacyValue::String(String::from(
            "pattern anywhere",
        )))),
    )));

    match result {
        Formula::Anywhere(inner) => match inner.node {
            Formula::Atom(Atom::Pattern(pat)) => {
                assert_eq!(pat.text, "pattern anywhere");
            }
            _ => panic!("expected Pattern atom inside Anywhere"),
        },
        _ => panic!("expected Anywhere formula from semgrep-internal-pattern-anywhere"),

The above edits assume the following shapes, which you should confirm in your codebase and adjust if necessary:

  1. LegacyFormula::PatternEither takes a Vec<LegacyFormula>. If it instead takes a Vec<LegacyValue> or a different wrapper type, wrap the elements accordingly in the test.
  2. LegacyFormula::Pattern is constructed as LegacyFormula::Pattern(Box<LegacyValue::String(...)>). If the constructor signature differs, update the test calls.
  3. LegacyFormula::Anywhere is defined as wrapping a Box<LegacyFormula>. If it instead wraps a LegacyValue directly, change the argument to Anywhere to match.
  4. Formula::Or is assumed to contain a Vec<Spanned<Formula>> (or similar) where you access the formula via .node. If its inner type is different, update the pattern match on Formula::Or(children) and indexing of children[0] / children[1] accordingly.
    Once these shapes are confirmed and aligned, the tests will fully cover normalization from legacy pattern-either and semgrep-internal-pattern-anywhere to Formula::Or and Formula::Anywhere, respectively.

}

#[test]
fn valid_or_with_positive_branches_passes() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Add a sanity test that a single positive atom formula passes validation untouched.

Current tests cover And/Or combinations, but not the simplest case. Please add a test where the root is a single Atom (e.g. make_pattern("foo")) and assert that validate_formula returns Ok(()) to guard against future regressions on trivially valid formulas.

Comment thread docs/execplans/4-1-5-normalization-into-canonical-formula-model.md Outdated
Comment thread crates/sempai/src/normalize.rs

## Context and orientation

The Sempai query pipeline currently stops at a `NOT_IMPLEMENTED` placeholder
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (review_instructions): ASTs are referenced later in this section without defining the acronym.

In the paragraph beginning "The parser produces LegacyFormula and MatchFormula ASTs", please expand AST on first use, for example "abstract syntax trees (ASTs)".

Review instructions:

Path patterns: **/*.md

Instructions:
Define uncommon acronyms on first use.

@coderabbitai coderabbitai Bot added the Roadmap label Apr 22, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b41c038c28

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +148 to +152
file.rules()
.iter()
.filter_map(|rule| {
if let RulePrincipal::Search(principal) = rule.principal() {
Some((rule, principal))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Align BDD scenarios with new compile_yaml success path

This change makes compile_yaml return Ok(Vec<QueryPlan>) for valid search rules, but the behavioural scenarios in crates/sempai/tests/features/sempai_engine.feature still expect NOT_IMPLEMENTED for valid YAML and r2c-internal-project-depends-on. With this new path, those scenarios now hit then_compilation_fails on successful results and fail the behaviour test suite, so CI will break unless the feature expectations are updated to assert successful compilation.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/sempai/src/tests/engine_tests.rs (1)

68-98: 🛠️ Refactor suggestion | 🟠 Major

Collapse duplicated compile-yaml normalisation tests into one parameterized test.

Unify the two near-identical tests into a single #[rstest] case table and keep one assertion body.

♻️ Proposed refactor
+use rstest::rstest;
 use crate::engine::QueryPlan;
 use crate::{
     Diagnostic, DiagnosticCode, DiagnosticReport, Engine, EngineConfig, EngineLimits, Language,
 };
 use sempai_core::formula::{Atom, Decorated, Formula, PatternAtom};
@@
-#[test]
-fn compile_yaml_normalizes_and_returns_query_plans() {
-    let engine = default_engine();
-    let result = engine.compile_yaml(
-        "rules:\n  - id: demo.rule\n    message: oops\n    languages: [rust]\n    severity: ERROR\n    pattern: foo($X)\n",
-    );
-    let plans = result.expect("should successfully compile");
-    assert_eq!(plans.len(), 1);
-    let plan = plans.first().expect("should have first plan");
-    assert_eq!(plan.rule_id(), "demo.rule");
-    assert_eq!(plan.language(), Language::Rust);
-}
-
-#[test]
-fn compile_yaml_normalizes_project_depends_on_search_rule() {
-    let engine = default_engine();
-    let result = engine.compile_yaml(concat!(
-        "rules:\n",
-        "  - id: demo.depends\n",
-        "    message: detect vulnerable dependency\n",
-        "    languages: [python]\n",
-        "    severity: WARNING\n",
-        "    r2c-internal-project-depends-on:\n",
-        "      namespace: pypi\n",
-        "      package: requests\n",
-    ));
-    let plans = result.expect("should successfully compile");
-    assert_eq!(plans.len(), 1);
-    let plan = plans.first().expect("should have first plan");
-    assert_eq!(plan.rule_id(), "demo.depends");
-    assert_eq!(plan.language(), Language::Python);
-}
+#[rstest]
+#[case(
+    "rules:\n  - id: demo.rule\n    message: oops\n    languages: [rust]\n    severity: ERROR\n    pattern: foo($X)\n",
+    "demo.rule",
+    Language::Rust
+)]
+#[case(
+    concat!(
+        "rules:\n",
+        "  - id: demo.depends\n",
+        "    message: detect vulnerable dependency\n",
+        "    languages: [python]\n",
+        "    severity: WARNING\n",
+        "    r2c-internal-project-depends-on:\n",
+        "      namespace: pypi\n",
+        "      package: requests\n",
+    ),
+    "demo.depends",
+    Language::Python
+)]
+fn compile_yaml_normalizes_and_returns_query_plans(
+    #[case] yaml: &str,
+    #[case] expected_rule_id: &str,
+    #[case] expected_language: Language,
+) {
+    let engine = default_engine();
+    let plans = engine
+        .compile_yaml(yaml)
+        .expect("should successfully compile");
+    assert_eq!(plans.len(), 1);
+    let plan = plans.first().expect("should have first plan");
+    assert_eq!(plan.rule_id(), expected_rule_id);
+    assert_eq!(plan.language(), expected_language);
+}

As per coding guidelines, "Use rstest fixtures for shared setup in Rust tests" and "Replace duplicated tests with #[rstest(...)] parameterised cases in Rust."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/sempai/src/tests/engine_tests.rs` around lines 68 - 98, Combine the
two tests compile_yaml_normalizes_and_returns_query_plans and
compile_yaml_normalizes_project_depends_on_search_rule into one parameterized
rstest that iterates cases of (yaml, expected_rule_id, expected_language); reuse
the same default_engine(), call engine.compile_yaml(yaml) and the shared
assertions plan.rule_id() == expected_rule_id and plan.language() ==
expected_language (e.g., Language::Rust and Language::Python) inside the single
test body to remove duplication; annotate the new test with #[rstest] and
provide case table entries containing the two existing YAML strings and their
expected values, keeping the same calls to default_engine and
engine.compile_yaml.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/sempai/src/normalize.rs`:
- Around line 252-264: The current normalize_dependency_principal uses a
TreeSitterQueryAtom with query "(ERROR) `@_dependency_check`" which can match real
ERROR nodes; change the query to use a node type that can never exist (e.g.
"__NONEXISTENT_NODE__") so the pattern is truly non-matchable: in
normalize_dependency_principal replace the query string in TreeSitterQueryAtom
(the call inside bare(...) that constructs
Formula::Atom(Atom::TreeSitterQuery(TreeSitterQueryAtom { query: ... }))) with a
deliberately impossible node name like "(__NONEXISTENT_NODE__)
`@_dependency_check`" (and update the inline comment to explain why).

In `@crates/sempai/src/semantic_check.rs`:
- Around line 48-76: The current check_no_not_in_or() only rejects a
Formula::Not at the branch root, allowing nested negations like Or([And([Atom,
Not])]) to slip through; add a helper fn branch_contains_not(formula: &Formula)
-> bool that recursively returns true for Formula::Not and traverses
Formula::And, Formula::Or, Formula::Inside, and Formula::Anywhere branches
(false for Atom), then in check_no_not_in_or replace the matches!(branch.node,
Formula::Not(_)) test with if branch_contains_not(&branch.node) to reject any
branch that contains a negation anywhere and keep the existing error
construction using branch.span or the outer span.
- Around line 83-125: The current is_positive_term wrongly treats any And/Or as
positive by shape; change is_positive_term (and remove const) to recursively
inspect And/Or branches so it returns true only if an Atom exists in its
descendants (i.e., match on Formula::Atom => true; Formula::And(branches) |
Formula::Or(branches) => branches.iter().any(|b| is_positive_term(&b.node));
Formula::Not | Formula::Inside | Formula::Anywhere => false), leaving
check_positive_term_in_and as-is to call the new function.

In `@docs/execplans/4-1-5-normalization-into-canonical-formula-model.md`:
- Around line 56-66: The doc text is out of sync: update the ExecPlan
description to reflect that semantic validation now lives in
crates/sempai/src/semantic_check.rs (not in sempai_core), and change the
Decorated<T> field names to match the code: where_clauses, as_name, fix, and
span: Option<SourceSpan>; also mention that sempai_core still owns Formula,
Atom, Decorated, WhereClause types, sempai_yaml owns LegacyFormula/MatchFormula,
and sempai (facade) wires parsing→normalization→validation→plan construction;
apply the same corrections to the repeated block referenced around lines
212-218.

---

Outside diff comments:
In `@crates/sempai/src/tests/engine_tests.rs`:
- Around line 68-98: Combine the two tests
compile_yaml_normalizes_and_returns_query_plans and
compile_yaml_normalizes_project_depends_on_search_rule into one parameterized
rstest that iterates cases of (yaml, expected_rule_id, expected_language); reuse
the same default_engine(), call engine.compile_yaml(yaml) and the shared
assertions plan.rule_id() == expected_rule_id and plan.language() ==
expected_language (e.g., Language::Rust and Language::Python) inside the single
test body to remove duplication; annotate the new test with #[rstest] and
provide case table entries containing the two existing YAML strings and their
expected values, keeping the same calls to default_engine and
engine.compile_yaml.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0888edfc-62dc-453b-bade-07480e3c886a

📥 Commits

Reviewing files that changed from the base of the PR and between 90ba5e4 and b41c038.

📒 Files selected for processing (16)
  • crates/sempai-core/Cargo.toml
  • crates/sempai-core/src/formula.rs
  • crates/sempai-core/src/lib.rs
  • crates/sempai/src/engine.rs
  • crates/sempai/src/lib.rs
  • crates/sempai/src/normalize.rs
  • crates/sempai/src/semantic_check.rs
  • crates/sempai/src/tests/behaviour.rs
  • crates/sempai/src/tests/engine_tests.rs
  • crates/sempai/src/tests/mod.rs
  • crates/sempai/src/tests/normalization_tests.rs
  • crates/sempai/src/tests/semantic_validation_tests.rs
  • docs/execplans/4-1-5-normalization-into-canonical-formula-model.md
  • docs/roadmap.md
  • docs/sempai-query-language-design.md
  • docs/users-guide.md

Comment thread crates/sempai/src/normalize.rs
Comment on lines +48 to +76
match formula {
Formula::Or(branches) => {
for branch in branches {
if matches!(branch.node, Formula::Not(_)) {
return Err(DiagnosticReport::validation_error(
DiagnosticCode::ESempaiInvalidNotInOr,
String::from(
"negated terms are not allowed inside disjunction (Or/pattern-either)",
),
branch.span.clone().or_else(|| span.cloned()),
vec![],
));
}
// Recursively check nested formulas
check_no_not_in_or(&branch.node, branch.span.as_ref().or(span))?;
}
Ok(())
}
Formula::And(branches) => {
for branch in branches {
check_no_not_in_or(&branch.node, branch.span.as_ref().or(span))?;
}
Ok(())
}
Formula::Not(inner) | Formula::Inside(inner) | Formula::Anywhere(inner) => {
check_no_not_in_or(&inner.node, inner.span.as_ref().or(span))
}
Formula::Atom(_) => Ok(()),
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Reject negation anywhere inside an Or branch.

Fail the whole branch, not just Formula::Not at the top level. Or([And([Atom(..), Not(..)]), Atom(..)]) currently passes because the recursion drops the disjunction context before it reaches the nested Not.

Patch sketch
+fn branch_contains_not(formula: &Formula) -> bool {
+    match formula {
+        Formula::Not(_) => true,
+        Formula::And(branches) | Formula::Or(branches) => {
+            branches.iter().any(|branch| branch_contains_not(&branch.node))
+        }
+        Formula::Inside(inner) | Formula::Anywhere(inner) => {
+            branch_contains_not(&inner.node)
+        }
+        Formula::Atom(_) => false,
+    }
+}
+
 fn check_no_not_in_or(
     formula: &Formula,
     span: Option<&sempai_core::SourceSpan>,
 ) -> Result<(), DiagnosticReport> {
     match formula {
         Formula::Or(branches) => {
             for branch in branches {
-                if matches!(branch.node, Formula::Not(_)) {
+                if branch_contains_not(&branch.node) {
                     return Err(DiagnosticReport::validation_error(
                         DiagnosticCode::ESempaiInvalidNotInOr,
                         String::from(
                             "negated terms are not allowed inside disjunction (Or/pattern-either)",
                         ),
                         branch.span.clone().or_else(|| span.cloned()),
                         vec![],
                     ));
                 }
                 check_no_not_in_or(&branch.node, branch.span.as_ref().or(span))?;
             }
             Ok(())
         }

Based on learnings: "Semantic validation of normalized formulas must emit E_SEMPAI_INVALID_NOT_IN_OR when negated terms appear in disjunction branches."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/sempai/src/semantic_check.rs` around lines 48 - 76, The current
check_no_not_in_or() only rejects a Formula::Not at the branch root, allowing
nested negations like Or([And([Atom, Not])]) to slip through; add a helper fn
branch_contains_not(formula: &Formula) -> bool that recursively returns true for
Formula::Not and traverses Formula::And, Formula::Or, Formula::Inside, and
Formula::Anywhere branches (false for Atom), then in check_no_not_in_or replace
the matches!(branch.node, Formula::Not(_)) test with if
branch_contains_not(&branch.node) to reject any branch that contains a negation
anywhere and keep the existing error construction using branch.span or the outer
span.

Comment thread crates/sempai/src/semantic_check.rs
Comment thread docs/execplans/4-1-5-normalization-into-canonical-formula-model.md Outdated
Address code review comments on the normalization implementation:

Improvements to semantic_check.rs:
- MissingPositiveTermInAnd diagnostic now prefers the first branch's span
  for a more precise error location, falling back to the enclosing And's
  span only when no branch has one.

New tests in engine_tests.rs:
- compile_yaml_returns_semantic_error_for_invalid_not_in_or: verifies
  that pattern-either containing a pattern-not surfaces as
  ESempaiInvalidNotInOr through the public API.
- compile_yaml_returns_semantic_error_for_missing_positive_term_in_and:
  verifies that patterns with only constraint-style terms surface as
  ESempaiMissingPositiveTermInAnd through the public API.

New tests in normalization_tests.rs:
- legacy_pattern_either_normalizes_to_or: exercises pattern-either
  legacy-to-canonical mapping.
- legacy_anywhere_normalizes_to_anywhere: exercises
  semgrep-internal-pattern-anywhere legacy-to-canonical mapping.
- legacy_patterns_propagates_constraints_to_where_clauses: verifies
  that LegacyFormula::Patterns constraints are mapped onto the
  enclosing And's where_clauses, with child formulas having none.
- v2_decorated_preserves_where_as_and_fix_metadata: verifies that
  MatchFormula::Decorated metadata (where, as, fix) is preserved on
  the canonical Decorated wrapper.
- span_propagates_from_search_principal_to_decorated: verifies that
  the rule_span is propagated through normalization.

New test in semantic_validation_tests.rs:
- single_positive_atom_passes_validation: sanity check that the
  simplest valid formula (a single Atom) passes validate_formula.

Documentation fixes:
- Normalize "normalisation" to "normalization" in the ExecPlan to
  match the document's title and surrounding text.
- Expand ASTs acronym on first use.

Test count: 33 -> 41 passing tests in sempai crate.
All clippy, rustfmt, markdownlint, and cargo test checks pass.

Note: the code review also flagged a potential Eq-derive issue on
WhereClause; this is a false positive as serde_json::Value implements
Eq. The existing derive compiles cleanly and all workspace lints pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
codescene-delta-analysis[bot]

This comment was marked as outdated.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/execplans/4-1-5-normalization-into-canonical-formula-model.md`:
- Around line 584-595: The docs show an outdated Decorated<T> interface; update
the "Interfaces and dependencies" section and any other occurrences to match the
real struct fields: ensure the struct is documented with fields node,
where_clauses, as_name, fix, and span (matching the implemented Decorated<T>),
and remove or rename any doc references to old field names so the documentation
reflects the shipped API exactly (search for Decorated<T>, where_clauses,
as_name, fix, span to locate all places to change).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 784bdbf5-23c5-466f-8d60-f529b92e5629

📥 Commits

Reviewing files that changed from the base of the PR and between b41c038 and 46b9eb7.

📒 Files selected for processing (5)
  • crates/sempai/src/semantic_check.rs
  • crates/sempai/src/tests/engine_tests.rs
  • crates/sempai/src/tests/normalization_tests.rs
  • crates/sempai/src/tests/semantic_validation_tests.rs
  • docs/execplans/4-1-5-normalization-into-canonical-formula-model.md

Comment thread docs/execplans/4-1-5-normalization-into-canonical-formula-model.md
Normalization has no real error paths today, so wrapping the result in
Result<_, DiagnosticReport> required an `#[expect(clippy::unnecessary_wraps)]`
escape hatch and forced every caller to use `?`. Addressing review feedback,
return a bare `Decorated<Formula>` instead and document that the signature
should grow a Result only when a mapping first needs to signal a structural
error.

Callers in engine.rs and the normalization tests are updated to drop the
`?` / `.expect(...)` noise.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
codescene-delta-analysis[bot]

This comment was marked as outdated.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
crates/sempai/src/normalize.rs (1)

252-258: ⚠️ Potential issue | 🟠 Major

Replace the matchable (ERROR) dependency placeholder.

Fix Line 257. Stop emitting a query that can match parser recovery nodes in malformed source. Encode a guaranteed non-match placeholder path instead of a live query shape.

Safer placeholder strategy (non-match by contradiction)
-        Formula::Atom(Atom::TreeSitterQuery(TreeSitterQueryAtom {
-            query: String::from("(ERROR) `@_dependency_check`"),
-        })),
+        Formula::Atom(Atom::TreeSitterQuery(TreeSitterQueryAtom {
+            // Keep this syntactically valid while making it semantically impossible to match.
+            query: String::from(
+                "(ERROR) `@_dependency_check` (`#eq`? `@_dependency_check` \"__sempai_never__\") (`#eq`? `@_dependency_check` \"__sempai_also_never__\")",
+            ),
+        })),
In Tree-sitter query semantics, does `(ERROR) `@cap`` match parse-recovery ERROR nodes in malformed syntax trees?

Are multiple predicates on the same capture conjunctive, such that contradictory `#eq?` predicates force a no-match result?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/sempai/src/normalize.rs` around lines 252 - 258, Replace the live
Tree-sitter query string "(ERROR) `@_dependency_check`" with a guaranteed
non-matching placeholder so the rule cannot accidentally match parser-recovery
nodes; update the TreeSitterQueryAtom creation in the
bare(Formula::Atom(Atom::TreeSitterQuery(TreeSitterQueryAtom { query: ... })))
expression to use a non-existent node name or an explicitly contradictory
predicate (for example a node type like "__never_node__" or a capture with a
contradiction) instead of "(ERROR) `@_dependency_check`" so the query is
impossible to satisfy at parse time.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/sempai/src/engine.rs`:
- Around line 17-39: The code currently clones Decorated<Formula> for each
language in compile_rule_plans; change QueryPlan to hold Arc<Decorated<Formula>>
and update compile_rule_plans to wrap the incoming formula in Arc::new(...) once
and clone the Arc for each QueryPlan instead of deep-cloning the formula. Update
other call sites mentioned (the blocks at the other two places referenced) to
construct QueryPlan with Arc<Decorated<Formula>> as well, and adjust
signature/usages of QueryPlan::new and any fields to accept
Arc<Decorated<Formula>> so the formula is shared across per-language plans.

In `@crates/sempai/src/normalize.rs`:
- Line 25: Update the malformed module doc bullet that currently reads //! -
`regex: "...` → `Formula::Atom(Atom::Regex(...))` by closing the inline code
literal (change it to //! - `regex: "..."` → `Formula::Atom(Atom::Regex(...))`)
so the backtick pair is balanced; locate the module-level doc comment containing
that snippet in normalize.rs and fix the punctuation to keep generated docs
syntactically valid.

In `@crates/sempai/src/tests/normalization_tests.rs`:
- Around line 260-267: Add a new unit test that constructs a
SearchQueryPrincipal::ProjectDependsOn variant and calls
normalize_search_principal with a non‑None SourceSpan to ensure the
ProjectDependsOn branch is exercised; assert the returned normalized principal
has the canonical shape you expect for ProjectDependsOn (match the same
structural form used elsewhere in tests) and that normalized.span == Some(span)
to lock metadata propagation for this branch. Locate the normalization
entrypoint normalize_search_principal and mirror the style of the existing
span_propagates_from_search_principal_to_decorated test: create a SourceSpan,
build a ProjectDependsOn principal value, call
normalize_search_principal(&principal, Some(&span)), then add explicit
assertions for both the normalized principal variant/contents and the span
propagation.
- Around line 37-154: Multiple small tests repeat the same
normalize_legacy/normalize_v2 assertions; replace these duplicated single-case
tests (e.g., legacy_pattern_normalizes_to_atom,
legacy_pattern_regex_normalizes_to_regex_atom,
v2_pattern_shorthand_normalizes_to_atom, v2_regex_normalizes_to_regex_atom,
v2_all_normalizes_to_and, v2_any_normalizes_to_or, v2_not_normalizes_to_not,
v2_inside_normalizes_to_inside, v2_anywhere_normalizes_to_anywhere) with
parameterized rstest table cases that iterate inputs
(LegacyFormula::Pattern/PatternRegex/... and
MatchFormula::Pattern/Regex/All/Any/Not/Inside/Anywhere) and expected normalized
predicate types from normalize_legacy and normalize_v2; keep the
nested-structure tests (like legacy_pattern_not_regex_normalizes_to_not_regex
and legacy_pattern_not_inside_normalizes_to_not_inside) as bespoke tests. Use
#[rstest] with case(...) entries and assert matches!(result, ...) or equality as
appropriate, referencing normalize_legacy and normalize_v2 in the fixture.

---

Duplicate comments:
In `@crates/sempai/src/normalize.rs`:
- Around line 252-258: Replace the live Tree-sitter query string "(ERROR)
`@_dependency_check`" with a guaranteed non-matching placeholder so the rule
cannot accidentally match parser-recovery nodes; update the TreeSitterQueryAtom
creation in the bare(Formula::Atom(Atom::TreeSitterQuery(TreeSitterQueryAtom {
query: ... }))) expression to use a non-existent node name or an explicitly
contradictory predicate (for example a node type like "__never_node__" or a
capture with a contradiction) instead of "(ERROR) `@_dependency_check`" so the
query is impossible to satisfy at parse time.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c1d318bc-36dd-49fe-aafc-3c221f1544a6

📥 Commits

Reviewing files that changed from the base of the PR and between 46b9eb7 and bd3f5ee.

📒 Files selected for processing (3)
  • crates/sempai/src/engine.rs
  • crates/sempai/src/normalize.rs
  • crates/sempai/src/tests/normalization_tests.rs

Comment on lines +17 to +39
fn compile_rule_plans(
rule: &Rule,
formula: &Decorated<Formula>,
) -> Result<Vec<QueryPlan>, DiagnosticReport> {
rule.languages()
.iter()
.map(|lang_str| {
let language = lang_str.parse::<Language>().map_err(|e| {
DiagnosticReport::validation_error(
DiagnosticCode::ESempaiSchemaInvalid,
format!("unsupported language '{lang_str}': {e}"),
rule.rule_span().cloned(),
vec![],
)
})?;
Ok(QueryPlan::new(
rule.id().to_owned(),
language,
formula.clone(),
))
})
.collect()
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Share canonical formula instances across per-language plans.

Stop deep-cloning Decorated<Formula> for every language. Store the formula as Arc<Decorated<Formula>> in QueryPlan and clone the Arc in compile_rule_plans.

Also applies to: 58-63, 72-77

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/sempai/src/engine.rs` around lines 17 - 39, The code currently clones
Decorated<Formula> for each language in compile_rule_plans; change QueryPlan to
hold Arc<Decorated<Formula>> and update compile_rule_plans to wrap the incoming
formula in Arc::new(...) once and clone the Arc for each QueryPlan instead of
deep-cloning the formula. Update other call sites mentioned (the blocks at the
other two places referenced) to construct QueryPlan with Arc<Decorated<Formula>>
as well, and adjust signature/usages of QueryPlan::new and any fields to accept
Arc<Decorated<Formula>> so the formula is shared across per-language plans.

//!
//! - `"..."` (string shorthand) → `Formula::Atom(Atom::Pattern(...))`
//! - `pattern: "..."` → `Formula::Atom(Atom::Pattern(...))`
//! - `regex: "...` → `Formula::Atom(Atom::Regex(...))`
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix the malformed v2 regex doc bullet.

Correct Line 25 to close the inline literal (\regex: "..."``). Keep module docs syntactically clean for generated documentation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/sempai/src/normalize.rs` at line 25, Update the malformed module doc
bullet that currently reads //! - `regex: "...` →
`Formula::Atom(Atom::Regex(...))` by closing the inline code literal (change it
to //! - `regex: "..."` → `Formula::Atom(Atom::Regex(...))`) so the backtick
pair is balanced; locate the module-level doc comment containing that snippet in
normalize.rs and fix the punctuation to keep generated docs syntactically valid.

Comment on lines +37 to +154
#[test]
fn legacy_pattern_normalizes_to_atom() {
let result = normalize_legacy(LegacyFormula::Pattern(String::from("foo($X)")));
assert_eq!(
result,
Formula::Atom(Atom::Pattern(PatternAtom {
text: String::from("foo($X)")
}))
);
}

#[test]
fn legacy_pattern_regex_normalizes_to_regex_atom() {
let result = normalize_legacy(LegacyFormula::PatternRegex(String::from(r"foo_\d+")));
assert_eq!(
result,
Formula::Atom(Atom::Regex(RegexAtom {
pattern: String::from(r"foo_\d+")
}))
);
}

#[test]
fn legacy_pattern_not_regex_normalizes_to_not_regex() {
let result = normalize_legacy(LegacyFormula::PatternNotRegex(String::from(r"bar.*")));
match result {
Formula::Not(inner) => match inner.node {
Formula::Atom(Atom::Regex(regex)) => {
assert_eq!(regex.pattern, r"bar.*");
}
_ => panic!("expected Regex atom inside Not"),
},
_ => panic!("expected Not formula"),
}
}

#[test]
fn legacy_pattern_not_inside_normalizes_to_not_inside() {
let result = normalize_legacy(LegacyFormula::PatternNotInside(Box::new(
LegacyValue::String(String::from("class Foo:")),
)));
match result {
Formula::Not(not_inner) => match not_inner.node {
Formula::Inside(inside_inner) => match inside_inner.node {
Formula::Atom(Atom::Pattern(pat)) => {
assert_eq!(pat.text, "class Foo:");
}
_ => panic!("expected Pattern atom"),
},
_ => panic!("expected Inside formula inside Not"),
},
_ => panic!("expected Not formula"),
}
}

#[test]
fn v2_pattern_shorthand_normalizes_to_atom() {
let result = normalize_v2(MatchFormula::Pattern(String::from("bar($Y)")));
assert_eq!(
result,
Formula::Atom(Atom::Pattern(PatternAtom {
text: String::from("bar($Y)")
}))
);
}

#[test]
fn v2_regex_normalizes_to_regex_atom() {
let result = normalize_v2(MatchFormula::Regex(String::from(r"baz_\w+")));
assert_eq!(
result,
Formula::Atom(Atom::Regex(RegexAtom {
pattern: String::from(r"baz_\w+")
}))
);
}

#[test]
fn v2_all_normalizes_to_and() {
let result = normalize_v2(MatchFormula::All(vec![
MatchFormula::Pattern(String::from("foo")),
MatchFormula::Pattern(String::from("bar")),
]));
assert!(matches!(result, Formula::And(ref branches) if branches.len() == 2));
}

#[test]
fn v2_any_normalizes_to_or() {
let result = normalize_v2(MatchFormula::Any(vec![
MatchFormula::Pattern(String::from("foo")),
MatchFormula::Pattern(String::from("bar")),
]));
assert!(matches!(result, Formula::Or(ref branches) if branches.len() == 2));
}

#[test]
fn v2_not_normalizes_to_not() {
let result = normalize_v2(MatchFormula::Not(Box::new(MatchFormula::Pattern(
String::from("baz"),
))));
assert!(matches!(result, Formula::Not(_)));
}

#[test]
fn v2_inside_normalizes_to_inside() {
let result = normalize_v2(MatchFormula::Inside(Box::new(MatchFormula::Pattern(
String::from("class X:"),
))));
assert!(matches!(result, Formula::Inside(_)));
}

#[test]
fn v2_anywhere_normalizes_to_anywhere() {
let result = normalize_v2(MatchFormula::Anywhere(Box::new(MatchFormula::Pattern(
String::from("unsafe"),
))));
assert!(matches!(result, Formula::Anywhere(_)));
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Parameterize duplicated normalisation tests with rstest.

Replace the repeated single-case tests with #[rstest] table cases and keep bespoke tests only for nested-structure assertions.

Refactor pattern to remove duplication
+use rstest::rstest;
+
 #[test]
-fn v2_pattern_shorthand_normalizes_to_atom() {
-    let result = normalize_v2(MatchFormula::Pattern(String::from("bar($Y)")));
-    assert_eq!(
-        result,
-        Formula::Atom(Atom::Pattern(PatternAtom {
-            text: String::from("bar($Y)")
-        }))
-    );
-}
-
-#[test]
-fn v2_regex_normalizes_to_regex_atom() {
-    let result = normalize_v2(MatchFormula::Regex(String::from(r"baz_\w+")));
-    assert_eq!(
-        result,
-        Formula::Atom(Atom::Regex(RegexAtom {
-            pattern: String::from(r"baz_\w+")
-        }))
-    );
+#[rstest]
+#[case(
+    MatchFormula::Pattern(String::from("bar($Y)")),
+    Formula::Atom(Atom::Pattern(PatternAtom { text: String::from("bar($Y)") }))
+)]
+#[case(
+    MatchFormula::Regex(String::from(r"baz_\w+")),
+    Formula::Atom(Atom::Regex(RegexAtom { pattern: String::from(r"baz_\w+") }))
+)]
+fn v2_leaf_forms_normalize(#[case] input: MatchFormula, #[case] expected: Formula) {
+    let result = normalize_v2(input);
+    assert_eq!(result, expected);
 }

As per coding guidelines: "Use rstest fixtures for shared setup in Rust tests." and "Replace duplicated tests with #[rstest(...)] parameterised cases in Rust."

Also applies to: 156-193

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/sempai/src/tests/normalization_tests.rs` around lines 37 - 154,
Multiple small tests repeat the same normalize_legacy/normalize_v2 assertions;
replace these duplicated single-case tests (e.g.,
legacy_pattern_normalizes_to_atom,
legacy_pattern_regex_normalizes_to_regex_atom,
v2_pattern_shorthand_normalizes_to_atom, v2_regex_normalizes_to_regex_atom,
v2_all_normalizes_to_and, v2_any_normalizes_to_or, v2_not_normalizes_to_not,
v2_inside_normalizes_to_inside, v2_anywhere_normalizes_to_anywhere) with
parameterized rstest table cases that iterate inputs
(LegacyFormula::Pattern/PatternRegex/... and
MatchFormula::Pattern/Regex/All/Any/Not/Inside/Anywhere) and expected normalized
predicate types from normalize_legacy and normalize_v2; keep the
nested-structure tests (like legacy_pattern_not_regex_normalizes_to_not_regex
and legacy_pattern_not_inside_normalizes_to_not_inside) as bespoke tests. Use
#[rstest] with case(...) entries and assert matches!(result, ...) or equality as
appropriate, referencing normalize_legacy and normalize_v2 in the fixture.

Comment on lines +260 to +267
#[test]
fn span_propagates_from_search_principal_to_decorated() {
let span = SourceSpan::new(5, 42, Some(String::from("file:///rule.yaml")));
let principal = SearchQueryPrincipal::Match(MatchFormula::Pattern(String::from("foo($X)")));
let normalized = normalize_search_principal(&principal, Some(&span));

assert_eq!(normalized.span, Some(span));
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add explicit coverage for the ProjectDependsOn normalisation branch.

Add a regression test that constructs SearchQueryPrincipal::ProjectDependsOn(...) and asserts the canonical output shape and metadata propagation. Lock this branch with a dedicated assertion to prevent silent regressions in normalize_search_principal dispatch coverage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/sempai/src/tests/normalization_tests.rs` around lines 260 - 267, Add a
new unit test that constructs a SearchQueryPrincipal::ProjectDependsOn variant
and calls normalize_search_principal with a non‑None SourceSpan to ensure the
ProjectDependsOn branch is exercised; assert the returned normalized principal
has the canonical shape you expect for ProjectDependsOn (match the same
structural form used elsewhere in tests) and that normalized.span == Some(span)
to lock metadata propagation for this branch. Locate the normalization
entrypoint normalize_search_principal and mirror the style of the existing
span_propagates_from_search_principal_to_decorated test: create a SourceSpan,
build a ProjectDependsOn principal value, call
normalize_search_principal(&principal, Some(&span)), then add explicit
assertions for both the normalized principal variant/contents and the span
propagation.

@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Apr 24, 2026

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix. Include the file and symbol names indicated in the issue at the head of your response.

Code Duplication

crates/sempai/src/tests/engine_tests.rs:

What lead to degradation?

The module contains 2 functions with similar structure: compile_yaml_returns_semantic_error_for_invalid_not_in_or,compile_yaml_returns_semantic_error_for_missing_positive_term_in_and

Why does this problem occur?

Duplicated code often leads to code that's harder to change since the same logical change has to be done in multiple functions. More duplication gives lower code health.

How to fix it?

A certain degree of duplicated code might be acceptable. The problems start when it is the same behavior that is duplicated across the functions in the module, ie. a violation of the Don't Repeat Yourself (DRY) principle. DRY violations lead to code that is changed together in predictable patterns, which is both expensive and risky. DRY violations can be identified using CodeScene's X-Ray analysis to detect clusters of change coupled functions with high code similarity. Read More
Once you have identified the similarities across functions, look to extract and encapsulate the concept that varies into its own function(s). These shared abstractions can then be re-used, which minimizes the amount of duplication and simplifies change.

@coderabbitai

This comment was marked as resolved.

…l success

- Semantic validation: make `is_positive_term` recursive so that shapes such
  as `And[Or[Inside[Atom]]]` are correctly flagged as
  `MissingPositiveTermInAnd` instead of being accepted on the basis of the
  outer combinator shape alone. Add a targeted test for the
  `And[Or[Inside]]` case.
- Normalization: replace the dependency-principal placeholder query
  `(ERROR) @_dependency_check` with `(__NONEXISTENT_NODE__) @_dependency_check`
  so it cannot match real Tree-sitter ERROR nodes.
- BDD: update `sempai_engine.feature` to assert that valid search rules
  (including `r2c-internal-project-depends-on`) now compile successfully
  with a single query plan, add the matching `Then` steps, and switch the
  test world's `compile_result` to `Result<Vec<QueryPlan>, DiagnosticReport>`.
- Engine tests: collapse the two `compile_yaml_*success*` cases into a
  single `#[rstest]` parameterized test covering both the legacy pattern
  and the project-depends-on search rule.
- Docs: correct the ExecPlan to describe `Decorated<T>` with the shipped
  field names (`where_clauses`, `as_name`, `fix`, `span: Option<SourceSpan>`),
  and make explicit that normalization lives in
  `crates/sempai/src/normalize.rs` and semantic validation in
  `crates/sempai/src/semantic_check.rs`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
codescene-delta-analysis[bot]

This comment was marked as outdated.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
crates/sempai/src/normalize.rs (1)

25-25: ⚠️ Potential issue | 🟡 Minor

Fix the malformed v2 regex doc bullet.

Close the inline literal: change `regex: "...` to `regex: "..."`.

Proposed fix
-//! - `regex: "...` → `Formula::Atom(Atom::Regex(...))`
+//! - `regex: "..."` → `Formula::Atom(Atom::Regex(...))`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/sempai/src/normalize.rs` at line 25, The doc comment in
crates/sempai/src/normalize.rs contains a malformed inline literal in the v2
regex bullet (`regex: "...`); update that inline backtick sequence to close
properly so it reads `regex: "..."` (i.e., change the broken `` `regex: "..."``
to `` `regex: "..."` ``) to fix the documentation formatting.
crates/sempai/src/semantic_check.rs (1)

48-65: ⚠️ Potential issue | 🟠 Major

Nested negations inside Or branches slip through validation.

The current check at line 51 only detects Formula::Not at the immediate branch level. A structure like Or([And([Atom(..), Not(..)])]) passes because the recursion at line 62 drops the disjunction context before reaching the nested Not.

Add a helper that recursively searches for any Not within a branch before rejecting it.

Proposed fix
+fn branch_contains_not(formula: &Formula) -> bool {
+    match formula {
+        Formula::Not(_) => true,
+        Formula::And(branches) | Formula::Or(branches) => {
+            branches.iter().any(|branch| branch_contains_not(&branch.node))
+        }
+        Formula::Inside(inner) | Formula::Anywhere(inner) => branch_contains_not(&inner.node),
+        Formula::Atom(_) => false,
+    }
+}
+
 fn check_no_not_in_or(
     formula: &Formula,
     span: Option<&sempai_core::SourceSpan>,
 ) -> Result<(), DiagnosticReport> {
     match formula {
         Formula::Or(branches) => {
             for branch in branches {
-                if matches!(branch.node, Formula::Not(_)) {
+                if branch_contains_not(&branch.node) {
                     return Err(DiagnosticReport::validation_error(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/sempai/src/semantic_check.rs` around lines 48 - 65, The current check
in check_no_not_in_or only rejects immediate Formula::Not children of a
Formula::Or branch but loses the "inside Or" context when it recurses, so nested
Not nodes slip through; modify check_no_not_in_or (or add a small helper like
contains_not_recursive) that, when called on each branch.node of a Formula::Or,
walks the formula tree and returns true if any Formula::Not is found anywhere
inside that branch (without clearing the "in-or" context), and then produce the
existing DiagnosticReport using branch.span.clone().or_else(|| span.cloned()) if
the helper finds a Not; keep using the same
DiagnosticCode::ESempaiInvalidNotInOr and error message text.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/sempai/src/tests/engine_tests.rs`:
- Around line 103-135: Extract a shared test helper or parameterize the two
tests using rstest: move the common setup (default_engine(),
engine.compile_yaml(...), first_diagnostic_of_err(result)) into a helper
function like assert_compile_yaml_semantic_error(yaml: &str) -> DiagnosticCode
and call it from the existing tests, or replace both test functions
(compile_yaml_returns_semantic_error_for_invalid_not_in_or and
compile_yaml_returns_semantic_error_for_missing_positive_term_in_and) with a
single #[rstest] that provides the YAML string and expected DiagnosticCode pairs
and asserts equality using default_engine(), engine.compile_yaml(...),
first_diagnostic_of_err(result), and DiagnosticCode::ESempaiInvalidNotInOr /
DiagnosticCode::ESempaiMissingPositiveTermInAnd; ensure the helper/parametrized
test references the same symbols (default_engine, engine.compile_yaml,
first_diagnostic_of_err, DiagnosticCode) so behavior is unchanged.

In `@crates/sempai/src/tests/semantic_validation_tests.rs`:
- Around line 151-180: Add a new unit test asserting validation fails when a Not
is nested inside an And that is itself a branch of an Or (e.g., Or[ And[ Atom,
Not[...] ], ... ]). Create a test (e.g., nested_not_in_and_inside_or_fails) that
constructs a Decorated node using Formula::Or containing a Decorated
Formula::And which includes make_pattern("atom") and a Decorated Formula::Not,
call validate_formula(&formula), expect_err, and assert the first diagnostic
code equals DiagnosticCode::ESempaiInvalidNotInOr to ensure the recursive
branch_contains_not detection catches deeply nested negations.

---

Duplicate comments:
In `@crates/sempai/src/normalize.rs`:
- Line 25: The doc comment in crates/sempai/src/normalize.rs contains a
malformed inline literal in the v2 regex bullet (`regex: "...`); update that
inline backtick sequence to close properly so it reads `regex: "..."` (i.e.,
change the broken `` `regex: "..."`` to `` `regex: "..."` ``) to fix the
documentation formatting.

In `@crates/sempai/src/semantic_check.rs`:
- Around line 48-65: The current check in check_no_not_in_or only rejects
immediate Formula::Not children of a Formula::Or branch but loses the "inside
Or" context when it recurses, so nested Not nodes slip through; modify
check_no_not_in_or (or add a small helper like contains_not_recursive) that,
when called on each branch.node of a Formula::Or, walks the formula tree and
returns true if any Formula::Not is found anywhere inside that branch (without
clearing the "in-or" context), and then produce the existing DiagnosticReport
using branch.span.clone().or_else(|| span.cloned()) if the helper finds a Not;
keep using the same DiagnosticCode::ESempaiInvalidNotInOr and error message
text.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7ce5cff4-a1da-49ae-ab84-e1a14bd51d32

📥 Commits

Reviewing files that changed from the base of the PR and between bd3f5ee and ed1c94e.

📒 Files selected for processing (7)
  • crates/sempai/src/normalize.rs
  • crates/sempai/src/semantic_check.rs
  • crates/sempai/src/tests/behaviour.rs
  • crates/sempai/src/tests/engine_tests.rs
  • crates/sempai/src/tests/semantic_validation_tests.rs
  • crates/sempai/tests/features/sempai_engine.feature
  • docs/execplans/4-1-5-normalization-into-canonical-formula-model.md

Comment thread crates/sempai/src/tests/engine_tests.rs
Comment on lines +151 to +180
#[test]
fn nested_or_in_and_with_not_fails() {
let nested_or = Decorated {
node: Formula::Or(vec![
make_pattern("a"),
Decorated {
node: Formula::Not(Box::new(make_pattern("b"))),
where_clauses: vec![],
as_name: None,
fix: None,
span: None,
},
]),
where_clauses: vec![],
as_name: None,
fix: None,
span: None,
};
let formula = Decorated {
node: Formula::And(vec![make_pattern("foo"), nested_or]),
where_clauses: vec![],
as_name: None,
fix: None,
span: None,
};
let result = validate_formula(&formula);
let err = result.expect_err("should fail validation");
let first = err.diagnostics().first().expect("should have diagnostic");
assert_eq!(first.code(), DiagnosticCode::ESempaiInvalidNotInOr);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Add a test case for deeply nested Not inside Or branch.

This test validates And[foo, Or[a, Not[b]]], but the nested Not is a direct child of the Or. Once the branch_contains_not fix is applied to semantic_check.rs, add a test for Or[And[Atom, Not[...]]] to verify the recursive detection catches negations buried deeper within a branch.

#!/bin/bash
# Description: Check if any test exercises Or with And containing Not as a nested child.
# Expected: No existing test for Or([And([Atom, Not])]) pattern.
rg -n 'Formula::Or.*Formula::And.*Formula::Not' crates/sempai/src/tests/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/sempai/src/tests/semantic_validation_tests.rs` around lines 151 - 180,
Add a new unit test asserting validation fails when a Not is nested inside an
And that is itself a branch of an Or (e.g., Or[ And[ Atom, Not[...] ], ... ]).
Create a test (e.g., nested_not_in_and_inside_or_fails) that constructs a
Decorated node using Formula::Or containing a Decorated Formula::And which
includes make_pattern("atom") and a Decorated Formula::Not, call
validate_formula(&formula), expect_err, and assert the first diagnostic code
equals DiagnosticCode::ESempaiInvalidNotInOr to ensure the recursive
branch_contains_not detection catches deeply nested negations.

@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Apr 24, 2026

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix. Include the file and symbol names indicated in the issue at the head of your response.

crates/sempai/src/tests/engine_tests.rs

Comment on lines +121 to +134

fn compile_yaml_returns_semantic_error_for_missing_positive_term_in_and() {
    let engine = default_engine();
    let result = engine.compile_yaml(concat!(
        "rules:\n",
        "  - id: demo.missing.positive.term.in.and\n",
        "    message: missing positive term in and\n",
        "    languages: [rust]\n",
        "    severity: ERROR\n",
        "    patterns:\n",
        "      - pattern-not: foo($X)\n",
        "      - pattern-inside: bar($Y)\n",
    ));
    let (code, _diag) = first_diagnostic_of_err(result);
    assert_eq!(code, DiagnosticCode::ESempaiMissingPositiveTermInAnd);

❌ New issue: Code Duplication
The module contains 2 functions with similar structure: compile_yaml_returns_semantic_error_for_invalid_not_in_or,compile_yaml_returns_semantic_error_for_missing_positive_term_in_and

@coderabbitai

This comment was marked as resolved.

Refactored repeated code in engine_tests.rs by introducing a helper
function `assert_compile_yaml_semantic_error`. This DRYs up tests that
check for expected semantic error diagnostic codes from YAML compilation.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
@leynos leynos changed the title Normalize legacy and v2 into canonical Formula with semantic checks Implement normalization into canonical Formula with semantic checks Apr 24, 2026
Copy link
Copy Markdown

@codescene-delta-analysis codescene-delta-analysis Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gates Failed
Enforce advisory code health rules (1 file with Code Duplication)

Gates Passed
5 Quality Gates Passed

See analysis details in CodeScene

Reason for failure
Enforce advisory code health rules Violations Code Health Impact
engine_tests.rs 1 advisory rule 10.00 → 9.39 Suppress

Quality Gate Profile: Pay Down Tech Debt
Install CodeScene MCP: safeguard and uplift AI-generated code. Catch issues early with our IDE extension and CLI tool.

Comment on lines +121 to +134
fn compile_yaml_returns_semantic_error_for_missing_positive_term_in_and() {
assert_compile_yaml_semantic_error(
concat!(
"rules:\n",
" - id: demo.missing.positive.term.in.and\n",
" message: missing positive term in and\n",
" languages: [rust]\n",
" severity: ERROR\n",
" patterns:\n",
" - pattern-not: foo($X)\n",
" - pattern-inside: bar($Y)\n",
),
DiagnosticCode::ESempaiMissingPositiveTermInAnd,
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ New issue: Code Duplication
The module contains 2 functions with similar structure: compile_yaml_returns_semantic_error_for_invalid_not_in_or,compile_yaml_returns_semantic_error_for_missing_positive_term_in_and

Suppress

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant