Implement canonical Formula normalization; wire into engine (4.1.5) by leynos · Pull Request #105 · leynos/weaver

leynos · 2026-04-12T11:35:42Z

Summary

Implement end-to-end normalization of both legacy Semgrep operators and v2 match operators into a canonical Formula model, expose a stable interface, and wire this through the engine. This completes milestone 4.1.5 and replaces the previous documentation-only patch with a full implementation path.
The ExecPlan document remains the planning artifact for this milestone and has been complemented by concrete code changes and tests.

Changes

Core formula model
- New: crates/sempai-core/src/formula.rs, introducing the canonical normalised formula model with
  - Atom enum (Pattern, Regex, TreeSitterQuery)
  - Decorated wrapper carrying where/as/fix metadata
  - Formula enum: Atom, Not, Inside, Anywhere, And, Or, Constraint
- Public re-exports added in crates/sempai-core/src/lib.rs for Atom, Decorated, and Formula.
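For orientation, a minimal sketch of how these three types could fit together. It follows the variant and field names in this description and the Box/Vec shapes used in the tests, but it is not the code from formula.rs. `String` stands in for the opaque JSON payloads, and the `bare` constructor and `example_formula` helper are hypothetical.

```rust
/// Leaf matchers of the canonical model.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Atom {
    Pattern(String),
    Regex(String),
    TreeSitterQuery(String),
}

/// Wrapper carrying where/as/fix metadata alongside a node.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Decorated<T> {
    pub node: T,
    /// Assumption: String stands in for an opaque JSON value.
    pub where_clauses: Vec<String>,
    pub as_name: Option<String>,
    pub fix: Option<String>,
}

impl<T> Decorated<T> {
    /// Hypothetical helper: wrap a node with no metadata attached.
    pub fn bare(node: T) -> Self {
        Self { node, where_clauses: Vec::new(), as_name: None, fix: None }
    }
}

/// Canonical formula tree shared by the legacy and v2 front ends.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Formula {
    Atom(Atom),
    Not(Box<Decorated<Formula>>),
    Inside(Box<Decorated<Formula>>),
    Anywhere(Box<Decorated<Formula>>),
    And(Vec<Decorated<Formula>>),
    Or(Vec<Decorated<Formula>>),
    /// Assumption: String stands in for an opaque JSON constraint.
    Constraint(String),
}

/// Builds `And(pattern "a", Not(pattern "b"))`.
pub fn example_formula() -> Formula {
    let a = Formula::Atom(Atom::Pattern("a".to_owned()));
    let b = Formula::Atom(Atom::Pattern("b".to_owned()));
    Formula::And(vec![
        Decorated::bare(a),
        Decorated::bare(Formula::Not(Box::new(Decorated::bare(b)))),
    ])
}
```

Boxing the single-child variants keeps the recursive enum finitely sized, while `And` and `Or` get their indirection from `Vec`.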
Normalisation pipeline (sempai crate)
- New modular normalization: crates/sempai/src/normalise/mod.rs plus
  - constraints.rs, legacy.rs, v2.rs
- Public function: normalise_search_principal(...) to map from YAML/primitives to Option<Formula> (None for ProjectDependsOn)
- Public function: validate_formula_constraints(...) to enforce semantic invariants on the canonical tree
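As a rough illustration of the entry point, the sketch below lowers a cut-down legacy model into a cut-down formula. It is a simplification under stated assumptions: the `Decorated` wrapper and the diagnostic error channel are omitted, and the `Legacy` and `Principal` enums are hypothetical stand-ins for the sempai_yaml types. The operator mapping shown (patterns to And, pattern-either to Or, pattern-not-inside to Not(Inside(...))) follows the mapping described in this PR.

```rust
/// Simplified stand-in for the canonical formula (no Decorated wrapper).
#[derive(Debug, Clone, PartialEq)]
pub enum Formula {
    Pattern(String),
    Regex(String),
    Not(Box<Formula>),
    Inside(Box<Formula>),
    And(Vec<Formula>),
    Or(Vec<Formula>),
}

/// Hypothetical cut-down legacy operator model.
pub enum Legacy {
    Pattern(String),
    PatternRegex(String),
    Patterns(Vec<Legacy>),
    PatternEither(Vec<Legacy>),
    PatternNot(Box<Legacy>),
    PatternInside(Box<Legacy>),
    PatternNotInside(Box<Legacy>),
}

/// Hypothetical cut-down search principal.
pub enum Principal {
    Legacy(Legacy),
    ProjectDependsOn,
}

/// Lowers one legacy operator into the canonical shape.
pub fn normalise_legacy(l: &Legacy) -> Formula {
    match l {
        Legacy::Pattern(p) => Formula::Pattern(p.clone()),
        Legacy::PatternRegex(r) => Formula::Regex(r.clone()),
        Legacy::Patterns(cs) => Formula::And(cs.iter().map(normalise_legacy).collect()),
        Legacy::PatternEither(cs) => Formula::Or(cs.iter().map(normalise_legacy).collect()),
        Legacy::PatternNot(c) => Formula::Not(Box::new(normalise_legacy(c))),
        Legacy::PatternInside(c) => Formula::Inside(Box::new(normalise_legacy(c))),
        // pattern-not-inside lowers to Not(Inside(...)).
        Legacy::PatternNotInside(c) => {
            Formula::Not(Box::new(Formula::Inside(Box::new(normalise_legacy(c)))))
        }
    }
}

/// Returns None for ProjectDependsOn, which carries no formula.
pub fn normalise_search_principal(p: &Principal) -> Option<Formula> {
    match p {
        Principal::Legacy(l) => Some(normalise_legacy(l)),
        Principal::ProjectDependsOn => None,
    }
}
```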
Engine integration
- crates/sempai/src/engine.rs updated to:
  - Carry an optional Formula in QueryPlan instead of a placeholder
  - Wire normalization into compile_yaml: for each search-rule, normalise the principal, validate constraints, and attach the resulting Formula to the plan
  - Expose a formula() accessor on QueryPlan
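A minimal sketch of the plan shape this implies, assuming a private optional field behind a borrowing accessor. The `Formula` placeholder, the `new` constructor and the `rule_id` accessor are hypothetical; only the `Option<Formula>` field and the `formula()` accessor come from this PR.

```rust
/// Placeholder for the canonical formula; the real type lives in sempai_core.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Formula {
    Pattern(String),
}

/// Hypothetical plan shape: a rule id plus an optional normalised formula.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct QueryPlan {
    rule_id: String,
    formula: Option<Formula>,
}

impl QueryPlan {
    /// Hypothetical constructor used only for this illustration.
    pub fn new(rule_id: impl Into<String>, formula: Option<Formula>) -> Self {
        Self { rule_id: rule_id.into(), formula }
    }

    /// Identifier of the rule this plan was compiled from.
    pub fn rule_id(&self) -> &str {
        &self.rule_id
    }

    /// Borrows the formula, or None for ProjectDependsOn passthrough plans.
    pub fn formula(&self) -> Option<&Formula> {
        self.formula.as_ref()
    }
}
```

Returning `Option<&Formula>` rather than `&Option<Formula>` keeps the storage detail out of the public interface.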
Tests
- Unit tests for formula types: crates/sempai-core/src/tests/formula_tests.rs
- Normalisation tests: crates/sempai/src/tests/normalise_tests.rs
- Constraint validation tests: crates/sempai/src/tests/constraint_tests.rs
- Engine tests updated to expect plans carrying Formula and to validate plan.formula() behaviour
- Re-export tests updated to reflect new Formula types
Behaviour and features
- BDD/tests feature file for formula normalization added: crates/sempai/tests/features/formula_normalization.feature
- Updated tests for engine behaviour to verify plan count and presence/absence of formulas
Documentation and planning artefacts
- ExecPlan added/updated: docs/execplans/4-1-5-normalization-into-canonical-formula-model.md
- Roadmap and design references adjusted to reflect normalization path and API changes
Task reference
- Task: https://www.devboxer.com/task/f1a7d325-cc6e-481e-bc46-c7d7fe3f0c56

Artefacts

New files:
- crates/sempai-core/src/formula.rs
- crates/sempai-core/src/tests/formula_tests.rs
- crates/sempai/src/normalise/mod.rs
- crates/sempai/src/normalise/constraints.rs
- crates/sempai/src/normalise/legacy.rs
- crates/sempai/src/normalise/v2.rs
- crates/sempai/src/normalise/README (if any) and related test scaffolding
- crates/sempai/src/tests/normalise_tests.rs
- crates/sempai/src/tests/constraint_tests.rs
- crates/sempai/src/tests/engine_tests.rs (updated expectations)
- crates/sempai/tests/features/formula_normalization.feature
- docs/execplans/4-1-5-normalization-into-canonical-formula-model.md
The ExecPlan document remains the gating artifact for this milestone and is now complemented by concrete code/tests.

Rationale & design decisions

Crate placement: canonical Formula and its types live in sempai-core to serve downstream consumers; normalization logic lives in the sempai crate to avoid circular dependencies with sempai_yaml. This aligns with the Option B approach described in the ExecPlan.
Semantics and constraints: metavariable constraints are preserved via Formula::Constraint, while deterministic semantic checks are enforced by validate_formula_constraints. This keeps parsing and structural validation cleanly separate from semantic rule validation.
ProjectDependsOn passthrough: Rules of this kind produce plans with formula = None, allowing downstream handling without forcing a formula.
Decorated metadata: v2 Decorated metadata is preserved on inner formulas; top-level normalization returns the inner Formula node, but metadata continues to flow through nested Decorated contexts.
Test-intensive approach: unit tests for the core Formula model, normalization maps, semantic constraints, and engine integration tests ensure broad coverage of happy paths, edge cases, and error states before stable API usage.
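To make the semantic checks concrete, here is a sketch of a validator over a simplified formula tree. It is not the code in constraints.rs. The `ConstraintError` enum is a hypothetical stand-in for the diagnostic codes, the `Decorated` wrapper is omitted, and the definition of a positive term (atoms and nested And/Or, but not Not, Inside, Anywhere or Constraint) is an assumption.

```rust
/// Simplified stand-in for the canonical formula (no Decorated wrapper).
#[derive(Debug, Clone, PartialEq)]
pub enum Formula {
    Atom(String),
    Not(Box<Formula>),
    Inside(Box<Formula>),
    Anywhere(Box<Formula>),
    And(Vec<Formula>),
    Or(Vec<Formula>),
    Constraint(String),
}

/// Hypothetical error type; the PR reports these via DiagnosticReport.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConstraintError {
    InvalidNotInOr,
    MissingPositiveTermInAnd,
}

/// Assumption: atoms and nested And/Or count as positive terms.
pub fn is_positive_term(f: &Formula) -> bool {
    matches!(f, Formula::Atom(_) | Formula::And(_) | Formula::Or(_))
}

/// Walks the tree and returns the first violated invariant, if any.
pub fn validate_formula_constraints(f: &Formula) -> Result<(), ConstraintError> {
    match f {
        Formula::Atom(_) | Formula::Constraint(_) => Ok(()),
        Formula::Not(inner) | Formula::Inside(inner) | Formula::Anywhere(inner) => {
            validate_formula_constraints(inner)
        }
        Formula::Or(children) => {
            // A negated branch directly under Or is rejected.
            if children.iter().any(|c| matches!(c, Formula::Not(_))) {
                return Err(ConstraintError::InvalidNotInOr);
            }
            children.iter().try_for_each(validate_formula_constraints)
        }
        Formula::And(children) => {
            // A conjunction needs at least one positive term to anchor matches.
            if !children.iter().any(is_positive_term) {
                return Err(ConstraintError::MissingPositiveTermInAnd);
            }
            children.iter().try_for_each(validate_formula_constraints)
        }
    }
}
```

Because the function only reads the tree and returns a value, it can run as a separate pass after normalisation without touching the parser.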

Testing plan

Unit tests:
- Formula type construction and equality (atoms, regex, TSQuery)
- Decorated wrapper behaviour
- Formula construction (Not/Inside/Anywhere/And/Or/Constraint) and is_positive_term classification
Normalisation tests:
- Legacy and v2 normalization paths (Pattern, PatternRegex, Patterns, PatternEither, etc.)
- Paired equivalence tests for legacy vs v2 mappings
- Decorated v2 metadata preservation within nested structures
- Deep nesting cases
Constraint tests:
- InvalidNotInOr when Not appears under Or
- MissingPositiveTermInAnd across various compositions, including constraints edge cases
Engine tests:
- compile_yaml produces plans with formulas for valid rules
- compile_yaml produces plans without formulas for ProjectDependsOn rules
- Execution path remains not implemented for now, with plans carrying the formula for future backend
Behaviour/BDD tests:
- Updated/added scenarios to reflect formula normalization outcomes
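The paired equivalence tests in the plan above could take roughly the following shape: lower a legacy rule and its v2 counterpart, then compare the results. The `Legacy`, `Match` and `Formula` enums here are hypothetical, heavily reduced stand-ins for the real models, and `legacy_and_v2_agree` is an illustrative helper rather than a test from this PR.

```rust
/// Reduced stand-in for the canonical formula.
#[derive(Debug, Clone, PartialEq)]
pub enum Formula {
    Pattern(String),
    And(Vec<Formula>),
    Or(Vec<Formula>),
}

/// Hypothetical reduced legacy model.
pub enum Legacy {
    Pattern(String),
    Patterns(Vec<Legacy>),
    PatternEither(Vec<Legacy>),
}

/// Hypothetical reduced v2 match model.
pub enum Match {
    Pattern(String),
    All(Vec<Match>),
    Any(Vec<Match>),
}

/// Lowers legacy operators: patterns to And, pattern-either to Or.
pub fn normalise_legacy(l: &Legacy) -> Formula {
    match l {
        Legacy::Pattern(p) => Formula::Pattern(p.clone()),
        Legacy::Patterns(cs) => Formula::And(cs.iter().map(normalise_legacy).collect()),
        Legacy::PatternEither(cs) => Formula::Or(cs.iter().map(normalise_legacy).collect()),
    }
}

/// Lowers v2 operators: all to And, any to Or.
pub fn normalise_match(m: &Match) -> Formula {
    match m {
        Match::Pattern(p) => Formula::Pattern(p.clone()),
        Match::All(cs) => Formula::And(cs.iter().map(normalise_match).collect()),
        Match::Any(cs) => Formula::Or(cs.iter().map(normalise_match).collect()),
    }
}

/// Checks that equivalent legacy and v2 rules lower to the same tree.
pub fn legacy_and_v2_agree() -> bool {
    let legacy = Legacy::Patterns(vec![
        Legacy::Pattern("a".to_owned()),
        Legacy::PatternEither(vec![
            Legacy::Pattern("b".to_owned()),
            Legacy::Pattern("c".to_owned()),
        ]),
    ]);
    let v2 = Match::All(vec![
        Match::Pattern("a".to_owned()),
        Match::Any(vec![
            Match::Pattern("b".to_owned()),
            Match::Pattern("c".to_owned()),
        ]),
    ]);
    normalise_legacy(&legacy) == normalise_match(&v2)
}
```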

Milestones progress (high level)

Status: COMPLETE for 4.1.5 (Implement legacy and v2 normalization into one canonical Formula model with semantic constraint checks). All tests and gates updated accordingly.

Task

https://www.devboxer.com/task/f1a7d325-cc6e-481e-bc46-c7d7fe3f0c56

📎 Task: https://www.devboxer.com/task/21bc5583-37ef-4f65-88fa-8c6f1f8f662c

Add a canonical Formula enum and associated Atom and Decorated types to sempai_core. Implement normalization of legacy and v2 Semgrep operators into the canonical Formula model. Enforce semantic constraints rejecting invalid formula shapes with diagnostic codes. Update Engine::compile_yaml to produce QueryPlan structs carrying normalized Formulas instead of placeholders. Preserve legacy constraints as opaque Formula::Constraint nodes. Add extensive unit and BDD tests covering normalization and constraint validation. Update documentation and roadmap to reflect new normalization and validation behavior. This completes roadmap item 4.1.5 with fully tested normalization pipeline and user-visible error reporting. Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>

coderabbitai · 2026-04-12T11:35:54Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0004a6be-9753-4133-b494-1a21ed7131dd

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


sourcery-ai · 2026-04-12T11:36:33Z

Reviewer's Guide

Implements a canonical Formula model in sempai_core and wires a pure normalization + semantic-validation pipeline (from legacy and v2 YAML search principals) through the sempai engine so Engine::compile_yaml now returns real QueryPlans carrying Formulas, backed by unit/BDD tests and documentation updates.

Class diagram for canonical Formula model and engine integration

classDiagram
    direction LR

    class Atom {
        <<enum>>
        +Pattern : String
        +Regex : String
        +TreeSitterQuery : String
    }

    class Decorated_T_ {
        <<generic>>
        +node : T
        +where_clauses : Vec~serde_json::Value~
        +as_name : Option~String~
        +fix : Option~String~
    }

    class Formula {
        <<enum>>
        +Atom : Atom
        +Not : Decorated~Formula~
        +Inside : Decorated~Formula~
        +Anywhere : Decorated~Formula~
        +And : Vec~Decorated~Formula~~
        +Or : Vec~Decorated~Formula~~
        +Constraint : serde_json::Value
    }

    class SearchQueryPrincipal {
        <<from_sempai_yaml>>
        +Legacy : LegacyFormula
        +Match : MatchFormula
        +ProjectDependsOn
    }

    class LegacyFormula {
        <<from_sempai_yaml>>
        +Pattern : String
        +PatternRegex : String
        +Patterns : Vec~LegacyClause~
        +PatternEither : Vec~LegacyFormula~
        +PatternNot : LegacyValue
        +PatternInside : LegacyValue
        +PatternNotInside : LegacyValue
        +PatternNotRegex : String
        +Anywhere : LegacyValue
    }

    class LegacyValue {
        <<from_sempai_yaml>>
        +String : String
        +Formula : LegacyFormula
    }

    class LegacyClause {
        <<from_sempai_yaml>>
        +Formula : LegacyFormula
        +Constraint : serde_json::Value
    }

    class MatchFormula {
        <<from_sempai_yaml>>
        +Pattern : String
        +PatternObject : String
        +Regex : String
        +All : Vec~MatchFormula~
        +Any : Vec~MatchFormula~
        +Not : Box~MatchFormula~
        +Inside : Box~MatchFormula~
        +Anywhere : Box~MatchFormula~
        +Decorated : MatchDecorated
    }

    class MatchDecorated {
        <<from_sempai_yaml>>
        +formula : Box~MatchFormula~
        +where_clauses : Vec~serde_json::Value~
        +as_name : Option~String~
        +fix : Option~String~
    }

    class DiagnosticReport {
        <<from_sempai_core>>
        +codes : Vec~DiagnosticCode~
    }

    class QueryPlan {
        +rule_id : String
        +language : Language
        +formula : Option~Formula~
        +formula() : Option~&Formula~
    }

    class Engine {
        +compile_yaml(rule_file_path : &str) : Result~Vec~QueryPlan~, DiagnosticReport~
        +execute(plans : &Vec~QueryPlan~) : Result~(), DiagnosticReport~
    }

    class NormaliseModule {
        <<sempai_normalise>>
        +normalise_search_principal(principal : &SearchQueryPrincipal) : Result~Formula, DiagnosticReport~
        +normalise_legacy(formula : &LegacyFormula) : Result~Formula, DiagnosticReport~
        +normalise_match(formula : &MatchFormula) : Result~Formula, DiagnosticReport~
        +validate_formula_constraints(formula : &Formula) : Result~(), DiagnosticReport~
    }

    Formula o--> Atom
    Formula "1" o--> "many" Decorated_T_
    Decorated_T_ "1" o--> "1" Formula : T=Formula

    SearchQueryPrincipal --> LegacyFormula
    SearchQueryPrincipal --> MatchFormula

    LegacyFormula --> LegacyClause
    LegacyFormula --> LegacyValue
    LegacyClause --> LegacyFormula

    MatchFormula --> MatchDecorated
    MatchDecorated --> MatchFormula

    NormaliseModule ..> SearchQueryPrincipal
    NormaliseModule ..> LegacyFormula
    NormaliseModule ..> MatchFormula
    NormaliseModule ..> Formula
    NormaliseModule ..> DiagnosticReport

    Engine --> QueryPlan
    Engine ..> NormaliseModule
    QueryPlan ..> Formula

File-Level Changes

Change	Details	Files
Introduce canonical Formula/Atom/Decorated types as stable core API and export them from sempai_core.	Define Formula enum (Atom/Not/Inside/Anywhere/And/Or/Constraint) and Atom enum (Pattern/Regex/TreeSitterQuery) following the normalised formula model. Define generic Decorated wrapper to carry where/as/fix metadata alongside formula nodes. Export the new formula module from sempai_core::lib and add unit tests for Formula construction, equality, Atom variants, and Decorated handling.	`crates/sempai-core/src/formula.rs` `crates/sempai-core/src/lib.rs` `crates/sempai-core/src/tests/formula_tests.rs`
Add pure normalization functions converting legacy and v2 YAML models into the canonical Formula, plus semantic constraint validation.	Implement normalise_search_principal entrypoint that dispatches from SearchQueryPrincipal to legacy/v2-specific normalisers. Map LegacyFormula and related LegacyValue/LegacyClause variants into Formula according to the documented operator mapping (patterns→And, pattern-either→Or, pattern-not-inside→Not(Inside(...)), constraints→Formula::Constraint, etc.). Map v2 MatchFormula (pattern/regex/all/any/not/inside/anywhere/decorated) into Formula while preserving Decorated metadata via Decorated. Implement validate_formula_constraints to enforce InvalidNotInOr and MissingPositiveTermInAnd, returning DiagnosticReport on violations. Keep normalization and validation as pure, side-effect-free functions in the sempai normalise module split into legacy, v2, and constraints submodules.	`crates/sempai/src/normalise.rs` `crates/sempai/src/normalise/legacy.rs` `crates/sempai/src/normalise/v2.rs` `crates/sempai/src/normalise/constraints.rs` `crates/sempai/src/tests/normalise_tests.rs`
Wire normalization + constraint validation into Engine::compile_yaml and extend QueryPlan to carry Formulas.	Replace the NOT_IMPLEMENTED placeholder in Engine::compile_yaml with a pipeline that parses search rules, normalises their principals into Formula, validates constraints, and either produces QueryPlans or diagnostics. Change QueryPlan to hold an Option<Formula>, with None used for ProjectDependsOn passthrough plans, and add a formula() accessor. Ensure ProjectDependsOn principals bypass formula normalization so non-formula semantics can be supported later without breaking the pipeline. Adapt existing engine tests and add BDD coverage to assert that compile_yaml returns real plans for valid rules and stable error codes for invalid shapes.	`crates/sempai/src/engine.rs` `crates/sempai/src/tests/engine_tests.rs` `crates/sempai/tests/features/formula_normalization.feature` `crates/sempai/src/tests/behaviour.rs`
Extend tests and BDD scenarios to cover normalization mappings, constraint violations, and legacy↔v2 equivalence.	Add unit tests that exercise legacy→Formula and v2 match→Formula mappings, deep nesting, decorated metadata propagation, and constraint preservation. Add tests specifically targeting InvalidNotInOr and MissingPositiveTermInAnd, including edge cases like only Inside/Anywhere terms and constraint-only conjunctions. Introduce BDD scenarios that compile legacy and v2 rules and compare resulting Formulas, and scenarios that assert diagnostics for invalid rules.	`crates/sempai-core/src/tests/formula_tests.rs` `crates/sempai/src/tests/normalise_tests.rs` `crates/sempai/src/tests/constraint_tests.rs` `crates/sempai/tests/features/formula_normalization.feature`
Update documentation and roadmap to describe the canonical formula model, normalization behaviour, and milestone completion.	Add a new execplan doc 4-1-5-normalization-into-canonical-formula-model.md describing purpose, constraints, mappings, and implementation plan. Update sempai-query-language-design.md with the normalised Formula model, operator mappings, constraint approach, and crate-placement rationale. Update users-guide.md to describe compile_yaml now returning real query plans and listing possible semantic error codes. Mark roadmap item 4.1.5 as done once all tests and gates pass.	`docs/execplans/4-1-5-normalization-into-canonical-formula-model.md` `docs/sempai-query-language-design.md` `docs/users-guide.md` `docs/roadmap.md`


Added a new section 'Practice documentation' to the 4-1-5 normalization into canonical formula model docs. This section lists relevant project guidance documents covering Testing, Design and architecture, Code quality, and Configuration aspects pertinent to this milestone. Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>

- Introduce a canonical normalized formula model (Formula, Atom, Decorated) in sempai_core
- Add normalization modules in sempai to lower legacy and v2 Semgrep queries into Formula
- Implement semantic constraint validation on formulas (e.g., disallow Not in Or, require positive terms in And)
- Change Engine::compile_yaml to produce QueryPlans with normalized formulas
- Support ProjectDependsOn rules producing plans without formulas
- Add extensive unit and BDD tests for normalization and constraints
- Update docs and user guide to reflect normalization and new diagnostics
- Remove previous not-implemented placeholders from query compilation

This implements roadmap item 4.1.5 and completes the canonical internal formula model and query normalization pipeline.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>

codescene-delta-analysis

Gates Failed
Enforce advisory code health rules (3 files with Code Duplication)

Gates Passed
5 Quality Gates Passed

See analysis details in CodeScene

Reason for failure

Enforce advisory code health rules	Violations	Code Health Impact
behaviour.rs	1 advisory rule	10.00 → 9.39
constraint_tests.rs	1 advisory rule	9.39
normalise_tests.rs	1 advisory rule	9.39

Quality Gate Profile: Pay Down Tech Debt

codescene-delta-analysis · 2026-04-13T19:17:55Z

 #[then("compilation fails with code {code}")]
 fn then_compilation_fails(world: &mut TestWorld, code: QuotedString) {
-    assert_diagnostic_code(
+    let report = extract_report(


❌ New issue: Code Duplication
The module contains 4 functions with similar structure: then_compilation_fails, then_execution_fails, then_first_plan_has_formula, then_first_plan_has_no_formula

codescene-delta-analysis · 2026-04-13T19:17:56Z

+fn or_with_not_child_is_rejected() {
+    let formula = Formula::Or(vec![
+        bare(pat("a")),
+        bare(Formula::Not(Box::new(bare(pat("b"))))),
+    ]);
+    let err = validate_formula_constraints(&formula).expect_err("should fail");
+    let code = err.diagnostics().first().expect("at least one").code();
+    assert_eq!(code, DiagnosticCode::ESempaiInvalidNotInOr);
+}


❌ New issue: Code Duplication
The module contains 4 functions with similar structure: and_with_no_positive_terms_is_rejected, nested_and_inside_or_with_no_positive_term_is_rejected, nested_or_with_not_inside_and_is_rejected, or_with_not_child_is_rejected

codescene-delta-analysis · 2026-04-13T19:17:56Z

+fn legacy_patterns_normalises_to_and() {
+    let principal = SearchQueryPrincipal::Legacy(LegacyFormula::Patterns(vec![
+        LegacyClause::Formula(LegacyFormula::Pattern(String::from("a"))),
+        LegacyClause::Formula(LegacyFormula::PatternNot(Box::new(LegacyValue::String(
+            String::from("b"),
+        )))),
+    ]));
+    let result = normalise_search_principal(&principal).expect("ok");
+    let expected = Formula::And(vec![
+        bare(pat("a")),
+        bare(Formula::Not(Box::new(bare(pat("b"))))),
+    ]);
+    assert_eq!(result, Some(expected));
+}


❌ New issue: Code Duplication
The module contains 4 functions with similar structure: legacy_pattern_either_normalises_to_or, legacy_patterns_normalises_to_and, v2_all_normalises_to_and, v2_any_normalises_to_or


leynos changed the title ~~Plan normalization into canonical Formula model~~ Add exec plan for normalization into canonical Formula model (4.1.5) Apr 13, 2026

leynos changed the title ~~Add exec plan for normalization into canonical Formula model (4.1.5)~~ Implement canonical Formula normalization; wire into engine (4.1.5) Apr 13, 2026

codescene-delta-analysis Bot reviewed Apr 13, 2026

leynos wants to merge 3 commits into main from implement-normalization-canonical-formula-43j33j
