
Implement canonical Formula normalization; wire into engine (4.1.5) #105

Draft
leynos wants to merge 3 commits into main from implement-normalization-canonical-formula-43j33j


@leynos
Owner

@leynos leynos commented Apr 12, 2026

Summary

  • Implement end-to-end normalization of both legacy Semgrep operators and v2 match operators into a canonical Formula model, expose a stable interface, and wire this through the engine. This completes milestone 4.1.5 and replaces the previous documentation-only patch with a full implementation path.
  • The ExecPlan document remains the planning artifact for this milestone and has been complemented by concrete code changes and tests.

Changes

  • Core formula model
    • New: crates/sempai-core/src/formula.rs introducing the canonical normalised formula model with
      • Atom enum (Pattern, Regex, TreeSitterQuery)
      • Decorated wrapper carrying where/as/fix metadata
      • Formula enum: Atom, Not, Inside, Anywhere, And, Or, Constraint
    • Public re-exports added in crates/sempai-core/src/lib.rs for Atom, Decorated, and Formula.
  • Normalisation pipeline (sempai crate)
    • New modular normalization: crates/sempai/src/normalise/mod.rs plus
      • constraints.rs, legacy.rs, v2.rs
    • Public function: normalise_search_principal(...) to map a parsed YAML search principal to Option<Formula> (None for ProjectDependsOn)
    • Public function: validate_formula_constraints(...) to enforce semantic invariants on the canonical tree
  • Engine integration
    • crates/sempai/src/engine.rs updated to:
      • Carry an optional Formula in QueryPlan instead of a placeholder
      • Wire normalization into compile_yaml: for each search-rule, normalise the principal, validate constraints, and attach the resulting Formula to the plan
      • Expose a formula() accessor on QueryPlan
  • Tests
    • Unit tests for formula types: crates/sempai-core/src/tests/formula_tests.rs
    • Normalisation tests: crates/sempai/src/tests/normalise_tests.rs
    • Constraint validation tests: crates/sempai/src/tests/constraint_tests.rs
    • Engine tests updated to expect plans carrying Formula and to validate plan.formula() behaviour
    • Re-export tests updated to reflect new Formula types
  • Behaviour and features
    • BDD/tests feature file for formula normalization added: crates/sempai/tests/features/formula_normalization.feature
    • Updated tests for engine behaviour to verify plan count and presence/absence of formulas
  • Documentation and planning artefacts
    • ExecPlan added/updated: docs/execplans/4-1-5-normalization-into-canonical-formula-model.md
    • Roadmap and design references adjusted to reflect normalization path and API changes
  • Task reference
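The core formula model listed above can be pictured roughly as follows. This is a minimal sketch inferred from the bullet list and the test snippets in this PR, not the contents of crates/sempai-core/src/formula.rs. The `bare` and `pat` helpers are assumptions named after the test helpers, and `String` stands in for the `serde_json::Value` payloads so the sketch needs only the standard library.

```rust
// Sketch only: shapes are inferred from the PR description and may differ
// from the real definitions in sempai_core.

/// Leaf matchers of the canonical model.
#[derive(Debug, Clone, PartialEq)]
pub enum Atom {
    Pattern(String),
    Regex(String),
    TreeSitterQuery(String),
}

/// Wrapper carrying where/as/fix metadata alongside a node.
#[derive(Debug, Clone, PartialEq)]
pub struct Decorated<T> {
    pub node: T,
    // Assumption: String replaces serde_json::Value to stay std-only.
    pub where_clauses: Vec<String>,
    pub as_name: Option<String>,
    pub fix: Option<String>,
}

/// Canonical formula tree shared by the legacy and v2 syntaxes.
#[derive(Debug, Clone, PartialEq)]
pub enum Formula {
    Atom(Atom),
    Not(Box<Decorated<Formula>>),
    Inside(Box<Decorated<Formula>>),
    Anywhere(Box<Decorated<Formula>>),
    And(Vec<Decorated<Formula>>),
    Or(Vec<Decorated<Formula>>),
    // Assumption: opaque constraint payload kept as text in this sketch.
    Constraint(String),
}

/// Hypothetical helper: wrap a node with empty metadata.
pub fn bare(node: Formula) -> Decorated<Formula> {
    Decorated {
        node,
        where_clauses: Vec::new(),
        as_name: None,
        fix: None,
    }
}

/// Hypothetical helper: build a pattern atom from text.
pub fn pat(text: &str) -> Formula {
    Formula::Atom(Atom::Pattern(text.to_string()))
}
```

With these helpers, a conjunction of a pattern and a negated pattern reads as `Formula::And(vec![bare(pat("a")), bare(Formula::Not(Box::new(bare(pat("b")))))])`, which matches the shape used in the normalisation tests quoted later in this conversation.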

Artefacts

  • New files:
    • crates/sempai-core/src/formula.rs
    • crates/sempai-core/src/tests/formula_tests.rs
    • crates/sempai/src/normalise/mod.rs
    • crates/sempai/src/normalise/constraints.rs
    • crates/sempai/src/normalise/legacy.rs
    • crates/sempai/src/normalise/v2.rs
    • crates/sempai/src/normalise/README (if any) and related test scaffolding
    • crates/sempai/src/tests/normalise_tests.rs
    • crates/sempai/src/tests/constraint_tests.rs
    • crates/sempai/src/tests/engine_tests.rs (updated expectations)
    • crates/sempai/tests/features/formula_normalization.feature
    • docs/execplans/4-1-5-normalization-into-canonical-formula-model.md
  • The ExecPlan document remains the gating artifact for this milestone and is now complemented by concrete code/tests.

Rationale & design decisions

  • Crate placement: canonical Formula and its types live in sempai-core to serve downstream consumers; normalization logic lives in the sempai crate to avoid circular dependencies with sempai_yaml. This aligns with the Option B approach described in the ExecPlan.
  • Semantics and constraints: preservation of metavariable constraints via Formula::Constraint while introducing deterministic semantic constraints via validate_formula_constraints. This ensures a clean separation between parsing/structural validation and semantic rule validation.
  • ProjectDependsOn passthrough: Rules of this kind produce plans with formula = None, allowing downstream handling without forcing a formula.
  • Decorated metadata: v2 Decorated metadata is preserved on inner formulas; top-level normalization returns the inner Formula node, but metadata continues to flow through nested Decorated contexts.
  • Test-intensive approach: unit tests for the core Formula model, normalization maps, semantic constraints, and engine integration tests ensure broad coverage of happy paths, edge cases, and error states before stable API usage.
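To illustrate the operator mapping and the ProjectDependsOn passthrough described above, here is a simplified sketch. It assumes cut-down stand-ins for the sempai_yaml and sempai_core types: Decorated wrappers, LegacyValue, LegacyClause and the diagnostic Result layer are all omitted, so the signatures differ from the real normalise_search_principal.

```rust
// Simplified stand-ins; the real types live in sempai_core and sempai_yaml.
#[derive(Debug, Clone, PartialEq)]
pub enum Formula {
    Pattern(String),
    Regex(String),
    Not(Box<Formula>),
    Inside(Box<Formula>),
    And(Vec<Formula>),
    Or(Vec<Formula>),
}

#[derive(Debug, Clone)]
pub enum LegacyFormula {
    Pattern(String),
    PatternRegex(String),
    Patterns(Vec<LegacyFormula>),
    PatternEither(Vec<LegacyFormula>),
    PatternNot(Box<LegacyFormula>),
    PatternInside(Box<LegacyFormula>),
    PatternNotInside(Box<LegacyFormula>),
}

#[derive(Debug, Clone)]
pub enum SearchQueryPrincipal {
    Legacy(LegacyFormula),
    ProjectDependsOn,
}

/// Lower a legacy operator tree into the canonical shape.
pub fn normalise_legacy(f: &LegacyFormula) -> Formula {
    match f {
        LegacyFormula::Pattern(p) => Formula::Pattern(p.clone()),
        LegacyFormula::PatternRegex(r) => Formula::Regex(r.clone()),
        // patterns -> And
        LegacyFormula::Patterns(items) => {
            Formula::And(items.iter().map(normalise_legacy).collect())
        }
        // pattern-either -> Or
        LegacyFormula::PatternEither(items) => {
            Formula::Or(items.iter().map(normalise_legacy).collect())
        }
        LegacyFormula::PatternNot(inner) => Formula::Not(Box::new(normalise_legacy(inner))),
        LegacyFormula::PatternInside(inner) => {
            Formula::Inside(Box::new(normalise_legacy(inner)))
        }
        // pattern-not-inside -> Not(Inside(...))
        LegacyFormula::PatternNotInside(inner) => Formula::Not(Box::new(Formula::Inside(
            Box::new(normalise_legacy(inner)),
        ))),
    }
}

/// ProjectDependsOn has no formula, so it maps to None.
pub fn normalise_search_principal(p: &SearchQueryPrincipal) -> Option<Formula> {
    match p {
        SearchQueryPrincipal::Legacy(f) => Some(normalise_legacy(f)),
        SearchQueryPrincipal::ProjectDependsOn => None,
    }
}
```

Returning None rather than an error for ProjectDependsOn keeps the engine free to emit a plan for such rules without inventing a placeholder formula.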

Testing plan

  • Unit tests:
    • Formula type construction and equality (atoms, regex, TSQuery)
    • Decorated wrapper behaviour
    • Formula construction (Not/Inside/Anywhere/And/Or/Constraint) and is_positive_term classification
  • Normalisation tests:
    • Legacy and v2 normalization paths (Pattern, PatternRegex, Patterns, PatternEither, etc.)
    • Paired equivalence tests for legacy vs v2 mappings
    • Decorated v2 metadata preservation within nested structures
    • Deep nesting cases
  • Constraint tests:
    • InvalidNotInOr when Not appears under Or
    • MissingPositiveTermInAnd across various compositions, including constraints edge cases
  • Engine tests:
    • compile_yaml produces plans with formulas for valid rules
    • compile_yaml produces plans without formulas for ProjectDependsOn rules
    • Execution path remains not implemented for now, with plans carrying the formula for future backend
  • Behaviour/BDD tests:
    • Updated/added scenarios to reflect formula normalization outcomes
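The two semantic checks named in the constraint tests can be sketched as a recursive walk over the tree. This is an illustration under stated assumptions: the Formula type is simplified (no Decorated wrapper), the error enum is a hypothetical stand-in for the DiagnosticReport codes, and the `is_positive_term` classification (atoms, And and Or count as positive; Not, Inside, Anywhere and Constraint do not) is a guess that may not match the PR.

```rust
// Simplified stand-in for the canonical Formula; no Decorated wrapper here.
#[derive(Debug, Clone, PartialEq)]
pub enum Formula {
    Atom(String),
    Not(Box<Formula>),
    Inside(Box<Formula>),
    Anywhere(Box<Formula>),
    And(Vec<Formula>),
    Or(Vec<Formula>),
    Constraint(String),
}

// Hypothetical error type; the PR reports these via DiagnosticReport codes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConstraintError {
    InvalidNotInOr,
    MissingPositiveTermInAnd,
}

/// Assumed classification of terms that can anchor a conjunction.
fn is_positive_term(f: &Formula) -> bool {
    matches!(f, Formula::Atom(_) | Formula::And(_) | Formula::Or(_))
}

/// Walk the tree and report the first violated invariant.
pub fn validate_formula_constraints(f: &Formula) -> Result<(), ConstraintError> {
    match f {
        Formula::Atom(_) | Formula::Constraint(_) => Ok(()),
        Formula::Not(inner) | Formula::Inside(inner) | Formula::Anywhere(inner) => {
            validate_formula_constraints(inner)
        }
        Formula::Or(children) => {
            // A negated branch directly under Or is rejected.
            if children.iter().any(|c| matches!(c, Formula::Not(_))) {
                return Err(ConstraintError::InvalidNotInOr);
            }
            children.iter().try_for_each(validate_formula_constraints)
        }
        Formula::And(children) => {
            // A conjunction needs at least one positive term.
            if !children.iter().any(is_positive_term) {
                return Err(ConstraintError::MissingPositiveTermInAnd);
            }
            children.iter().try_for_each(validate_formula_constraints)
        }
    }
}
```

Because the check is a pure function over the tree, it can run after normalisation without touching the parser, which fits the separation between structural and semantic validation described in the rationale.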

Milestones progress (high level)

  • Status: COMPLETE for 4.1.5 (Implement legacy and v2 normalization into one canonical Formula model with semantic constraint checks). All tests and gates updated accordingly.

Task

📎 Task: https://www.devboxer.com/task/21bc5583-37ef-4f65-88fa-8c6f1f8f662c

Add a canonical Formula enum and associated Atom and Decorated types to sempai_core.
Implement normalization of legacy and v2 Semgrep operators into the canonical Formula model.
Enforce semantic constraints rejecting invalid formula shapes with diagnostic codes.
Update Engine::compile_yaml to produce QueryPlan structs carrying normalized Formulas instead of placeholders.
Preserve legacy constraints as opaque Formula::Constraint nodes.
Add extensive unit and BDD tests covering normalization and constraint validation.
Update documentation and roadmap to reflect new normalization and validation behavior.

This completes roadmap item 4.1.5 with fully tested normalization pipeline and user-visible error reporting.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
@coderabbitai

coderabbitai Bot commented Apr 12, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0004a6be-9753-4133-b494-1a21ed7131dd


codescene-delta-analysis[bot]

This comment was marked as outdated.

@sourcery-ai

sourcery-ai Bot commented Apr 12, 2026

Reviewer's Guide

Implements a canonical Formula model in sempai_core and wires a pure normalization + semantic-validation pipeline (from legacy and v2 YAML search principals) through the sempai engine so Engine::compile_yaml now returns real QueryPlans carrying Formulas, backed by unit/BDD tests and documentation updates.
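As a rough picture of the engine side of this pipeline, the sketch below shows a plan type carrying an optional formula and a compile step that yields one plan per rule. All names other than QueryPlan and formula() are hypothetical: `Rule`, `Principal`, `compile_rules` and `validate_stub` are stand-ins, and the String error replaces the DiagnosticReport used by the real Engine::compile_yaml.

```rust
// Simplified stand-ins; not the real sempai engine types.
#[derive(Debug, Clone, PartialEq)]
pub enum Formula {
    Pattern(String),
    And(Vec<Formula>),
}

// Hypothetical: a principal that has already been normalised.
#[derive(Debug, Clone)]
pub enum Principal {
    Search(Formula),
    ProjectDependsOn,
}

// Hypothetical rule record.
pub struct Rule {
    pub id: String,
    pub principal: Principal,
}

#[derive(Debug, Clone, PartialEq)]
pub struct QueryPlan {
    rule_id: String,
    formula: Option<Formula>,
}

impl QueryPlan {
    pub fn rule_id(&self) -> &str {
        &self.rule_id
    }

    /// Borrow the formula, if this plan carries one.
    pub fn formula(&self) -> Option<&Formula> {
        self.formula.as_ref()
    }
}

/// Placeholder check standing in for semantic validation: rejects an empty And.
fn validate_stub(f: &Formula) -> Result<(), String> {
    match f {
        Formula::And(children) if children.is_empty() => Err("empty conjunction".to_string()),
        _ => Ok(()),
    }
}

/// Build one plan per rule; the first validation failure aborts compilation.
pub fn compile_rules(rules: &[Rule]) -> Result<Vec<QueryPlan>, String> {
    rules
        .iter()
        .map(|rule| -> Result<QueryPlan, String> {
            let formula = match &rule.principal {
                Principal::Search(f) => {
                    validate_stub(f).map_err(|e| format!("{}: {}", rule.id, e))?;
                    Some(f.clone())
                }
                // Passthrough: a plan is still produced, without a formula.
                Principal::ProjectDependsOn => None,
            };
            Ok(QueryPlan {
                rule_id: rule.id.clone(),
                formula,
            })
        })
        .collect()
}
```

Exposing the formula through an accessor that returns `Option<&Formula>` lets callers distinguish passthrough plans without cloning the tree.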

Class diagram for canonical Formula model and engine integration

classDiagram
    direction LR

    class Atom {
        <<enum>>
        +Pattern : String
        +Regex : String
        +TreeSitterQuery : String
    }

    class Decorated_T_ {
        <<generic>>
        +node : T
        +where_clauses : Vec~serde_json::Value~
        +as_name : Option~String~
        +fix : Option~String~
    }

    class Formula {
        <<enum>>
        +Atom : Atom
        +Not : Decorated~Formula~
        +Inside : Decorated~Formula~
        +Anywhere : Decorated~Formula~
        +And : Vec~Decorated~Formula~~
        +Or : Vec~Decorated~Formula~~
        +Constraint : serde_json::Value
    }

    class SearchQueryPrincipal {
        <<from_sempai_yaml>>
        +Legacy : LegacyFormula
        +Match : MatchFormula
        +ProjectDependsOn
    }

    class LegacyFormula {
        <<from_sempai_yaml>>
        +Pattern : String
        +PatternRegex : String
        +Patterns : Vec~LegacyClause~
        +PatternEither : Vec~LegacyFormula~
        +PatternNot : LegacyValue
        +PatternInside : LegacyValue
        +PatternNotInside : LegacyValue
        +PatternNotRegex : String
        +Anywhere : LegacyValue
    }

    class LegacyValue {
        <<from_sempai_yaml>>
        +String : String
        +Formula : LegacyFormula
    }

    class LegacyClause {
        <<from_sempai_yaml>>
        +Formula : LegacyFormula
        +Constraint : serde_json::Value
    }

    class MatchFormula {
        <<from_sempai_yaml>>
        +Pattern : String
        +PatternObject : String
        +Regex : String
        +All : Vec~MatchFormula~
        +Any : Vec~MatchFormula~
        +Not : Box~MatchFormula~
        +Inside : Box~MatchFormula~
        +Anywhere : Box~MatchFormula~
        +Decorated : MatchDecorated
    }

    class MatchDecorated {
        <<from_sempai_yaml>>
        +formula : Box~MatchFormula~
        +where_clauses : Vec~serde_json::Value~
        +as_name : Option~String~
        +fix : Option~String~
    }

    class DiagnosticReport {
        <<from_sempai_core>>
        +codes : Vec~DiagnosticCode~
    }

    class QueryPlan {
        +rule_id : String
        +language : Language
        +formula : Option~Formula~
        +formula() : Option~&Formula~
    }

    class Engine {
        +compile_yaml(rule_file_path : &str) : Result~Vec~QueryPlan~, DiagnosticReport~
        +execute(plans : &Vec~QueryPlan~) : Result~(), DiagnosticReport~
    }

    class NormaliseModule {
        <<sempai_normalise>>
        +normalise_search_principal(principal : &SearchQueryPrincipal) : Result~Option~Formula~, DiagnosticReport~
        +normalise_legacy(formula : &LegacyFormula) : Result~Formula, DiagnosticReport~
        +normalise_match(formula : &MatchFormula) : Result~Formula, DiagnosticReport~
        +validate_formula_constraints(formula : &Formula) : Result~(), DiagnosticReport~
    }

    Formula o--> Atom
    Formula "1" o--> "many" Decorated_T_
    Decorated_T_ "1" o--> "1" Formula : T=Formula

    SearchQueryPrincipal --> LegacyFormula
    SearchQueryPrincipal --> MatchFormula

    LegacyFormula --> LegacyClause
    LegacyFormula --> LegacyValue
    LegacyClause --> LegacyFormula

    MatchFormula --> MatchDecorated
    MatchDecorated --> MatchFormula

    NormaliseModule ..> SearchQueryPrincipal
    NormaliseModule ..> LegacyFormula
    NormaliseModule ..> MatchFormula
    NormaliseModule ..> Formula
    NormaliseModule ..> DiagnosticReport

    Engine --> QueryPlan
    Engine ..> NormaliseModule
    QueryPlan ..> Formula

File-Level Changes

Change: Introduce canonical Formula/Atom/Decorated types as stable core API and export them from sempai_core.
Details:
  • Define Formula enum (Atom/Not/Inside/Anywhere/And/Or/Constraint) and Atom enum (Pattern/Regex/TreeSitterQuery) following the normalised formula model.
  • Define generic Decorated wrapper to carry where/as/fix metadata alongside formula nodes.
  • Export the new formula module from sempai_core::lib and add unit tests for Formula construction, equality, Atom variants, and Decorated handling.
Files:
  • crates/sempai-core/src/formula.rs
  • crates/sempai-core/src/lib.rs
  • crates/sempai-core/src/tests/formula_tests.rs

Change: Add pure normalization functions converting legacy and v2 YAML models into the canonical Formula, plus semantic constraint validation.
Details:
  • Implement normalise_search_principal entrypoint that dispatches from SearchQueryPrincipal to legacy/v2-specific normalisers.
  • Map LegacyFormula and related LegacyValue/LegacyClause variants into Formula according to the documented operator mapping (patterns→And, pattern-either→Or, pattern-not-inside→Not(Inside(...)), constraints→Formula::Constraint, etc.).
  • Map v2 MatchFormula (pattern/regex/all/any/not/inside/anywhere/decorated) into Formula while preserving v2 decoration metadata via Decorated<Formula>.
  • Implement validate_formula_constraints to enforce InvalidNotInOr and MissingPositiveTermInAnd, returning DiagnosticReport on violations.
  • Keep normalization and validation as pure, side-effect-free functions in the sempai normalise module split into legacy, v2, and constraints submodules.
Files:
  • crates/sempai/src/normalise.rs
  • crates/sempai/src/normalise/legacy.rs
  • crates/sempai/src/normalise/v2.rs
  • crates/sempai/src/normalise/constraints.rs
  • crates/sempai/src/tests/normalise_tests.rs

Change: Wire normalization + constraint validation into Engine::compile_yaml and extend QueryPlan to carry Formulas.
Details:
  • Replace the NOT_IMPLEMENTED placeholder in Engine::compile_yaml with a pipeline that parses search rules, normalises their principals into Formula, validates constraints, and either produces QueryPlans or diagnostics.
  • Change QueryPlan to hold an Option<Formula>, with None used for ProjectDependsOn passthrough plans, and add a formula() accessor.
  • Ensure ProjectDependsOn principals bypass formula normalization so non-formula semantics can be supported later without breaking the pipeline.
  • Adapt existing engine tests and add BDD coverage to assert that compile_yaml returns real plans for valid rules and stable error codes for invalid shapes.
Files:
  • crates/sempai/src/engine.rs
  • crates/sempai/src/tests/engine_tests.rs
  • crates/sempai/tests/features/formula_normalization.feature
  • crates/sempai/src/tests/behaviour.rs

Change: Extend tests and BDD scenarios to cover normalization mappings, constraint violations, and legacy↔v2 equivalence.
Details:
  • Add unit tests that exercise legacy→Formula and v2 match→Formula mappings, deep nesting, decorated metadata propagation, and constraint preservation.
  • Add tests specifically targeting InvalidNotInOr and MissingPositiveTermInAnd, including edge cases like only Inside/Anywhere terms and constraint-only conjunctions.
  • Introduce BDD scenarios that compile legacy and v2 rules and compare resulting Formulas, and scenarios that assert diagnostics for invalid rules.
Files:
  • crates/sempai-core/src/tests/formula_tests.rs
  • crates/sempai/src/tests/normalise_tests.rs
  • crates/sempai/src/tests/constraint_tests.rs
  • crates/sempai/tests/features/formula_normalization.feature

Change: Update documentation and roadmap to describe the canonical formula model, normalization behaviour, and milestone completion.
Details:
  • Add a new execplan doc 4-1-5-normalization-into-canonical-formula-model.md describing purpose, constraints, mappings, and implementation plan.
  • Update sempai-query-language-design.md with the normalised Formula model, operator mappings, constraint approach, and crate-placement rationale.
  • Update users-guide.md to describe compile_yaml now returning real query plans and listing possible semantic error codes.
  • Mark roadmap item 4.1.5 as done once all tests and gates pass.
Files:
  • docs/execplans/4-1-5-normalization-into-canonical-formula-model.md
  • docs/sempai-query-language-design.md
  • docs/users-guide.md
  • docs/roadmap.md


Added a new section 'Practice documentation' to the 4-1-5 normalization into canonical formula model docs. This section lists relevant project guidance documents covering Testing, Design and architecture, Code quality, and Configuration aspects pertinent to this milestone.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
codescene-delta-analysis[bot]

This comment was marked as outdated.

@leynos leynos changed the title from "Plan normalization into canonical Formula model" to "Add exec plan for normalization into canonical Formula model (4.1.5)" on Apr 13, 2026
- Introduce a canonical normalized formula model (Formula, Atom, Decorated) in sempai_core
- Add normalization modules in sempai to lower legacy and v2 Semgrep queries into Formula
- Implement semantic constraint validation on formulas (e.g., disallow Not in Or, require positive terms in And)
- Change Engine::compile_yaml to produce QueryPlans with normalized formulas
- Support ProjectDependsOn rules producing plans without formulas
- Add extensive unit and BDD tests for normalization and constraints
- Update docs and user guide to reflect normalization and new diagnostics
- Remove previous not-implemented placeholders from query compilation

This implements roadmap item 4.1.5 and completes the canonical internal formula model and query normalization pipeline.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
@leynos leynos changed the title from "Add exec plan for normalization into canonical Formula model (4.1.5)" to "Implement canonical Formula normalization; wire into engine (4.1.5)" on Apr 13, 2026

@codescene-delta-analysis codescene-delta-analysis Bot left a comment


Gates Failed
Enforce advisory code health rules (3 files with Code Duplication)

Gates Passed
5 Quality Gates Passed

See analysis details in CodeScene

Reason for failure
| Enforce advisory code health rules | Violations | Code Health Impact |
| --- | --- | --- |
| behaviour.rs | 1 advisory rule | 10.00 → 9.39 |
| constraint_tests.rs | 1 advisory rule | 9.39 |
| normalise_tests.rs | 1 advisory rule | 9.39 |

Quality Gate Profile: Pay Down Tech Debt

#[then("compilation fails with code {code}")]
fn then_compilation_fails(world: &mut TestWorld, code: QuotedString) {
assert_diagnostic_code(
let report = extract_report(

❌ New issue: Code Duplication
The module contains 4 functions with similar structure: then_compilation_fails,then_execution_fails,then_first_plan_has_formula,then_first_plan_has_no_formula


Comment on lines +24 to +32
fn or_with_not_child_is_rejected() {
    let formula = Formula::Or(vec![
        bare(pat("a")),
        bare(Formula::Not(Box::new(bare(pat("b"))))),
    ]);
    let err = validate_formula_constraints(&formula).expect_err("should fail");
    let code = err.diagnostics().first().expect("at least one").code();
    assert_eq!(code, DiagnosticCode::ESempaiInvalidNotInOr);
}

❌ New issue: Code Duplication
The module contains 4 functions with similar structure: and_with_no_positive_terms_is_rejected,nested_and_inside_or_with_no_positive_term_is_rejected,nested_or_with_not_inside_and_is_rejected,or_with_not_child_is_rejected


Comment on lines +43 to +56
fn legacy_patterns_normalises_to_and() {
    let principal = SearchQueryPrincipal::Legacy(LegacyFormula::Patterns(vec![
        LegacyClause::Formula(LegacyFormula::Pattern(String::from("a"))),
        LegacyClause::Formula(LegacyFormula::PatternNot(Box::new(LegacyValue::String(
            String::from("b"),
        )))),
    ]));
    let result = normalise_search_principal(&principal).expect("ok");
    let expected = Formula::And(vec![
        bare(pat("a")),
        bare(Formula::Not(Box::new(bare(pat("b"))))),
    ]);
    assert_eq!(result, Some(expected));
}

❌ New issue: Code Duplication
The module contains 4 functions with similar structure: legacy_pattern_either_normalises_to_or,legacy_patterns_normalises_to_and,v2_all_normalises_to_and,v2_any_normalises_to_or
