Skip to content

Add sempai_yaml: YAML parser front-end, engine wiring, and tests#85

Merged
leynos merged 26 commits intomainfrom
implement-yaml-rule-parsing-yt0shv
Mar 28, 2026
Merged

Add sempai_yaml: YAML parser front-end, engine wiring, and tests#85
leynos merged 26 commits intomainfrom
implement-yaml-rule-parsing-yt0shv

Conversation

@leynos
Copy link
Copy Markdown
Owner

@leynos leynos commented Mar 21, 2026

Summary

  • Adds a dedicated YAML front-end crate (sempai_yaml) for Semgrep-compatible YAML rule parsing using Saphyr and serde-saphyr. Exposes typed models (RuleFile, Rule, RuleMode, RuleSeverity, RulePrincipal, etc.) and a parser entrypoint to convert YAML into these models with structured diagnostics.
  • Integrates with the existing sempai/Engine surface by wiring Engine::compile_yaml to parse YAML via sempai_yaml and emit parser/schema diagnostics on errors; on success, returns a NOT_IMPLEMENTED placeholder until normalization is implemented.
  • Includes tests, source-location mapping, and behaviour tests to exercise both happy paths and error paths. Documentation and planning artifacts updated accordingly (ExecPlan 4.1.3 and roadmap changes).
  • Updates to workspace and public API surface to include the new crate and wire it into the public facade.

Changes

  • Added new crate: crates/sempai-yaml with:
    • Cargo.toml, src/lib.rs, src/model.rs, src/parser.rs, src/raw.rs, src/source_map.rs
    • Tests: behaviours (crates/sempai-yaml/src/tests/behaviour.rs), parser_tests.rs, taint_tests.rs, test_helpers.rs, mod.rs
    • Feature tests: crates/sempai-yaml/tests/features/sempai_yaml.feature
  • Updated workspace to include the new crate in Cargo.toml at the repo root.
  • Wiring in sempai to parse YAML via the new parser:
    • crates/sempai/src/engine.rs: Engine::compile_yaml now parses YAML using sempai_yaml::parse_rule_file and returns real parser diagnostics on errors; on success, it returns a NOT_IMPLEMENTED placeholder until normalization is implemented.
  • Updated tests and test features to reflect YAML parsing behavior:
    • crates/sempai/src/tests/engine_tests.rs: added tests for YAML parse diagnostics (malformed YAML, schema-invalid, and post-parse NOT_IMPLEMENTED behavior).
    • crates/sempai/tests/features/sempai_engine.feature and related tests updated to reflect YAML parsing outcomes.
  • Documentation and planning artifacts:
    • docs/execplans/4-1-3-yaml-rule-parsing-via-saphyr-and-serde-saphyr.md added (ExecPlan draft/done progression).
    • docs/roadmap.md updated to mark 4.1.3 as done.
    • docs/sempai-query-language-design.md notes about the YAML parser boundary and diagnostic mapping.
    • docs/users-guide.md updated to describe the new sempai_yaml crate and the compile_yaml behaviour once parsing succeeds but normalization is not yet implemented.
    • Updated references to reflect the crate boundary and parser boundary across docs.

Rationale

  • This work establishes the YAML parsing front-end as a dedicated crate (sempai_yaml), decoupling parsing from runtime execution and allowing schema-aligned rule models to evolve independently of normalization/execution. It implements the parsing boundary described in the 4.1.x roadmap and prepares the ground for 4.1.4/4.1.5 while keeping the public API stable and recoverable via structured diagnostics.

Plan of Work (high level)

  • Stage A: Scaffold YAML crate, wire into workspace, write red-path unit tests.
  • Stage B: Implement schema-aligned YAML rule models and parser entrypoints.
  • Stage C: Add structured diagnostics and source-location mapping.
  • Stage D: Wire Engine::compile_yaml to the parser boundary and surface parser diagnostics.
  • Stage E: Documentation and roadmap alignment.
  • Stage F: Quality gates and test coverage.

Test Plan

  • Unit tests verify that the parser properly deserializes YAML into RuleFile/Rule models and that basic queries (legacy match and v2 match) are parsed into the appropriate RulePrincipal variants.
  • Behaviour tests cover happy-path parsing, malformed YAML (E_SEMPAI_YAML_PARSE), and schema-invalid shapes (E_SEMPAI_SCHEMA_INVALID).
  • Engine tests verify that Engine::compile_yaml now returns parser diagnostics for malformed or invalid YAML and yields a NOT_IMPLEMENTED placeholder after successful parsing pending normalization.

Documentation & Roadmap

  • ExecPlan 4.1.3 documented in docs/execplans/4-1-3-yaml-rule-parsing-via-saphyr-and-serde-saphyr.md, with progress and plan details.
  • Roadmap updated to reflect 4.1.3 as done; mentions parser boundary and YAML front-end.
  • User guide updated to describe the new sempai_yaml crate and the compile_yaml behaviour once parsing succeeds but normalization is not yet implemented.
  • Sempai design page notes the parser boundary and diagnostic flow for YAML parsing.

Validation & Acceptance Criteria

  • A Semgrep-compatible YAML document with rules: [...] parses into a typed RuleFile model.
  • Rule metadata (id, message, languages, severity, mode) parse from supported YAML forms.
  • Query principals parse for legacy (pattern/pattern-regex/patterns/pattern-either) and v2 (match) forms.
  • Invalid YAML syntax emits E_SEMPAI_YAML_PARSE with a useful primary_span where possible.
  • Structural schema failures emit E_SEMPAI_SCHEMA_INVALID with stable messages and an actionable primary_span when location can be identified.
  • Unit tests and rstest-bdd behaviour tests cover happy, unhappy, and edge cases for the new parser crate and Engine facade.
  • Documentation notes updated to reflect parser decisions and boundary.

Next Steps

  • Await approval to proceed to 4.1.4/4.1.5 for normalization and full QueryPlan generation.
  • Run the full gate sequence (fmt, lint, tests) and ensure all snapshots remain stable.

Links

📎 Task: https://www.devboxer.com/task/67f4f901-cb23-45d8-ae2d-d319e6e63204

…and serde-saphyr

Create a comprehensive execution plan document outlining the design, constraints, risks, progress stages, and validation requirements for milestone 4.1.3. This milestone focuses on implementing parsing of Semgrep-compatible YAML rule files via a new `sempai_yaml` crate using `saphyr` and `serde-saphyr`. The document specifies the planned crate structure, API contract, diagnostic reporting standards, testing strategy, and documentation updates necessary before implementation begins.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
@sourcery-ai
Copy link
Copy Markdown

sourcery-ai Bot commented Mar 21, 2026

Reviewer's Guide

Introduces a new sempai_yaml crate that parses Semgrep-compatible YAML rule files via Saphyr/serde-saphyr into a typed rule model with structured diagnostics, wires Engine::compile_yaml through this parser while still returning a post-parse NOT_IMPLEMENTED placeholder pending normalization, and updates tests and docs (users guide, design doc, roadmap, exec plan) to reflect the new YAML parsing boundary and diagnostic behavior.

Sequence diagram for Engine::compile_yaml using sempai_yaml parser

sequenceDiagram
  actor Caller
  participant Engine
  participant SempaiYaml as sempai_yaml_parse_rule_file
  participant SourceMap
  participant SerdeSaphyr as serde_saphyr
  participant Diagnostics as DiagnosticReport

  Caller->>Engine: compile_yaml(yaml)
  Engine->>SempaiYaml: parse_rule_file(yaml, None)
  SempaiYaml->>SourceMap: SourceMap::parse(yaml, source_uri)
  SourceMap-->>SempaiYaml: source_map
  SempaiYaml->>SerdeSaphyr: from_str(yaml)

  alt YAML_or_schema_error
    SerdeSaphyr-->>SempaiYaml: Error
    SempaiYaml->>Diagnostics: diagnostic_from_serde(error, source_map)
    SempaiYaml-->>Engine: Err(DiagnosticReport)
    Engine-->>Caller: Err(DiagnosticReport)
  else Parse_ok
    SerdeSaphyr-->>SempaiYaml: RawRuleFile
    SempaiYaml-->>Engine: Ok(RuleFile)
    Engine->>Diagnostics: DiagnosticReport::not_implemented(compile_yaml_query_plan_normalisation)
    Engine-->>Caller: Err(DiagnosticReport)
  end
Loading

Class diagram for new YAML rule model in sempai_yaml

classDiagram
  class RuleFile {
    +Vec~Rule~ rules
    +RuleFile new(rules: Vec~Rule~)
    +rules() Vec~Rule~
  }

  class Rule {
    +String id
    +RuleMode mode
    +Option~String~ message
    +Vec~String~ languages
    +Option~RuleSeverity~ severity
    +Option~String~ min_version
    +Option~String~ max_version
    +RulePrincipal principal
    +id() String
    +mode() RuleMode
    +message() Option~String~
    +languages() Vec~String~
    +severity() Option~RuleSeverity~
    +min_version() Option~String~
    +max_version() Option~String~
    +principal() RulePrincipal
  }

  class RuleMode {
    <<enum>>
    +Search
    +Taint
    +Join
    +Extract
    +Other(String)
    +from_optional(raw: Option~str~) RuleMode
  }

  class RuleSeverity {
    <<enum>>
    +Error
    +Warning
    +Info
    +Inventory
    +Experiment
    +Critical
    +High
    +Medium
    +Low
    +parse(raw: str) Option~RuleSeverity~
  }

  class RulePrincipal {
    <<enum>>
    +Search(SearchQueryPrincipal)
    +Extract(ExtractQueryPrincipal)
    +Taint(TaintQueryPrincipal)
    +Join(Value)
  }

  class SearchQueryPrincipal {
    <<enum>>
    +Legacy(LegacyFormula)
    +Match(MatchFormula)
  }

  class ExtractQueryPrincipal {
    +String dest_language
    +String extract
    +LegacyFormula query
    +dest_language() String
    +extract() String
    +query() LegacyFormula
  }

  class TaintQueryPrincipal {
    <<enum>>
    +New(Value)
    +Legacy(sources: Value, sanitizers: Option~Value~, sinks: Value)
  }

  class LegacyFormula {
    <<enum>>
    +Pattern(String)
    +PatternRegex(String)
    +Patterns(Vec~LegacyClause~)
    +PatternEither(Vec~LegacyFormula~)
    +PatternNot(Box~LegacyValue~)
    +PatternInside(Box~LegacyValue~)
    +PatternNotInside(Box~LegacyValue~)
    +PatternNotRegex(String)
    +Anywhere(Box~LegacyValue~)
  }

  class LegacyValue {
    <<enum>>
    +String(String)
    +Formula(LegacyFormula)
  }

  class LegacyClause {
    <<enum>>
    +Formula(LegacyFormula)
    +Constraint(Value)
  }

  class MatchFormula {
    <<enum>>
    +Pattern(String)
    +PatternObject(String)
    +Regex(String)
    +All(Vec~MatchFormula~)
    +Any(Vec~MatchFormula~)
    +Not(Box~MatchFormula~)
    +Inside(Box~MatchFormula~)
    +Anywhere(Box~MatchFormula~)
    +Decorated(formula: Box~MatchFormula~, where_clauses: Vec~Value~, as_name: Option~String~, fix: Option~String~)
    +decorated(formula: MatchFormula, where_clauses: Vec~Value~, as_name: Option~String~, fix: Option~String~) MatchFormula
  }

  RuleFile --> Rule
  Rule --> RuleMode
  Rule --> RuleSeverity
  Rule --> RulePrincipal
  RulePrincipal --> SearchQueryPrincipal
  RulePrincipal --> ExtractQueryPrincipal
  RulePrincipal --> TaintQueryPrincipal
  SearchQueryPrincipal --> LegacyFormula
  SearchQueryPrincipal --> MatchFormula
  ExtractQueryPrincipal --> LegacyFormula
  TaintQueryPrincipal --> LegacyFormula
  LegacyFormula --> LegacyClause
  LegacyFormula --> LegacyValue
  LegacyClause --> LegacyFormula
  LegacyClause --> LegacyValue
  MatchFormula --> MatchFormula
Loading

File-Level Changes

Change Details Files
Add dedicated sempai_yaml crate that exposes a typed, schema-aligned YAML rule model and parser with source-mapped diagnostics.
  • Create sempai_yaml crate with RuleFile, Rule, RuleMode, RuleSeverity, RulePrincipal and related enums/structs modeling Semgrep rule metadata and query principals (legacy and v2).
  • Implement parse_rule_file(yaml, source_uri) that uses saphyr and serde-saphyr to deserialize YAML into the model, construct DiagnosticReport values on syntax/schema errors, and compute spans via a SourceMap.
  • Model legacy search operators (pattern, pattern-regex, patterns, pattern-either, nested clauses) and v2 match formulas (pattern/regex/all/any/not/inside/anywhere plus decorations like where/as/fix).
  • Handle taint and join rules by preserving their bodies as serde_json::Value for later semantic handling, while still enforcing required metadata fields per mode.
  • Classify serde errors into E_SEMPAI_YAML_PARSE vs E_SEMPAI_SCHEMA_INVALID, mapping serde-saphyr::Error and locations into DiagnosticReport::parser_error and validation_error.
  • Add unit tests and rstest-bdd behaviour tests validating successful parsing, malformed YAML, missing required fields, and diagnostic codes/spans.
crates/sempai-yaml/Cargo.toml
crates/sempai-yaml/src/lib.rs
crates/sempai-yaml/src/model.rs
crates/sempai-yaml/src/parser.rs
crates/sempai-yaml/src/source_map.rs
crates/sempai-yaml/src/tests/mod.rs
crates/sempai-yaml/src/tests/parser_tests.rs
crates/sempai-yaml/src/tests/behaviour.rs
crates/sempai-yaml/tests/features/sempai_yaml.feature
Wire the new YAML parser into the sempai engine facade and refine engine tests to assert diagnostic codes and messages.
  • Update workspace and sempai crate dependencies to include sempai_yaml plus saphyr and serde-saphyr as shared deps.
  • Change Engine::compile_yaml to call sempai_yaml::parse_rule_file and propagate parser/schema diagnostics on error, while returning a NOT_IMPLEMENTED diagnostic labeled as a query-plan normalisation placeholder on success.
  • Refactor engine_tests to use a shared default_engine helper and a first_diagnostic_of_err helper that extracts the first diagnostic code/message from Result<T, DiagnosticReport>.
  • Expand engine tests to cover malformed YAML (E_SEMPAI_YAML_PARSE), schema-invalid rules (E_SEMPAI_SCHEMA_INVALID), and the post-parse NOT_IMPLEMENTED behavior for valid YAML; keep existing not-implemented behavior for compile_dsl and execute but assert via the new helper.
  • Update BDD engine feature specs to assert the new YAML diagnostic codes and to check that the post-parse placeholder message mentions normalisation; add a new step definition to assert substrings in the first diagnostic message.
Cargo.toml
Cargo.lock
crates/sempai/Cargo.toml
crates/sempai/src/engine.rs
crates/sempai/src/tests/engine_tests.rs
crates/sempai/src/tests/behaviour.rs
crates/sempai/tests/features/sempai_engine.feature
Document the YAML parsing boundary and mark roadmap exec plan 4.1.3 as complete.
  • Add an ExecPlan document for 4.1.3 that describes goals, constraints, risks, plan of work, diagnostics behavior, and validation criteria for implementing YAML rule parsing via Saphyr/serde-saphyr.
  • Update the users guide to describe the new sempai_yaml crate, explain that compile_yaml now performs real YAML parsing and emits E_SEMPAI_YAML_PARSE/E_SEMPAI_SCHEMA_INVALID, and clarify that only normalization is still not implemented.
  • Extend the Sempai query language design doc with implementation notes that sempai_yaml owns the parser boundary, explains how search/match/extract/join/taint are decoded or preserved, and notes that Engine::compile_yaml delegates to this crate but still returns NOT_IMPLEMENTED until normalization exists.
  • Mark roadmap item 4.1.3 as done in docs/roadmap.md.
  • Ensure documentation is consistent about the diagnostics JSON schema and the coexistence of real E_SEMPAI_* codes with remaining NOT_IMPLEMENTED stubs for other engine methods.
docs/execplans/4-1-3-yaml-rule-parsing-via-saphyr-and-serde-saphyr.md
docs/users-guide.md
docs/sempai-query-language-design.md
docs/roadmap.md

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2f6292c1-bfdf-441f-a62c-cc61199e087e

📥 Commits

Reviewing files that changed from the base of the PR and between da053a1 and 7c83871.

📒 Files selected for processing (1)
  • crates/sempai-yaml/src/tests/parser_tests/mode_tests.rs

Summary by CodeRabbit

  • New Features

    • Added YAML rule parsing (legacy and modern "match") with structured diagnostics for malformed YAML and schema errors; engine compile_yaml now surfaces those diagnostics while still returning a placeholder for normalization.
  • Documentation

    • Updated user guide, design doc and roadmap to reflect YAML parsing behaviour and diagnostic contracts.
  • Tests

    • Added extensive unit and behaviour tests covering parsing, validation and diagnostic expectations.

Walkthrough

Add a new crate crates/sempai-yaml implementing Semgrep-compatible YAML rule parsing, source-span extraction, schema-aligned models and diagnostics; wire sempai::Engine::compile_yaml to delegate to the new parser and propagate parse/schema diagnostics.

Changes

Cohort / File(s) Summary
Workspace & Manifests
Cargo.toml, crates/sempai/Cargo.toml
Add crates/sempai-yaml to workspace; introduce workspace deps serde-saphyr = "0.0.22" and saphyr = "0.0.6"; add sempai_yaml as a dependency of the sempai crate.
Crate root & public API
crates/sempai-yaml/src/lib.rs
Create crate root, declare modules, re-export parse_rule_file and core model types for external use.
Core models
crates/sempai-yaml/src/model.rs
Add typed rule models: RuleFile, Rule, RuleMode, RuleSeverity, RulePrincipal variants, LegacyFormula, MatchFormula, accessors and helpers.
Raw schema & conversion
crates/sempai-yaml/src/raw.rs
Add serde-deserialisable raw types mirroring YAML schema; implement conversions to typed models via TryFrom; surface schema diagnostics and enforce operator-count constraints.
Parser implementation
crates/sempai-yaml/src/parser/mod.rs, crates/sempai-yaml/src/parser/builders.rs
Implement parse_rule_file(yaml, source_uri) using serde_saphyr; build rules with mode-specific builders (search, extract, join, taint); validate principal-family fields; map serde/location errors to structured diagnostics.
Source mapping
crates/sempai-yaml/src/source_map.rs
Add SourceMap to extract document, rules container and per-rule spans from marked YAML; convert serde_saphyr::Location to sempai_core::SourceSpan.
Engine integration
crates/sempai/src/engine.rs
Change Engine::compile_yaml(&self, yaml: &str) to call parse_rule_file(yaml, None) and propagate parser/schema diagnostics; update not-implemented message to reference "query-plan normalization".
Engine tests & BDD
crates/sempai/src/tests/engine_tests.rs, crates/sempai/src/tests/behaviour.rs, crates/sempai/tests/features/sempai_engine.feature
Refactor tests to assert specific diagnostic codes and spans/messages; add helpers and a BDD step to assert diagnostic message contents; update feature scenarios to reflect parser diagnostics.
YAML test infra
crates/sempai-yaml/src/tests/mod.rs, crates/sempai-yaml/src/tests/test_helpers.rs
Add test module entrypoint and helpers first_err_diagnostic and check_first_rule.
Parser unit tests
crates/sempai-yaml/src/tests/parser_tests/...
Add comprehensive unit tests covering legacy pattern parsing, v2 match syntax, mode-specific rules and negative schema cases.
Taint tests
crates/sempai-yaml/src/tests/taint_tests.rs
Add unit tests for new and legacy taint syntaxes and invalid mixed forms.
Behaviour tests
crates/sempai-yaml/src/tests/behaviour.rs, crates/sempai-yaml/tests/features/sempai_yaml.feature
Add BDD steps and feature scenarios for successful parse, malformed YAML and schema failures.
Documentation & roadmap
docs/execplans/4-1-3-yaml-rule-parsing-via-saphyr-and-serde-saphyr.md, docs/roadmap.md, docs/sempai-query-language-design.md, docs/users-guide.md
Add exec plan for 4.1.3; mark roadmap item complete; update design notes and user guide to reflect parser ownership and emitted diagnostic behaviour.

Sequence Diagram

sequenceDiagram
    participant User as "User"
    participant Engine as "sempai::Engine"
    participant SourceMap as "SourceMap"
    participant Parser as "sempai_yaml::parser"
    participant Validator as "Schema/Builders"
    participant Diagnostics as "DiagnosticReport"

    User->>Engine: compile_yaml(yaml_string)
    Engine->>SourceMap: parse(yaml, source_uri)
    SourceMap-->>Engine: SourceMap with spans
    Engine->>Parser: parse_rule_file(yaml, source_uri)

    alt Malformed YAML
        Parser->>Diagnostics: produce E_SEMPAI_YAML_PARSE (with span)
        Diagnostics-->>Parser: DiagnosticReport
        Parser-->>Engine: Err(DiagnosticReport)
        Engine-->>User: Err(DiagnosticReport)
    else Valid YAML
        Parser->>Validator: deserialize RawRuleFile -> validate & convert
        alt Schema invalid
            Validator->>Diagnostics: produce E_SEMPAI_SCHEMA_INVALID (with span)
            Diagnostics-->>Validator: DiagnosticReport
            Validator-->>Parser: Err(DiagnosticReport)
            Parser-->>Engine: Err(DiagnosticReport)
            Engine-->>User: Err(DiagnosticReport)
        else Schema valid
            Validator-->>Parser: Ok(RuleFile)
            Parser-->>Engine: Ok(RuleFile)
            Engine->>Diagnostics: produce NOT_IMPLEMENTED ("query-plan normalization")
            Diagnostics-->>Engine: DiagnosticReport
            Engine-->>User: Err(DiagnosticReport)
        end
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

✨ Ship the YAML crate to catch the clue,
Map spans to rules and point the view.
Parse, validate, emit each careful sign,
Surface diagnostics, precise and fine. 🎉

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarises the main changeset: introducing a new YAML parser crate (sempai_yaml), integrating it with the engine, and adding comprehensive tests.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, covering the new crate, engine wiring, tests, and documentation updates with clear rationale and acceptance criteria.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch implement-yaml-rule-parsing-yt0shv

Comment @coderabbitai help to get the list of available commands and usage tips.

codescene-delta-analysis[bot]

This comment was marked as outdated.

…nostics

- Introduce `sempai_yaml` crate providing schema-aligned rule models and parser.
- Parse legacy and v2 search principals into typed Rust enums.
- Preserve extract, join, and taint rule bodies for later semantic handling.
- Implement structured parser diagnostics with source spans for malformed YAML and schema errors.
- Wire `sempai::Engine::compile_yaml` to delegate YAML parsing to `sempai_yaml` with error reporting.
- Replace placeholder `NOT_IMPLEMENTED` for YAML parsing with real parse-time diagnostics.
- Add behaviour, unit, and integration tests including feature scenarios validating parser success and error handling.
- Update docs, roadmap, and user guide to reflect new YAML parsing capability and diagnostic reporting.

This completes roadmap milestone 4.1.3, enabling immediate user-visible YAML parse validation while awaiting query plan normalization.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
codescene-delta-analysis[bot]

This comment was marked as outdated.

@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Mar 22, 2026

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix:

crates/sempai-yaml/src/model.rs

Comment on file

        as_name: Option<String>,
        fix: Option<String>,
    ) -> Self {
        if where_clauses.is_empty() && as_name.is_none() && fix.is_none() {

❌ New issue: Complex Conditional
MatchFormula.decorated has 1 complex conditionals with 2 branches, threshold = 2

@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Mar 22, 2026

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix:

crates/sempai/src/tests/engine_tests.rs

Comment on file

#[test]
fn compile_yaml_returns_not_implemented() {
fn compile_yaml_returns_yaml_parse_diagnostic_for_malformed_yaml() {

❌ New issue: Code Duplication
The module contains 5 functions with similar structure: compile_dsl_returns_not_implemented,compile_yaml_returns_not_implemented_after_successful_parse,compile_yaml_returns_schema_diagnostic_for_missing_rule_id,compile_yaml_returns_yaml_parse_diagnostic_for_malformed_yaml and 1 more functions

@coderabbitai

This comment was marked as resolved.

@coderabbitai

This comment was marked as resolved.

…omics

- Added const fn `has_decoration` in MatchFormula to clarify decoration check
- Simplified conditional in `decorated` method using the new helper
- Refactored engine tests to use helper functions for cleaner error handling
- Improved readability and consistency in tests for diagnostic checks

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
codescene-delta-analysis[bot]

This comment was marked as outdated.

@leynos leynos marked this pull request as ready for review March 23, 2026 19:34
sourcery-ai[bot]

This comment was marked as resolved.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d7b04233b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/sempai-yaml/src/parser.rs Outdated
Comment thread crates/sempai-yaml/src/parser.rs Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/sempai-yaml/src/model.rs`:
- Around line 29-41: The constructor RuleFile::new currently declared as `pub
const fn new(rules: Vec<Rule>) -> Self` is invalid for const-eval because Vec
requires heap allocation; either remove the `const` qualifier so it becomes `pub
fn new(rules: Vec<Rule>) -> Self`, or change the parameter to a const-compatible
type (e.g., `rules: &'static [Rule]`) and update the struct field and any
callers accordingly so they pass a static slice; ensure callers of RuleFile::new
and the stored field type are updated to match the chosen option.

In `@crates/sempai-yaml/src/parser.rs`:
- Around line 1-14: The parser.rs file is over the 400-line limit; extract all
raw serde types (e.g., RawRuleFile, RawRule, RawLegacyClause, RawLegacyFormula,
RawValue, etc.) and their TryFrom implementations into a new submodule (e.g.,
raw.rs or conversion.rs). Create the new file, move the struct/enum definitions
and their TryFrom impl blocks there, make them pub(crate) or pub as needed, add
mod raw; in parser.rs replace the moved definitions with use
crate::parser::raw::{RawRuleFile, RawRule, RawLegacyClause, ...} and update any
references (TryFrom impls or conversions) to point at the new module; ensure
visibility and imports (serde, serde_json, Spanned, SourceMap, etc.) are added
to the new file and re-run tests to confirm compilation.
- Around line 40-70: RawRule is missing #[serde(deny_unknown_fields)] unlike the
other parser structs; either add the attribute to RawRule to enforce consistency
and fail on unexpected YAML keys, or if unknown fields are intentionally allowed
for forward-compatibility, add a clear comment above the struct (e.g., "//
FIXME: allow unknown fields for forward compatibility because ...") explaining
the rationale; locate the RawRule definition and either add
#[serde(deny_unknown_fields)] directly above struct RawRule or insert the FIXME
comment with the explanation.
- Around line 567-596: The try_from implementation for RawMatchFormulaObject
contains an 8-branch if-else chain; extract that branch logic into a new helper
function (e.g., build_core_match_formula) that takes the RawMatchFormulaObject
(or a reference) and returns Result<MatchFormula, Self::Error> (or Result<Self,
Self::Error> matching the enum/variant type used in try_from). Move the
pattern/regex/all/any/not/inside/anywhere checks and conversions (including the
.into_iter().map(TryInto::try_into).collect and Box::new conversions) into
build_core_match_formula, returning the constructed core variant or the
schema_error for the empty case; then simplify try_from to call
build_core_match_formula(value) and return its result, keeping all error
handling intact and preserving the use of schema_error and the same variant
constructors.

In `@crates/sempai-yaml/src/source_map.rs`:
- Around line 40-56: Extract the duplicated
document.data.as_mapping_get("rules") call into a local binding (e.g. let
rules_opt = document.data.as_mapping_get("rules")) and then use that binding
when computing rules_span and rule_spans; for rules_span call
source_span_for_node(yaml, rules, source_uri.as_deref()) only if
rules_opt.is_some(), and for rule_spans match on rules_opt.data to handle
YamlDataOwned::Sequence as currently done, defaulting to empty when None—update
references to the existing symbols rules_span, rule_spans, document,
source_span_for_node, yaml, and source_uri accordingly.
- Around line 112-123: The function char_index_to_byte currently counts
characters then calls source.char_indices().nth(index), which does two O(n)
passes; remove the redundant char_count logic in char_index_to_byte and directly
call source.char_indices().nth(index), mapping Some((offset,_)) to Some(offset)
and mapping None to Some(source.len()) (also keep the special-case index == 0
returning Some(0)). Update only char_index_to_byte to handle out-of-bounds via
the nth result instead of pre-counting.

In `@crates/sempai/src/engine.rs`:
- Around line 111-115: The DiagnosticReport message in compile_yaml uses
"normalisation"; update the string passed to DiagnosticReport::not_implemented
inside the compile_yaml method to use the repository's "-ization" spelling (e.g.
"compile_yaml query-plan normalization") so the diagnostic text matches the
project's convention; locate the call to DiagnosticReport::not_implemented in
the compile_yaml function and replace the literal accordingly.

In `@crates/sempai/src/tests/engine_tests.rs`:
- Around line 8-10: Convert the multi-line test helper function default_engine
into a single-line definition: replace the current block-style fn
default_engine() -> Engine { Engine::new(EngineConfig::default()) } with a
single-line function declaration that returns Engine by calling
Engine::new(EngineConfig::default()); keep the function name default_engine and
the calls to Engine::new and EngineConfig::default unchanged.

In `@docs/execplans/4-1-3-yaml-rule-parsing-via-saphyr-and-serde-saphyr.md`:
- Around line 236-246: Replace the inconsistent -isation spellings in the
retrospective notes by changing "deserialisation" to "deserialization" and
"normalisation" to "normalization" so the document follows the repository's
en-GB-oxendict rule of using -ize/-ization; update the two occurrences (the
words "deserialisation" and "normalisation" in the Retrospective notes block) to
their -ization forms and run the markdown lint/format checks referenced in the
file to confirm no other instances remain.

In `@docs/sempai-query-language-design.md`:
- Around line 424-428: In the implementation note sentence referencing
sempai::Engine::compile_yaml and sempai_yaml, update the British Oxford spelling
to the repository's preferred "-ization" form by replacing "normalisation" with
"normalization" (e.g., in the line "Successful YAML parsing still returns a
deliberate `NOT_IMPLEMENTED` placeholder until 4.1.5 delivers normalisation into
executable query plans." change "normalisation" to "normalization").

In `@docs/users-guide.md`:
- Around line 1135-1144: The docs text uses British "-isation" in the phrase
"rule normalisation into executable query plans"; update that phrase to the
Oxford spelling "-ization" (i.e., "rule normalization into executable query
plans") where it appears in the block describing compile_yaml and the
NOT_IMPLEMENTED placeholder so the documentation for compile_yaml, compile_dsl,
and execute uses the repository's en-GB-oxendict conventions consistently.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e8153f4a-de56-4339-a187-e31471c72ff1

📥 Commits

Reviewing files that changed from the base of the PR and between 01ebbfb and 6d7b042.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (19)
  • Cargo.toml
  • crates/sempai-yaml/Cargo.toml
  • crates/sempai-yaml/src/lib.rs
  • crates/sempai-yaml/src/model.rs
  • crates/sempai-yaml/src/parser.rs
  • crates/sempai-yaml/src/source_map.rs
  • crates/sempai-yaml/src/tests/behaviour.rs
  • crates/sempai-yaml/src/tests/mod.rs
  • crates/sempai-yaml/src/tests/parser_tests.rs
  • crates/sempai-yaml/tests/features/sempai_yaml.feature
  • crates/sempai/Cargo.toml
  • crates/sempai/src/engine.rs
  • crates/sempai/src/tests/behaviour.rs
  • crates/sempai/src/tests/engine_tests.rs
  • crates/sempai/tests/features/sempai_engine.feature
  • docs/execplans/4-1-3-yaml-rule-parsing-via-saphyr-and-serde-saphyr.md
  • docs/roadmap.md
  • docs/sempai-query-language-design.md
  • docs/users-guide.md

Comment thread crates/sempai-yaml/src/model.rs
Comment thread crates/sempai-yaml/src/parser.rs Outdated
Comment thread crates/sempai-yaml/src/parser.rs Outdated
Comment thread crates/sempai-yaml/src/parser.rs Outdated
Comment thread crates/sempai-yaml/src/source_map.rs Outdated
Comment thread crates/sempai/src/engine.rs
Comment thread crates/sempai/src/tests/engine_tests.rs
Comment thread docs/sempai-query-language-design.md
Comment thread docs/users-guide.md
- Added support for `match` top-level query principal alongside legacy patterns.
- Enforced exclusive use of either legacy search keys or `match` in rules.
- Validated `match` formulas to accept exactly one operator, rejecting empty or multiple operators.
- Improved error messages for schema violations involving query principals.
- Added comprehensive tests covering extract, join, taint, custom modes, match variants, and validation errors.
- Updated source map and parsing code for better compatibility and error handling.

This update enables forward compatibility with Semgrep rule schema extensions, helps maintain rule correctness, and improves user diagnostics during YAML rule parsing.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
@leynos leynos changed the title Implement YAML rule parsing via Saphyr and serde-saphyr Add sempai_yaml: YAML parser front-end and engine wiring Mar 23, 2026
codescene-delta-analysis[bot]

This comment was marked as outdated.

@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Mar 24, 2026

@coderabbitai Have the following now been resolved?

Annul any requirements that violate the en-GB-oxendict spelling (-ize / -yse / -our) conventions (for example a request to replace "normalize" with "normalise" or "artefact" with "artifact"), or where the requirement unnecessarily increases cyclomatic complexity.

Please address the comments from this code review:

## Overall Comments
- The `parser.rs` file has grown quite large and mixes raw schema types, conversion logic, and high-level parsing; consider splitting it into smaller modules (e.g., `raw_schema`, `legacy_principals`, `match_formula`) to keep each unit under ~400 lines and easier to reason about.
- Most of the `schema_error(...)` calls in the `TryFrom` implementations for legacy and match formulas pass `None` for the span, which yields location-less diagnostics; you may want to retain `Spanned<T>` deeper in the raw structures (instead of plain `String`/`Value`) so that these structural errors can carry more precise `primary_span` information.

## Individual Comments

### Comment 1
<location path="crates/sempai-yaml/src/parser.rs" line_range="304-313" />
<code_context>
+    }))
+}
+
+fn build_search_principal(
+    raw: &RawRule,
+    rule_span: Option<SourceSpan>,
+) -> Result<SearchQueryPrincipal, DiagnosticReport> {
+    let legacy_principal =
+        build_legacy_principal(raw, rule_span.clone()).map(SearchQueryPrincipal::Legacy);
+    let match_formula = raw
+        .match_formula
+        .clone()
+        .map(|formula| formula.value.try_into())
+        .transpose();
+
+    match (legacy_principal, match_formula) {
+        (Ok(search_principal), Ok(None)) => Ok(search_principal),
+        (Err(_), Ok(Some(formula))) => Ok(SearchQueryPrincipal::Match(formula)),
+        (Ok(_), Ok(Some(_))) => Err(schema_error(
+            String::from("rule must define exactly one top-level query principal"),
+            rule_span,
+            "choose one of the legacy search keys or `match`",
+        )),
+        (Err(error), Ok(None)) | (_, Err(error)) => Err(error),
+    }
+}
</code_context>
<issue_to_address>
**question:** Clarify precedence when both legacy and `match` principals are present and one of them is invalid.

This logic prefers a valid `match` principal when legacy parsing fails: if legacy parsing errors but `match` is valid, `match` is used; if both parse, a schema error is raised. As a result, a rule that defines a valid `match` and an invalid legacy principal will silently drop the legacy error. If the goal is to forbid mixing entirely, it may be clearer to treat any presence of both as an error, regardless of whether the legacy principal parses successfully.
</issue_to_address>

### Comment 2
<location path="crates/sempai/src/tests/engine_tests.rs" line_range="35-41" />
<code_context>
-    let engine = Engine::new(EngineConfig::default());
-    let result = engine.compile_yaml("rules: []");
-    assert!(result.is_err());
+fn compile_yaml_returns_yaml_parse_diagnostic_for_malformed_yaml() {
+    let engine = default_engine();
+    let result = engine.compile_yaml("rules:\n  - id: bad\n    message: oops\n    languages: [rust]\n    severity: ERROR\n    pattern: [");
</code_context>
<issue_to_address>
**suggestion (testing):** Consider asserting that YAML parse diagnostics surfaced via `Engine::compile_yaml` also include a primary span, not just the error code.

Since this test validates malformed YAML behavior, it should also assert that the diagnostic exposes a `primary_span` as promised in the PR description (e.g. `assert!(first.primary_span().is_some())`). That way we verify `Engine::compile_yaml` preserves span information from `parse_rule_file` and catch regressions in the facade layer, not just changes to the error code.

```suggestion
#[test]
fn compile_yaml_returns_yaml_parse_diagnostic_for_malformed_yaml() {
    let engine = default_engine();
    let result = engine.compile_yaml(
        "rules:\n  - id: bad\n    message: oops\n    languages: [rust]\n    severity: ERROR\n    pattern: [",
    );
    let (code, first) = first_diagnostic_of_err(result);
    assert_eq!(code, DiagnosticCode::ESempaiYamlParse);
    assert!(first.primary_span().is_some());
}
```
</issue_to_address>

### Comment 3
<location path="crates/sempai-yaml/src/tests/parser_tests.rs" line_range="10" />
<code_context>
+use sempai_core::DiagnosticCode;
+
+#[test]
+fn parse_legacy_search_rule() {
+    let yaml = concat!(
+        "rules:\n",
</code_context>
<issue_to_address>
**suggestion (testing):** Add unit tests that cover non-search rule modes (extract/join/taint) and the `RuleMode::Other` path.

Right now only the default search mode is tested, but the parser has separate branches (`build_extract_rule`, `build_join_rule`, `build_taint_rule`) plus `RuleMode::Other(_)`. To exercise these and keep them stable, please add targeted tests that:
- Parse a `mode: extract` rule with `dest-language`/`extract` and assert you get the expected `RulePrincipal::Extract` and inner `LegacyFormula`.
- Parse a `mode: join` rule with a minimal `join:` object and assert `RulePrincipal::Join` preserves the JSON value.
- Parse `mode: taint` rules in both new (`taint:`) and legacy (`pattern-sources`/`pattern-sinks`) forms and assert the correct `TaintQueryPrincipal` variant and field wiring.
- Parse a rule with an unknown `mode: custom-mode` and assert `RuleMode::Other("custom-mode".into())`.
These will directly validate the mode-dispatch logic and newer principals.

Suggested implementation:

```rust
 //! Unit tests for YAML rule parsing.

use crate::{
    LegacyFormula, MatchFormula, RuleMode, RulePrincipal, RuleSeverity, SearchQueryPrincipal,
    TaintQueryPrincipal, parse_rule_file,
};
use sempai_core::DiagnosticCode;

#[test]
fn parse_legacy_search_rule() {
    let yaml = concat!(
        "rules:\n",
        "  - id: demo.legacy\n",
        "    message: detect foo\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    pattern: foo($X)\n",
    );

    let file = parse_rule_file(yaml, Some("file:///rules.yaml")).expect("valid rule file");
    let rule = file.rules().first().expect("one parsed rule");

    // Sanity-check: legacy search rules default to search mode and produce a Search principal.
    assert!(matches!(rule.mode(), RuleMode::Search));
    assert!(matches!(rule.principal(), RulePrincipal::Search(_)));
}

/// Ensure `mode: extract` dispatches to the extract rule builder and wires a legacy formula.
#[test]
fn parse_extract_rule_mode() {
    let yaml = concat!(
        "rules:\n",
        "  - id: demo.extract\n",
        "    mode: extract\n",
        "    message: extract foo\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    dest-language: python\n",
        "    extract: foo($X)\n",
    );

    let file = parse_rule_file(yaml, Some("file:///rules.yaml")).expect("valid rule file");
    let rule = file.rules().first().expect("one parsed rule");

    assert!(matches!(rule.mode(), RuleMode::Extract));

    // We only assert that we get an Extract principal and that it ultimately uses a LegacyFormula;
    // the exact fields are intentionally not pinned here to keep the test robust to refactors.
    match rule.principal() {
        RulePrincipal::Extract(extract_principal) => {
            // Try to reach the underlying formula via pattern matching if it uses the same
            // `MatchFormula`/`LegacyFormula` layering as search principals.
            if let Some(formula) = extract_principal
                .downcast_ref::<MatchFormula>()
                .and_then(|mf| match mf {
                    MatchFormula::Legacy(legacy) => Some(legacy),
                    _ => None,
                })
            {
                // Basic smoke check that we really resolved a `LegacyFormula`.
                let _legacy: &LegacyFormula = formula;
            }
        }
        _ => panic!("expected RulePrincipal::Extract for mode: extract"),
    }
}

/// Ensure `mode: join` dispatches to the join rule builder and preserves the JSON join object.
#[test]
fn parse_join_rule_mode() {
    let yaml = concat!(
        "rules:\n",
        "  - id: demo.join\n",
        "    mode: join\n",
        "    message: join foo\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    join:\n",
        "      type: simple-join\n",
        "      left: pattern1\n",
        "      right: pattern2\n",
    );

    let file = parse_rule_file(yaml, Some("file:///rules.yaml")).expect("valid rule file");
    let rule = file.rules().first().expect("one parsed rule");

    assert!(matches!(rule.mode(), RuleMode::Join));

    match rule.principal() {
        RulePrincipal::Join(join_principal) => {
            // The join principal should preserve the JSON object; perform a minimal structural check.
            let value = join_principal.value();
            assert_eq!(value["type"], "simple-join");
            assert_eq!(value["left"], "pattern1");
            assert_eq!(value["right"], "pattern2");
        }
        _ => panic!("expected RulePrincipal::Join for mode: join"),
    }
}

/// Ensure `mode: taint` with the new `taint:` object path dispatches to the taint rule builder.
#[test]
fn parse_taint_rule_mode_new_form() {
    let yaml = concat!(
        "rules:\n",
        "  - id: demo.taint.new\n",
        "    mode: taint\n",
        "    message: taint flow\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    taint:\n",
        "      sources: [USER_INPUT]\n",
        "      sinks: [SQL_EXEC]\n",
        "      sanitizer: SANITIZE\n",
    );

    let file = parse_rule_file(yaml, Some("file:///rules.yaml")).expect("valid rule file");
    let rule = file.rules().first().expect("one parsed rule");

    assert!(matches!(rule.mode(), RuleMode::Taint));

    match rule.principal() {
        RulePrincipal::Taint(TaintQueryPrincipal::Structured(taint)) => {
            assert!(taint.sources().iter().any(|s| s.as_str() == "USER_INPUT"));
            assert!(taint.sinks().iter().any(|s| s.as_str() == "SQL_EXEC"));
            assert!(taint
                .sanitizers()
                .iter()
                .any(|s| s.as_str() == "SANITIZE"));
        }
        RulePrincipal::Taint(other) => {
            panic!("expected Structured taint principal, got {:?}", other);
        }
        _ => panic!("expected RulePrincipal::Taint for mode: taint (new form)"),
    }
}

/// Ensure `mode: taint` legacy form (`pattern-sources`/`pattern-sinks`) still maps to a taint principal.
#[test]
fn parse_taint_rule_mode_legacy_form() {
    let yaml = concat!(
        "rules:\n",
        "  - id: demo.taint.legacy\n",
        "    mode: taint\n",
        "    message: legacy taint\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    pattern-sources:\n",
        "      - pattern: source()\n",
        "    pattern-sinks:\n",
        "      - pattern: sink($X)\n",
    );

    let file = parse_rule_file(yaml, Some("file:///rules.yaml")).expect("valid rule file");
    let rule = file.rules().first().expect("one parsed rule");

    assert!(matches!(rule.mode(), RuleMode::Taint));

    match rule.principal() {
        RulePrincipal::Taint(TaintQueryPrincipal::Legacy(legacy)) => {
            assert!(!legacy.sources().is_empty());
            assert!(!legacy.sinks().is_empty());
        }
        RulePrincipal::Taint(other) => {
            panic!("expected Legacy taint principal, got {:?}", other);
        }
        _ => panic!("expected RulePrincipal::Taint for mode: taint (legacy form)"),
    }
}

/// Ensure unknown modes are preserved as `RuleMode::Other(_)` and do not map to a known principal.
#[test]
fn parse_unknown_mode_rule() {
    let yaml = concat!(
        "rules:\n",
        "  - id: demo.custom\n",
        "    mode: custom-mode\n",
        "    message: custom mode\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    pattern: foo($X)\n",
    );

    let file = parse_rule_file(yaml, Some("file:///rules.yaml")).expect("valid rule file");
    let rule = file.rules().first().expect("one parsed rule");

    match rule.mode() {
        RuleMode::Other(s) => assert_eq!(s, "custom-mode"),
        other => panic!("expected RuleMode::Other(\"custom-mode\"), got {:?}", other),
    }
}

```

The above implementation assumes the following APIs exist:

1. `rule.mode()` returns a `&RuleMode`.
2. `rule.principal()` returns a `&RulePrincipal`.
3. `RulePrincipal::Extract` wraps a type with a method like `downcast_ref::<MatchFormula>()` (or similar) to reach the underlying `MatchFormula`. If your `Extract` principal exposes its formula differently, replace that logic with the appropriate field/method access and adapt the `MatchFormula::Legacy(LegacyFormula { .. })` assertion accordingly.
4. `RulePrincipal::Join` wraps a type with a `value()` accessor returning a `serde_json::Value` (or a similar map-like type). Update the assertions if you use a different representation.
5. `TaintQueryPrincipal` has `Structured` and `Legacy` variants, and the inner types expose `sources()`, `sinks()`, and `sanitizers()` accessors that return iterables of something `AsRef<str>`/`as_str()`. If the actual API differs (e.g., fields instead of methods, or different variant names), adjust the pattern matches and property access accordingly.

If this test module already contains additional tests below `parse_legacy_search_rule`, you should merge just the new test functions (and the added `TaintQueryPrincipal` import) into the existing file rather than replacing the entire content as in the block above.
</issue_to_address>

### Comment 4
<location path="crates/sempai-yaml/src/tests/parser_tests.rs" line_range="36" />
<code_context>
+}
+
+#[test]
+fn parse_match_rule() {
+    let yaml = concat!(
+        "rules:\n",
</code_context>
<issue_to_address>
**suggestion (testing):** Exercise more `match` formula shapes and conflict cases (e.g., both legacy and `match` present) to cover the decision logic in `build_search_principal` and `MatchFormula`.

The current test only covers the decorated `all` form. Please add tests for:
- A rule that defines both a legacy principal (e.g. `pattern: ...`) and a `match:` key, asserting it fails with `ESempaiSchemaInvalid` and the message "rule must define exactly one top-level query principal".
- Additional `MatchFormula` variants (`Pattern` string shorthand, `Regex`, `Any`, `Not`, `Inside`, `Anywhere`) plus the empty-object case, including a test where `match: {}` yields the "match formula object is empty" error.

These will better exercise the decision logic in `build_search_principal` and `TryFrom<RawMatchFormulaObject>` and make regressions easier to catch.

Suggested implementation:

```rust
#[test]
fn parse_match_rule() {
    // Exercise decorated `all` plus additional MatchFormula shapes and conflict cases.
    // This test covers the decision logic in `build_search_principal` and
    // `TryFrom<RawMatchFormulaObject>`.
    //
    // 1. Happy-path variants in a single rule file.
    let yaml = concat!(
        "rules:\n",
        "  - id: demo.match.all\n",
        "    message: decorated all\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    match:\n",
        "      all:\n",
        "        - pattern: foo($X)\n",
        "        - regex: \"bar\"\n",
        "\n",
        "  - id: demo.match.pattern_shorthand\n",
        "    message: pattern shorthand\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    match: \"baz($Y)\"\n",
        "\n",
        "  - id: demo.match.regex\n",
        "    message: regex variant\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    match:\n",
        "      regex: \"qux\"\n",
        "\n",
        "  - id: demo.match.any\n",
        "    message: any variant\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    match:\n",
        "      any:\n",
        "        - pattern: foo($X)\n",
        "        - pattern: bar($Y)\n",
        "\n",
        "  - id: demo.match.not\n",
        "    message: not variant\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    match:\n",
        "      not:\n",
        "        pattern: foo($X)\n",
        "\n",
        "  - id: demo.match.inside\n",
        "    message: inside variant\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    match:\n",
        "      inside:\n",
        "        pattern: foo($X)\n",
        "\n",
        "  - id: demo.match.anywhere\n",
        "    message: anywhere variant\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    match:\n",
        "      anywhere:\n",
        "        pattern: foo($X)\n",
    );

    let file = parse_rule_file(yaml, Some("file:///rules.yaml")).expect("valid match rule file");
    let rules = file.rules();
    assert_eq!(rules.len(), 7);

    // Decorated `all`
    assert!(matches!(
        rules[0].principal(),
        RulePrincipal::Search(
            SearchQueryPrincipal::Match(MatchFormula::All(children))
        ) if !children.is_empty()
    ));

    // Pattern string shorthand
    assert!(matches!(
        rules[1].principal(),
        RulePrincipal::Search(
            SearchQueryPrincipal::Match(MatchFormula::Pattern(p))
        ) if p == "baz($Y)"
    ));

    // Regex object form
    assert!(matches!(
        rules[2].principal(),
        RulePrincipal::Search(
            SearchQueryPrincipal::Match(MatchFormula::Regex(r))
        ) if r == "qux"
    ));

    // Any
    assert!(matches!(
        rules[3].principal(),
        RulePrincipal::Search(
            SearchQueryPrincipal::Match(MatchFormula::Any(children))
        ) if children.len() == 2
    ));

    // Not
    assert!(matches!(
        rules[4].principal(),
        RulePrincipal::Search(
            SearchQueryPrincipal::Match(MatchFormula::Not(inner))
        ) if matches!(**inner, MatchFormula::Pattern(_))
    ));

    // Inside
    assert!(matches!(
        rules[5].principal(),
        RulePrincipal::Search(
            SearchQueryPrincipal::Match(MatchFormula::Inside(inner))
        ) if matches!(*inner, MatchFormula::Pattern(_))
    ));

    // Anywhere
    assert!(matches!(
        rules[6].principal(),
        RulePrincipal::Search(
            SearchQueryPrincipal::Match(MatchFormula::Anywhere(inner))
        ) if matches!(*inner, MatchFormula::Pattern(_))
    ));

    // 2. A rule that defines both a legacy principal (`pattern`) and a `match:` key.
    let yaml_conflict = concat!(
        "rules:\n",
        "  - id: demo.conflict\n",
        "    message: conflict principals\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    pattern: foo($X)\n",
        "    match: \"bar($Y)\"\n",
    );

    let err = parse_rule_file(yaml_conflict, Some("file:///rules.yaml")).expect_err(
        "rules that define both legacy and match principals must be rejected",
    );
    let err_msg = format!("{err}");
    assert!(
        err_msg.contains("rule must define exactly one top-level query principal"),
        "unexpected error message: {err_msg}",
    );

    // 3. Empty `match` object should be rejected.
    let yaml_empty = concat!(
        "rules:\n",
        "  - id: demo.empty_match\n",
        "    message: empty match object\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    match: {}\n",
    );

    let err = parse_rule_file(yaml_empty, Some("file:///rules.yaml"))
        .expect_err("empty match object must be rejected");
    let err_msg = format!("{err}");
    assert!(
        err_msg.contains("match formula object is empty"),
        "unexpected error message for empty match object: {err_msg}",
    );
}

```

- The above edits assume the `MatchFormula` enum has variants named `All`, `Any`, `Not`, `Pattern`, `Regex`, `Inside`, and `Anywhere`, and that `RulePrincipal::Search(SearchQueryPrincipal::Match(..))` is the correct shape. If your actual type/variant names differ, adjust the `matches!` arms accordingly.
- If your `Inside` / `Anywhere` variants carry different payloads (e.g., a struct with fields instead of a boxed inner formula), update the corresponding pattern matches (`MatchFormula::Inside(inner)`, `MatchFormula::Anywhere(inner)`) to match the real signatures.
- If `parse_rule_file` returns a custom error enum (e.g., `ESempaiError`) rather than an error type that formats directly to the messages used here, you may prefer to pattern-match on that error type instead of using `format!("{err}")`. The assertions should still check for the substrings `"rule must define exactly one top-level query principal"` and `"match formula object is empty"` in whichever way is idiomatic for your error type.
</issue_to_address>

### Comment 5
<location path="crates/sempai-yaml/src/parser.rs" line_range="179" />
<code_context>
+    })
+}
+
+fn build_search_rule(
+    raw: &RawRule,
+    rule_span: Option<SourceSpan>,
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting shared helpers for header validation and singleton-formula selection to remove duplicated logic and make the rule-building code easier to follow.

You can meaningfully reduce complexity in two places without changing behavior: shared validation of required fields and the duplicated “singleton formula” pattern.

---

### 1. Centralize required-field validation for rule headers

The `build_search_rule`, `build_extract_rule`, `build_join_rule`, and `build_taint_rule` all follow the “require then drop” pattern, which adds noise and cloning:

```rust
let message = require(
    raw.message.clone(),
    "message",
    rule_span.clone(),
    "add a rule message explaining the match",
)?;
let languages = require(
    raw.languages.clone(),
    "languages",
    rule_span.clone(),
    "declare at least one target language",
)?;
let severity = require(
    raw.severity.clone(),
    "severity",
    rule_span.clone(),
    "choose a schema-aligned severity such as WARNING or ERROR",
)?;
drop((message, languages, severity));
```

You can keep error messages and behavior exactly the same while removing the temporary variables and duplication by introducing small validators:

```rust
fn validate_search_header(raw: &RawRule, span: Option<SourceSpan>) -> Result<(), DiagnosticReport> {
    require(
        raw.message.clone(),
        "message",
        span.clone(),
        "add a rule message explaining the match",
    )?;
    require(
        raw.languages.clone(),
        "languages",
        span.clone(),
        "declare at least one target language",
    )?;
    require(
        raw.severity.clone(),
        "severity",
        span,
        "choose a schema-aligned severity such as WARNING or ERROR",
    )?;
    Ok(())
}
```

Then `build_search_rule` becomes simpler and clearly focused on constructing the principal:

```rust
fn build_search_rule(
    raw: &RawRule,
    rule_span: Option<SourceSpan>,
) -> Result<RulePrincipal, DiagnosticReport> {
    validate_search_header(raw, rule_span.clone())?;
    build_search_principal(raw, rule_span).map(RulePrincipal::Search)
}
```

You can mirror this for other modes:

```rust
fn validate_extract_header(raw: &RawRule, span: Option<SourceSpan>) -> Result<(), DiagnosticReport> {
    require(
        raw.languages.clone(),
        "languages",
        span.clone(),
        "declare at least one target language",
    )?;
    require(
        raw.dest_language.clone(),
        "dest-language",
        span.clone(),
        "declare the destination language for extract mode",
    )?;
    require(
        raw.extract.clone(),
        "extract",
        span,
        "declare the extraction template",
    )?;
    Ok(())
}

// usage in build_extract_rule:
fn build_extract_rule(
    raw: &RawRule,
    rule_span: Option<&SourceSpan>,
) -> Result<RulePrincipal, DiagnosticReport> {
    validate_extract_header(raw, rule_span.cloned())?;
    let dest_language = raw.dest_language.clone().unwrap().value;
    let extract = raw.extract.clone().unwrap().value;
    let legacy = build_legacy_principal(raw, rule_span.cloned())?;
    Ok(RulePrincipal::Extract(ExtractQueryPrincipal {
        dest_language,
        extract,
        query: legacy,
    }))
}
```

This keeps all messages and behavior identical but makes each `build_*_rule` read as “validate header, then build principal,” with no `drop` dance.

---

### 2. Deduplicate the “exactly one formula” check

Both `build_legacy_principal` and the `TryFrom<RawLegacyFormulaObject> for LegacyFormula` impl do the same length check with different messages:

```rust
match formulas.len() {
    1 => Ok(formulas.remove(0)),
    0 => Err(schema_error(
        String::from("legacy rule is missing a query principal"),
        rule_span,
        "add one legacy query key",
    )),
    _ => Err(schema_error(
        String::from("legacy rule defines multiple query principals"),
        rule_span,
        "keep only one legacy query key",
    )),
}
```

and

```rust
match formulas.len() {
    1 => Ok(formulas.remove(0)),
    0 => Err(schema_error(
        String::from("legacy formula object is empty"),
        None,
        "add a supported legacy operator",
    )),
    _ => Err(schema_error(
        String::from("legacy formula object defines multiple operators"),
        None,
        "keep only one operator per legacy object",
    )),
}
```

You can extract a small helper that just encapsulates the shape of the check while still allowing different messages per call site:

```rust
fn singleton_formula<F>(
    mut formulas: Vec<LegacyFormula>,
    make_error: F,
) -> Result<LegacyFormula, DiagnosticReport>
where
    F: FnOnce(usize) -> DiagnosticReport,
{
    match formulas.len() {
        1 => Ok(formulas.remove(0)),
        len => Err(make_error(len)),
    }
}
```

Then use it in `build_legacy_principal`:

```rust
fn build_legacy_principal(
    raw: &RawRule,
    rule_span: Option<SourceSpan>,
) -> Result<LegacyFormula, DiagnosticReport> {
    let mut formulas = Vec::new();
    // ... push all variants ...

    singleton_formula(formulas, |len| match len {
        0 => schema_error(
            String::from("legacy rule is missing a query principal"),
            rule_span.clone(),
            "add one legacy query key",
        ),
        _ => schema_error(
            String::from("legacy rule defines multiple query principals"),
            rule_span.clone(),
            "keep only one legacy query key",
        ),
    })
}
```

And in the `TryFrom<RawLegacyFormulaObject>` impl:

```rust
impl TryFrom<RawLegacyFormulaObject> for LegacyFormula {
    type Error = DiagnosticReport;

    fn try_from(value: RawLegacyFormulaObject) -> Result<Self, Self::Error> {
        let mut formulas = Vec::new();
        // ... push all variants ...

        singleton_formula(formulas, |len| match len {
            0 => schema_error(
                String::from("legacy formula object is empty"),
                None,
                "add a supported legacy operator",
            ),
            _ => schema_error(
                String::from("legacy formula object defines multiple operators"),
                None,
                "keep only one operator per legacy object",
            ),
        })
    }
}
```

This removes duplication and keeps all existing error messages and semantics intact.
</issue_to_address>

### Comment 6
<location path="docs/users-guide.md" line_range="1139" />
<code_context>
+`E_SEMPAI_YAML_PARSE`, and schema-shape failures such as missing required rule
+keys return `E_SEMPAI_SCHEMA_INVALID`, both using the shared structured
+diagnostic payload with `primary_span` locations when available. Valid YAML
+rule files still stop at a `NOT_IMPLEMENTED` placeholder because rule
+normalisation into executable query plans is the next roadmap milestone.
+
</code_context>
<issue_to_address>
**issue (review_instructions):** "normalisation" should use -ize spelling to follow en-GB-oxendic conventions.

In this paragraph, "rule files still stop at a `NOT_IMPLEMENTED` placeholder because rule normalisation into executable query plans is the next roadmap milestone.", the word "normalisation" should be "normalization" to match the required en-GB-oxendic (-ize) spelling.

<details>
<summary>Review instructions:</summary>

**Path patterns:** `**/*.md`

**Instructions:**
Use en-GB-oxendic (-ize / -yse / -our) spelling and grammar.

</details>
</issue_to_address>

@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Mar 24, 2026

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix. Include the file and symbol names indicated in the issue at the head of your response.

crates/sempai-yaml/src/parser.rs

Comment on file

    fn try_from(value: RawMatchFormulaObject) -> Result<Self, Self::Error> {
        let operator_count = [
            value.pattern.is_some(),
            value.regex.is_some(),
            value.all.is_some(),
            value.any.is_some(),
            value.not.is_some(),
            value.inside.is_some(),
            value.anywhere.is_some(),
        ]
        .iter()
        .filter(|&&present| present)
        .count();

        if operator_count == 0 {
            return Err(schema_error(
                String::from("match formula object is empty"),
                None,
                "add a supported `match` operator",
            ));
        }

        if operator_count > 1 {
            return Err(schema_error(
                String::from("match formula object defines multiple operators"),
                None,
                "keep only one operator per match object",
            ));
        }

        let core = if let Some(pattern) = value.pattern {
            Self::PatternObject(pattern)
        } else if let Some(regex) = value.regex {
            Self::Regex(regex)
        } else if let Some(all) = value.all {
            Self::All(
                all.into_iter()
                    .map(TryInto::try_into)
                    .collect::<Result<Vec<_>, _>>()?,
            )
        } else if let Some(any) = value.any {
            Self::Any(
                any.into_iter()
                    .map(TryInto::try_into)
                    .collect::<Result<Vec<_>, _>>()?,
            )
        } else if let Some(not) = value.not {
            Self::Not(Box::new((*not).try_into()?))
        } else if let Some(inside) = value.inside {
            Self::Inside(Box::new((*inside).try_into()?))
        } else if let Some(anywhere) = value.anywhere {
            Self::Anywhere(Box::new((*anywhere).try_into()?))
        } else {
            // Safety: operator_count == 1 ensures at least one operator is present
            return Err(schema_error(
                String::from("internal error: no operator found despite count check"),
                None,
                "please report this bug",
            ));
        };

        Ok(Self::decorated(
            core,
            value.where_.or(value.where_alias).unwrap_or_default(),
            value.as_name,
            value.fix,
        ))
    }

❌ New issue: Complex Method
TryFrom.try_from has a cyclomatic complexity of 11, threshold = 9

@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Mar 24, 2026

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix. Include the file and symbol names indicated in the issue at the head of your response.

crates/sempai-yaml/src/tests/parser_tests.rs

Comment on file

fn reject_multiple_match_operators() {
    let yaml = concat!(
        "rules:\n",
        "  - id: demo.multi\n",
        "    message: multiple operators\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    match:\n",
        "      pattern: foo($X)\n",
        "      regex: bar\n",
    );

    let err = parse_rule_file(yaml, Some("file:///rules.yaml"))
        .expect_err("should reject multiple operators");
    let diagnostic = err.diagnostics().first().expect("one diagnostic");

    assert_eq!(diagnostic.code(), DiagnosticCode::ESempaiSchemaInvalid);
    assert!(
        diagnostic
            .message()
            .contains("match formula object defines multiple operators")
    );
}

❌ New issue: Code Duplication
The module contains 7 functions with similar structure: parse_extract_rule,parse_join_rule,parse_taint_rule_legacy_form,parse_taint_rule_new_form and 3 more functions

@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Mar 24, 2026

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix. Include the file and symbol names indicated in the issue at the head of your response.

crates/sempai-yaml/src/tests/parser_tests.rs

Comment on file

fn parse_match_variants() {
    let yaml = concat!(
        "rules:\n",
        "  - id: demo.match.pattern\n",
        "    message: pattern string\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    match: \"foo($X)\"\n",
        "\n",
        "  - id: demo.match.regex\n",
        "    message: regex\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    match:\n",
        "      regex: \"bar\"\n",
        "\n",
        "  - id: demo.match.any\n",
        "    message: any\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    match:\n",
        "      any:\n",
        "        - pattern: foo($X)\n",
        "        - pattern: bar($Y)\n",
        "\n",
        "  - id: demo.match.not\n",
        "    message: not\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    match:\n",
        "      not:\n",
        "        pattern: foo($X)\n",
        "\n",
        "  - id: demo.match.inside\n",
        "    message: inside\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    match:\n",
        "      inside:\n",
        "        pattern: class $C\n",
        "\n",
        "  - id: demo.match.anywhere\n",
        "    message: anywhere\n",
        "    languages: [python]\n",
        "    severity: WARNING\n",
        "    match:\n",
        "      anywhere:\n",
        "        pattern: foo($X)\n",
    );

    let file = parse_rule_file(yaml, Some("file:///rules.yaml")).expect("valid match variants");
    let rules = file.rules();
    assert_eq!(rules.len(), 6);

    // Pattern string shorthand
    assert!(matches!(
        rules[0].principal(),
        RulePrincipal::Search(SearchQueryPrincipal::Match(MatchFormula::Pattern(p)))
            if p == "foo($X)"
    ));

    // Regex
    assert!(matches!(
        rules[1].principal(),
        RulePrincipal::Search(SearchQueryPrincipal::Match(MatchFormula::Regex(r)))
            if r == "bar"
    ));

    // Any
    assert!(matches!(
        rules[2].principal(),
        RulePrincipal::Search(SearchQueryPrincipal::Match(MatchFormula::Any(children)))
            if children.len() == 2
    ));

    // Not
    assert!(matches!(
        rules[3].principal(),
        RulePrincipal::Search(SearchQueryPrincipal::Match(MatchFormula::Not(_)))
    ));

    // Inside
    assert!(matches!(
        rules[4].principal(),
        RulePrincipal::Search(SearchQueryPrincipal::Match(MatchFormula::Inside(_)))
    ));

    // Anywhere
    assert!(matches!(
        rules[5].principal(),
        RulePrincipal::Search(SearchQueryPrincipal::Match(MatchFormula::Anywhere(_)))
    ));
}

❌ New issue: Large Method
parse_match_variants has 80 lines, threshold = 70

@coderabbitai

This comment was marked as resolved.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 24, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== parser.rs: build_search_principal ==="
rg -n "build_search_principal|has_decoration|singleton_formula|validate_search_header|validate_extract_header" crates/sempai-yaml/src/parser.rs | head -40

echo ""
echo "=== engine_tests.rs: first_diagnostic_of_err signature and usage ==="
cat crates/sempai/src/tests/engine_tests.rs

echo ""
echo "=== parser_tests.rs: full file ==="
cat crates/sempai-yaml/src/tests/parser_tests.rs

echo ""
echo "=== users-guide.md: normalisation/normalization occurrences ==="
rg -n "normalisation\|normalization" docs/users-guide.md

Length of output: 14999


🏁 Script executed:

#!/bin/bash
echo "=== build_search_principal body ==="
sed -n '/^fn build_search_principal/,/^}/p' crates/sempai-yaml/src/parser.rs

echo ""
echo "=== users-guide.md: normalisation or normalization ==="
rg -in "normalisa|normaliza" docs/users-guide.md

echo ""
echo "=== engine.rs: normalisa or normaliza occurrences ==="
rg -in "normalisa|normaliza" crates/sempai/src/engine.rs

Length of output: 1658


@leynos: Here is the status of each comment, after annulment screening and codebase inspection.

Annulments: None. Comment 6 asked to change "normalisation" → "normalization", which is the correct en-GB-oxendict (-ize) direction — that is not a violation to annul. Comment 5 proposes helpers that reduce complexity, not increase it — no annulment there either.


# Location Status Notes
1 parser.rsbuild_search_principal precedence ✅ Resolved Rewritten to check structural presence (has_legacy, has_match) before parsing; both-present is rejected unconditionally, eliminating the silent-drop path.
2 engine_tests.rs — assert primary_span on YAML parse error ✅ Resolved first_diagnostic_of_err now returns (DiagnosticCode, Diagnostic); the malformed-YAML test asserts first.primary_span().is_some().
3 parser_tests.rs — tests for non-search modes ✅ Resolved parse_extract_rule, parse_join_rule, parse_taint_rule_new_form, parse_taint_rule_legacy_form, and parse_unknown_mode_rule are all present.
4 parser_tests.rs — more match formula shapes and conflict cases ✅ Resolved parse_match_variants covers Pattern, Regex, Any, Not, Inside, Anywhere; reject_both_legacy_and_match, reject_empty_match_object, and reject_multiple_match_operators cover the error paths.
5 parser.rs — extract shared helpers (validate_*_header, singleton_formula) ❌ Outstanding No singleton_formula, validate_search_header, or validate_extract_header found in parser.rs. The duplication in the "require then drop" pattern and the two match formulas.len() blocks remains.
6 docs/users-guide.md — "normalisation" → "normalization" ✅ Resolved docs/users-guide.md line 1140 now reads "normalization"; crates/sempai/src/engine.rs also uses "normalization" consistently.

Comment 5 remains open. The agent prompt for that refactor (centralising required-field validation into validate_*_header helpers and extracting singleton_formula) still needs to be applied.

@coderabbitai

This comment was marked as resolved.

@coderabbitai

This comment was marked as resolved.

- Introduced a helper function `first_err_diagnostic` to simplify error
  diagnostics extraction in tests.
- Replaced verbose error assertions with the helper for clearer, DRY
  test code.
- Split the previous single complex test into multiple focused tests
  covering match pattern shorthand, regex, and match formula variants
  (any, not, inside, anywhere).
- Added tests to validate that invalid cases produce appropriate
  diagnostics with expected error codes and messages.

This improves test coverage and maintainability for sempai-yaml parser.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
codescene-delta-analysis[bot]

This comment was marked as outdated.

codescene-delta-analysis[bot]

This comment was marked as outdated.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
crates/sempai-yaml/src/parser/mod.rs (1)

134-140: ⚠️ Potential issue | 🟠 Major

Stop routing unknown modes through the search builder.

Line 140 skips principal-family validation for RuleMode::Other(_), but Line 196 still routes the rule into build_search_rule(). That makes forward-compatible modes impossible: a future non-search mode either fails with a misleading search-principal error or parses a search principal while ignoring other principal families. Reject unknown modes explicitly here or preserve their principal payload instead of assuming search.

Also applies to: 194-198

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/sempai-yaml/src/parser/mod.rs` around lines 134 - 140, The match arm
for RuleMode::Other(_) must not silently return Ok(()) and later be routed into
build_search_rule; either reject unknown modes explicitly or preserve their
principal payload instead of treating them as search. Change the
RuleMode::Other(_) arm in the match that currently calls
unexpected_for_search/…/unexpected_for_taint to return a clear error (e.g.,
UnknownRuleMode) or to carry the original principal data forward, and also
update the code path that calls build_search_rule so it never receives
RuleMode::Other(_) (ensure build_search_rule is only invoked for
RuleMode::Search); reference RuleMode::Other,
unexpected_for_search/extract/join/taint, and build_search_rule when making the
change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/sempai-yaml/src/tests/parser_tests/mode_tests.rs`:
- Around line 25-27: Replace the str.to_string() call with String::from(...) in
the assertion to satisfy the str_to_string Clippy lint: update the code that
constructs the LegacyFormula::Pattern (the string literal "source($X)" used
inside LegacyFormula::Pattern in the assert comparing extract.query()) to use
String::from("source($X)") instead of "source($X)".to_string().

---

Duplicate comments:
In `@crates/sempai-yaml/src/parser/mod.rs`:
- Around line 134-140: The match arm for RuleMode::Other(_) must not silently
return Ok(()) and later be routed into build_search_rule; either reject unknown
modes explicitly or preserve their principal payload instead of treating them as
search. Change the RuleMode::Other(_) arm in the match that currently calls
unexpected_for_search/…/unexpected_for_taint to return a clear error (e.g.,
UnknownRuleMode) or to carry the original principal data forward, and also
update the code path that calls build_search_rule so it never receives
RuleMode::Other(_) (ensure build_search_rule is only invoked for
RuleMode::Search); reference RuleMode::Other,
unexpected_for_search/extract/join/taint, and build_search_rule when making the
change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0eddd4e7-63c9-4a5b-bd2a-08186d2b00d1

📥 Commits

Reviewing files that changed from the base of the PR and between a8ded9e and 3aaaf13.

📒 Files selected for processing (5)
  • crates/sempai-yaml/src/parser/mod.rs
  • crates/sempai-yaml/src/tests/parser_tests/legacy_tests.rs
  • crates/sempai-yaml/src/tests/parser_tests/match_tests.rs
  • crates/sempai-yaml/src/tests/parser_tests/mod.rs
  • crates/sempai-yaml/src/tests/parser_tests/mode_tests.rs

Comment thread crates/sempai-yaml/src/tests/parser_tests/mode_tests.rs Outdated
@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Mar 27, 2026

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix. Include the file and symbol names indicated in the issue at the head of your response.

crates/sempai-yaml/src/parser/mod.rs

Comment on file

fn unexpected_for_search(raw: &RawRule) -> Vec<&'static str> {
    collect_unexpected(
        raw,
        &[
            (has_extract_fields, "`extract` or `dest-language`"),
            (has_join_fields, "`join`"),
            (has_taint_fields, "`taint` or legacy taint fields"),
        ],
    )
}

❌ New issue: Code Duplication
The module contains 4 functions with similar structure: unexpected_for_extract,unexpected_for_join,unexpected_for_search,unexpected_for_taint

@coderabbitai

This comment was marked as resolved.

Changed usage from `"source($X)".to_string()` to `String::from("source($X)")` in test for consistency and clarity.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
codescene-delta-analysis[bot]

This comment was marked as outdated.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/sempai-yaml/src/tests/parser_tests/mode_tests.rs`:
- Around line 50-53: The tests call first_err_diagnostic(yaml) but discard the
captured span with "_" allowing span-mapping regressions; update the calls
(e.g., the one using first_err_diagnostic and the one around lines 162-167) to
capture the span boolean (e.g., let (code, message, has_span) =
first_err_diagnostic(yaml)) and then assert that has_span is true for
schema-invalid diagnostics where a span is expected, while keeping the existing
assertions for DiagnosticCode::ESempaiSchemaInvalid and message contents intact.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9c8518a6-6197-45c6-90ac-d1dc267efc92

📥 Commits

Reviewing files that changed from the base of the PR and between 3aaaf13 and caa066e.

📒 Files selected for processing (1)
  • crates/sempai-yaml/src/tests/parser_tests/mode_tests.rs

Comment thread crates/sempai-yaml/src/tests/parser_tests/mode_tests.rs Outdated
…idate_principal_family

Removed separate functions for collecting unexpected principal fields per rule mode and instead inlined the calls to `collect_unexpected` directly in the `validate_principal_family` match arms. This simplifies the code by reducing indirection and unused helpers while maintaining the same validation logic.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
codescene-delta-analysis[bot]

This comment was marked as outdated.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
crates/sempai-yaml/src/parser/mod.rs (1)

128-130: ⚠️ Potential issue | 🟠 Major

Stop treating RuleMode::Other(_) as search.

Lines 128-129 deliberately skip principal-family validation for unknown modes, but Lines 185-186 still pass them to build_search_rule. That makes the parser accept only unknown modes that happen to look like search rules and misreports everything else as a search-schema failure. Fail fast with an explicit unsupported-mode diagnostic until the model can retain an opaque principal.

🔧 Proposed fix
-        // Unknown modes are preserved for forward compatibility and treated as search-like
-        RuleMode::Search | RuleMode::Other(_) => {
-            build_search_rule(&raw, rule_span.clone(), source_map)?
-        }
+        RuleMode::Search => build_search_rule(&raw, rule_span.clone(), source_map)?,
+        RuleMode::Other(other) => {
+            return Err(schema_error(
+                format!("unsupported rule mode `{other}`"),
+                rule_span.clone(),
+                "use `search`, `extract`, `join`, or `taint`",
+            ));
+        }

Also applies to: 183-187

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/sempai-yaml/src/parser/mod.rs` around lines 128 - 130, The parser
currently treats RuleMode::Other(_) as if it were a search rule by skipping
validation (returning Ok(())) but still later passes unknown modes into
build_search_rule, causing misreported search-schema errors; change the handling
of RuleMode::Other(_) so it returns an explicit unsupported-mode
error/diagnostic (i.e., return Err with a clear UnsupportedMode/unsupported_mode
diagnostic) instead of Ok(()), and ensure any code that calls build_search_rule
only does so for known search modes (leave build_search_rule calls unaffected
but gate them behind the known RuleMode variant check so RuleMode::Other(_)
never reaches build_search_rule).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@crates/sempai-yaml/src/parser/mod.rs`:
- Around line 128-130: The parser currently treats RuleMode::Other(_) as if it
were a search rule by skipping validation (returning Ok(())) but still later
passes unknown modes into build_search_rule, causing misreported search-schema
errors; change the handling of RuleMode::Other(_) so it returns an explicit
unsupported-mode error/diagnostic (i.e., return Err with a clear
UnsupportedMode/unsupported_mode diagnostic) instead of Ok(()), and ensure any
code that calls build_search_rule only does so for known search modes (leave
build_search_rule calls unaffected but gate them behind the known RuleMode
variant check so RuleMode::Other(_) never reaches build_search_rule).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 99330207-07bf-4c6d-bcec-acbbaa55ecad

📥 Commits

Reviewing files that changed from the base of the PR and between caa066e and c084f7f.

📒 Files selected for processing (1)
  • crates/sempai-yaml/src/parser/mod.rs

Add assertions in parser mode tests to check that schema errors
include a primary span, ensuring error diagnostics are complete
with location information.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
codescene-delta-analysis[bot]

This comment was marked as outdated.

@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Mar 27, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 27, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Mar 27, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 27, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/sempai-yaml/src/tests/parser_tests/mode_tests.rs`:
- Around line 133-146: The test case named join_with_match in mode_tests.rs is
missing a languages header which lets header validation fail before the intended
cross-mode principal assertion; update the YAML string in the
#[case::join_with_match(...)] fixture to include a languages field (e.g. add a
"languages: []" or appropriate languages list under the rules header) so the
parser reaches the join-vs-match principal error and the test asserts the
expected message "Join mode rule contains unexpected principal fields: `match`
or legacy search keys".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 86b30187-2250-4afa-a23b-c532741cad58

📥 Commits

Reviewing files that changed from the base of the PR and between c084f7f and c9af586.

📒 Files selected for processing (1)
  • crates/sempai-yaml/src/tests/parser_tests/mode_tests.rs

Comment thread crates/sempai-yaml/src/tests/parser_tests/mode_tests.rs
Add the 'languages: [python]' field to a rule in parser_tests/mode_tests.rs to enhance test coverage for language-specific rule parsing.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
codescene-delta-analysis[bot]

This comment was marked as outdated.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/sempai-yaml/src/tests/parser_tests/mode_tests.rs`:
- Around line 118-132: The test case named extract_with_join is missing the
severity header so header validation can fail first; update the YAML fixture
string in the #[case::extract_with_join(...)] input to include a severity field
(e.g., add "severity: WARNING\n" alongside the other top-level rule fields) so
the test still triggers the intended "Extract mode rule contains unexpected
principal fields: `join`" diagnostic rather than a missing-severity error.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 22bd0cbe-6b3a-4502-98f5-a36cedc930d7

📥 Commits

Reviewing files that changed from the base of the PR and between c9af586 and da053a1.

📒 Files selected for processing (1)
  • crates/sempai-yaml/src/tests/parser_tests/mode_tests.rs

Comment thread crates/sempai-yaml/src/tests/parser_tests/mode_tests.rs
Include the 'severity: WARNING' field in the extract mode test rule to ensure proper parsing and handling of severity attributes in mode parsing tests.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
@leynos leynos merged commit 57f50f0 into main Mar 28, 2026
11 checks passed
@leynos leynos deleted the implement-yaml-rule-parsing-yt0shv branch March 28, 2026 22:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant