From a8a5ed2673e464f27e6b9eadb1e3caa8d36e3ca1 Mon Sep 17 00:00:00 2001 From: Matt Van Horn <455140+mvanhorn@users.noreply.github.com> Date: Tue, 28 Apr 2026 10:11:26 -0700 Subject: [PATCH 1/2] feat(ce-code-review): add ce-codex-reviewer for cross-model validation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add a conditional review persona that delegates to OpenAI's Codex CLI for an independent second-opinion read on the diff. Codex catches blind spots same-model reviewers share — different model, different reasoning patterns, different recall. Reapplies PR #356 onto current main after the v3 restructure (ce-review -> ce-code-review, agents// -> agents/ce-*.agent.md, ce: -> ce-, new subagent template that passes pre-computed diff). Addresses both Codex findings still active on the prior commit: - P1 (line 54): "Stop assuming ce:review passes base branch context." The new subagent template passes the pre-computed diff directly; ce-codex-reviewer materializes that diff to a tempfile and feeds it to codex. No base-branch resolution, no git symbolic-ref or rev-parse, no fall-back chain to break on non-default targets. - P2 (line 123): "Remove cross-persona suppression rule." The agent no longer instructs codex to suppress findings already covered by other personas. Synthesis dedupes centrally in Stage 5; the parallel subagent has no visibility into other reviewers' outputs anyway. Pieces: - agents/ce-codex-reviewer.agent.md — new persona, anchored confidence rubric (50/75; 100 never emitted because second-opinion is not direct verification), two-phase environment guard (CODEX_SANDBOX/SESSION_ID recursion check + which codex availability check), pipe-delimited output translation into the structured findings schema. - skills/ce-code-review/references/persona-catalog.md — codex registered as conditional persona #9. Selected when the diff is non-trivial (>=25 changed non-test lines) or correctness-sensitive (auth, payments, data, parsing, concurrency, external API contracts) AND the user-level codex CLI is on PATH. - skills/ce-code-review/SKILL.md — codex added to the cross-cutting conditional dispatch table; reviewer count bumped 18 -> 19. - plugins/compound-engineering/README.md — codex listed in the Review agents table (insertion in sorted order). - tests/review-skill-contract.test.ts — codex added to the anchored- rubric persona loop, plus four new contract tests guarding (a) frontmatter shape, (b) the no-base-branch-resolution rule with explicit P1 regression callouts, (c) the no-cross-persona-suppression rule with explicit P2 regression callout, (d) catalog/SKILL.md registration, and (e) the two environment guards before any codex invocation. Verification: - bun test tests/review-skill-contract.test.ts: 28 pass (was 24) - bun test full suite: 992 pass, 9 pre-existing failures in detect-project-type.sh unchanged (present on main before this PR) - bun run release:validate: clean (52 agents, 35 skills, 0 MCP) No release-owned files touched: plugin.json, marketplace.json, and CHANGELOG.md remain untouched per AGENTS.md rules. --- plugins/compound-engineering/README.md | 1 + .../agents/ce-codex-reviewer.agent.md | 106 ++++++++++++++++++ .../skills/ce-code-review/SKILL.md | 3 +- .../references/persona-catalog.md | 5 +- tests/review-skill-contract.test.ts | 66 +++++++++++ 5 files changed, 178 insertions(+), 3 deletions(-) create mode 100644 plugins/compound-engineering/agents/ce-codex-reviewer.agent.md diff --git a/plugins/compound-engineering/README.md b/plugins/compound-engineering/README.md index e5cd59a88..84cef097f 100644 --- a/plugins/compound-engineering/README.md +++ b/plugins/compound-engineering/README.md @@ -109,6 +109,7 @@ Agents are specialized subagents invoked by skills — you typically don't call | `ce-cli-readiness-reviewer` | CLI agent-readiness persona for ce-code-review (conditional, structured JSON) | | `ce-architecture-strategist` | Analyze architectural decisions and compliance | | `ce-code-simplicity-reviewer` | Final pass for simplicity and minimalism | +| `ce-codex-reviewer` | Cross-model independent validation via OpenAI Codex CLI (conditional, structured JSON) | | `ce-correctness-reviewer` | Logic errors, edge cases, state bugs | | `ce-data-integrity-guardian` | Database migrations and data integrity | | `ce-data-migration-expert` | Validate ID mappings match production, check for swapped values | diff --git a/plugins/compound-engineering/agents/ce-codex-reviewer.agent.md b/plugins/compound-engineering/agents/ce-codex-reviewer.agent.md new file mode 100644 index 000000000..edfac5ab8 --- /dev/null +++ b/plugins/compound-engineering/agents/ce-codex-reviewer.agent.md @@ -0,0 +1,106 @@ +--- +name: ce-codex-reviewer +description: Conditional code-review persona that delegates review to OpenAI Codex CLI for cross-model validation. Spawned by ce-code-review when independent second-opinion review of a diff would catch model-shared blind spots. +model: inherit +tools: Read, Grep, Glob, Bash +color: orange +--- + +# Codex Reviewer (Cross-Model Validation) + +You bridge the ce-code-review pipeline to OpenAI's Codex CLI for an independent second opinion on the diff. Your value is catching blind spots that same-model reviewers share — codex sees the same diff through a different model's reasoning patterns. + +## Step 1: Environment guards + +Two guards run sequentially, both fail-closed. + +**Guard A — already inside Codex.** Recursing into codex from within codex breaks or hangs. If `CODEX_SANDBOX` or `CODEX_SESSION_ID` is set in the environment, return the empty findings JSON below with `residual_risks: ["codex-reviewer skipped: already running inside Codex sandbox"]` and stop. + +```bash +echo "CODEX_SANDBOX=${CODEX_SANDBOX:-unset} CODEX_SESSION_ID=${CODEX_SESSION_ID:-unset}" +``` + +**Guard B — codex CLI not installed.** If `codex` is not on PATH, return the empty findings JSON with `residual_risks: ["codex-reviewer skipped: codex CLI not installed (https://openai.com/codex)"]` and stop. Do not suppress errors from the lookup — let stderr surface. + +```bash +which codex +``` + +## Step 2: Materialize the diff + +The orchestrator passes the pre-computed diff in the `` block as `{diff}`. You do not compute it yourself, do not resolve a base branch, and do not call `git diff`. The diff is the authoritative review surface. + +Write the diff from your context to a tempfile so codex can read it: + +```bash +DIFF_FILE=$(mktemp -t codex-review-XXXXXX.patch) +``` + +Then write the literal diff content from `` into `$DIFF_FILE` using a heredoc with a unique sentinel that does not appear in the diff (e.g., `___CODEX_DIFF_END___`). + +If the diff is empty (no changes to review), return the empty findings JSON with `residual_risks: ["codex-reviewer skipped: empty diff"]` and stop. + +## Step 3: Run codex on the diff + +Invoke codex in non-interactive mode. Prefer the project's configured default model — do not pin `-m` or `-c reasoning` flags here so the user's `~/.codex/config.toml` settings apply. + +```bash +codex exec --sandbox read-only --skip-git-repo-check "Review the unified diff in $DIFF_FILE for correctness, security, reliability, and contract issues. Output one finding per issue in this exact format on separate lines: + +SEVERITY|FILE|LINE|TITLE|EVIDENCE + +Where SEVERITY is P0/P1/P2/P3, FILE is the path from the diff, LINE is the line number (or 0 if file-level), TITLE is a short imperative sentence, EVIDENCE is the specific code snippet that supports the finding. If you find no issues, output exactly: NO_FINDINGS. + +Do not output prose. Do not summarize. Do not output anything other than the pipe-delimited lines or NO_FINDINGS." +``` + +Capture stdout. Clean up the tempfile: `rm -f "$DIFF_FILE"`. + +If codex exits non-zero, return the empty findings JSON with `residual_risks: ["codex review failed: "]` and stop. Do not retry. + +## Step 4: Translate codex output into findings + +Parse the pipe-delimited lines from codex's stdout. Skip any line that does not have exactly five `|`-separated fields. Skip the literal `NO_FINDINGS` token. + +For each parsed line, build a finding object that conforms to the findings schema: + +- **`title`**: the TITLE field from codex. +- **`severity`**: the SEVERITY field, one of `"P0"`, `"P1"`, `"P2"`, `"P3"`. Reject anything else; if codex emitted a different vocabulary, drop the line silently. +- **`file`**: the FILE field. Verify the path appears in the diff's `Changed files` list from ``; drop the finding if not. +- **`line`**: the LINE field as an integer (0 means file-level). +- **`evidence`**: an array containing the EVIDENCE field as one element. Always wrap in an array — a bare string is a schema violation. +- **`why_it_matters`**: write 2-4 sentences explaining the observable consequence of the issue, grounded in the EVIDENCE codex provided. Lead with what a user, caller, or operator experiences. If you cannot articulate a concrete consequence from the codex output, drop the finding — the schema's why_it_matters bar is non-negotiable. +- **`autofix_class`**: default `"manual"`. Cross-model findings carry interpretive uncertainty; the orchestrator's synthesis re-classifies during the merge step. +- **`owner`**: default `"downstream-resolver"`. +- **`requires_verification`**: `true`. +- **`pre_existing`**: `false` unless the diff text shows the cited line was unchanged (a `-` or context line, not a `+`). +- **`suggested_fix`**: include only when codex's output named a concrete change. Do not invent fixes. +- **`confidence`**: use the anchored rubric per the subagent template. Codex outputs are second-opinion, so emit conservatively: + - **Anchor 50** — default for any codex finding you cannot independently verify against the diff and surrounding code. The orchestrator routes 50 to soft buckets unless the severity is P0. + - **Anchor 75** — codex named a specific file and line AND you can articulate the concrete observable consequence (the standard `75` bar: a wrong result, an unhandled error path, a contract mismatch, or missing coverage that a real test scenario would surface). + - **Anchor 100** — never emit. Codex is a second opinion, not direct verification of the kind that justifies absolute certainty. + - **Anchor 25 or below — suppress.** Drop the finding silently. The subagent template suppresses anchors `0` and `25` automatically; this persona enforces the same floor at translation time so noise does not enter synthesis. + +## Step 5: Output + +Return the findings as JSON matching the contract in the subagent template. Honor the artifact-file write if a Run ID is present: write the full analysis to `/tmp/compound-engineering/ce-code-review/{run_id}/ce-codex-reviewer.json` and return the compact form to the parent. + +The `reviewer` field at the top of your output is `"codex"` (matches the persona-catalog entry). + +Empty-findings response shape (used by all early-return paths in Steps 1-3): + +```json +{ + "reviewer": "codex", + "findings": [], + "residual_risks": [""], + "testing_gaps": [] +} +``` + +## What you do not do + +- **Do not resolve a base branch or recompute the diff.** The orchestrator's diff is authoritative. Calling `git diff`, `git symbolic-ref`, or `git rev-parse` here was the failure mode flagged by Codex on PR #356 (P1 finding "Stop assuming ce:review passes base branch context"); the new subagent template provides the diff directly so the question does not arise. +- **Do not suppress findings as "already covered by other personas."** You run as an independent parallel subagent and have no visibility into other reviewers' outputs. Synthesis dedupes findings centrally in Stage 5; suppressing here forces guesswork and silently drops valid issues. +- **Do not emit findings about style, formatting, or linter-domain concerns.** Codex sometimes produces these; filter them out at translation time. +- **Do not retry codex on failure.** A single non-zero exit returns the empty-findings JSON with the failure reason in `residual_risks` and stops. diff --git a/plugins/compound-engineering/skills/ce-code-review/SKILL.md b/plugins/compound-engineering/skills/ce-code-review/SKILL.md index ebef22576..0fe6716d7 100644 --- a/plugins/compound-engineering/skills/ce-code-review/SKILL.md +++ b/plugins/compound-engineering/skills/ce-code-review/SKILL.md @@ -106,7 +106,7 @@ Routing rules: ## Reviewers -18 reviewer personas in layered conditionals, plus CE-specific agents. See the persona catalog included below for the full catalog. +19 reviewer personas in layered conditionals, plus CE-specific agents. See the persona catalog included below for the full catalog. **Always-on (every review):** @@ -131,6 +131,7 @@ Routing rules: | `ce-adversarial-reviewer` | Diff >=50 changed non-test/non-generated/non-lockfile lines, or auth, payments, data mutations, external APIs | | `ce-cli-readiness-reviewer` | CLI command definitions, argument parsing, CLI framework usage, command handler implementations | | `ce-previous-comments-reviewer` | Reviewing a PR that has existing review comments or threads | +| `ce-codex-reviewer` | Cross-model independent validation when the diff is non-trivial AND the user-level Codex CLI is available on PATH (>=25 changed non-test lines, or correctness-sensitive domains: auth, payments, data, parsing, concurrency, external API contracts) | **Stack-specific conditional (selected per diff):** diff --git a/plugins/compound-engineering/skills/ce-code-review/references/persona-catalog.md b/plugins/compound-engineering/skills/ce-code-review/references/persona-catalog.md index 15dff7128..736a8b7df 100644 --- a/plugins/compound-engineering/skills/ce-code-review/references/persona-catalog.md +++ b/plugins/compound-engineering/skills/ce-code-review/references/persona-catalog.md @@ -1,6 +1,6 @@ # Persona Catalog -18 reviewer personas organized into always-on, cross-cutting conditional, and stack-specific conditional layers, plus CE-specific agents. The orchestrator uses this catalog to select which reviewers to spawn for each review. +19 reviewer personas organized into always-on, cross-cutting conditional, and stack-specific conditional layers, plus CE-specific agents. The orchestrator uses this catalog to select which reviewers to spawn for each review. ## Always-on (4 personas + 2 CE agents) @@ -22,7 +22,7 @@ Spawned on every review regardless of diff content. | `ce-agent-native-reviewer` | Verify new features are agent-accessible | | `ce-learnings-researcher` | Search docs/solutions/ for past issues related to this PR's modules and patterns | -## Conditional (8 personas) +## Conditional (9 personas) Spawned when the orchestrator identifies relevant patterns in the diff. The orchestrator reads the full diff and reasons about selection -- this is agent judgment, not keyword matching. @@ -36,6 +36,7 @@ Spawned when the orchestrator identifies relevant patterns in the diff. The orch | `adversarial` | `ce-adversarial-reviewer` | Diff has >=50 changed non-test, non-generated, non-lockfile lines, OR touches auth, payments, data mutations, external API integrations, or other high-risk domains | | `cli-readiness` | `ce-cli-readiness-reviewer` | CLI command definitions, argument parsing, CLI framework usage, command handler implementations | | `previous-comments` | `ce-previous-comments-reviewer` | **PR-only.** Reviewing a PR that has existing review comments or review threads from prior review rounds. Skip entirely when no PR metadata was gathered in Stage 1. | +| `codex` | `ce-codex-reviewer` | Cross-model independent validation when the diff is non-trivial (>= 25 changed non-test/non-generated lines, or touches correctness-sensitive domains: auth, payments, data mutations, parsing, concurrency, external API contracts) AND the user-level Codex CLI is available on PATH. Codex sees the same diff through a different model's reasoning and catches blind spots same-model reviewers share. Skip silently when codex is not installed (the agent handles its own environment guard). | ## Stack-Specific Conditional (6 personas) diff --git a/tests/review-skill-contract.test.ts b/tests/review-skill-contract.test.ts index 8be83fb37..01ca6e27a 100644 --- a/tests/review-skill-contract.test.ts +++ b/tests/review-skill-contract.test.ts @@ -492,6 +492,7 @@ describe("ce-code-review contract", () => { "ce-julik-frontend-races-reviewer", "ce-swift-ios-reviewer", "ce-agent-native-reviewer", + "ce-codex-reviewer", ] for (const persona of personas) { @@ -696,6 +697,71 @@ describe("ce-code-review contract", () => { }) }) +describe("codex-reviewer contract", () => { + test("agent file exists with required frontmatter", async () => { + const content = await readRepoFile( + "plugins/compound-engineering/agents/ce-codex-reviewer.agent.md", + ) + const { data } = parseFrontmatter(content) + expect(data.name).toBe("ce-codex-reviewer") + expect(typeof data.description).toBe("string") + expect(data.tools).toContain("Bash") + }) + + test("uses pre-computed diff from review context, not its own base-branch resolution", async () => { + const content = await readRepoFile( + "plugins/compound-engineering/agents/ce-codex-reviewer.agent.md", + ) + // Addresses Codex P1 finding from PR #356 (March 2026): the agent must + // not assume the orchestrator passes a base branch. The new subagent + // template passes the pre-computed diff directly; the agent uses that. + expect(content).toContain("pre-computed diff") + expect(content).toMatch(/do not.*resolve.*base branch/i) + // Negative checks: must not call git diff / symbolic-ref / rev-parse + // for base-branch resolution at the top level of the agent's flow. + expect(content).not.toMatch(/git symbolic-ref refs\/remotes\/origin\/HEAD/) + expect(content).not.toMatch(/git rev-parse --verify.*main/) + }) + + test("does not suppress findings as 'already covered by other personas'", async () => { + const content = await readRepoFile( + "plugins/compound-engineering/agents/ce-codex-reviewer.agent.md", + ) + // Addresses Codex P2 finding from PR #356 (March 2026): the agent runs + // as an independent parallel subagent and has no visibility into other + // reviewers' outputs. Synthesis dedupes centrally in Stage 5. + expect(content).toMatch(/synthesis dedupes findings centrally/i) + expect(content).not.toMatch(/Findings already covered by other personas/i) + }) + + test("registered in persona-catalog and SKILL.md dispatch table", async () => { + const catalog = await readRepoFile( + "plugins/compound-engineering/skills/ce-code-review/references/persona-catalog.md", + ) + expect(catalog).toContain("`codex`") + expect(catalog).toContain("`ce-codex-reviewer`") + + const skill = await readRepoFile( + "plugins/compound-engineering/skills/ce-code-review/SKILL.md", + ) + expect(skill).toContain("`ce-codex-reviewer`") + }) + + test("environment guards run before any codex invocation", async () => { + const content = await readRepoFile( + "plugins/compound-engineering/agents/ce-codex-reviewer.agent.md", + ) + // Two guards required by the design: already-inside-codex sandbox detection + // (CODEX_SANDBOX / CODEX_SESSION_ID) and codex CLI availability check. + expect(content).toContain("CODEX_SANDBOX") + expect(content).toContain("CODEX_SESSION_ID") + expect(content).toMatch(/which codex/) + // Both guards must fail closed with the empty-findings JSON. + expect(content).toContain("codex-reviewer skipped: already running inside Codex sandbox") + expect(content).toContain("codex-reviewer skipped: codex CLI not installed") + }) +}) + describe("testing-reviewer contract", () => { test("includes behavioral-changes-with-no-test-additions check", async () => { const content = await readRepoFile("plugins/compound-engineering/agents/ce-testing-reviewer.agent.md") From adba28a8f4a5e4b98de71c399197878d15e896cd Mon Sep 17 00:00:00 2001 From: Matt Van Horn <455140+mvanhorn@users.noreply.github.com> Date: Tue, 28 Apr 2026 10:39:37 -0700 Subject: [PATCH 2/2] fix(ce-codex-reviewer): NDJSON output and schema-valid line numbers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses both Codex findings on a8a5ed2 (April 2026): P1 — Emit schema-valid line numbers for file-level findings. The findings schema requires `line >= 1` (verified in references/findings-schema.json). The previous prompt told codex to emit `LINE=0` for file-level issues, which would have been dropped by the merge validator as malformed before synthesis — silently losing every file-level Codex finding. Fix: the codex prompt now requires `line` to be a positive integer and adds a `file_level` boolean. When `file_level: true`, codex sets `line: 1` and the agent prepends a "file-level finding (no specific line applies)" string to the evidence array so synthesis and downstream surfaces can still distinguish "line 1 was the issue" from "this is a whole-file concern." The `file_level` signal is preserved in evidence rather than as a separate field because the findings schema doesn't expose a top-level file_level flag, and inventing one would fail the strict-validator path. P2 — Use structured output so evidence cannot break parsing. The previous pipe-delimited contract dropped any row that wasn't exactly five `|`-separated fields. EVIDENCE is a raw code snippet that can legitimately contain `|` (bitwise OR / union types, shell pipes, markdown tables, regex alternation). When that happened, valid findings disappeared silently. Fix: switch the codex prompt to NDJSON — one JSON object per line. The agent JSON-parses each line independently; embedded pipes in evidence are no longer a parsing hazard because JSON quoting handles them. Lines that fail to parse are skipped (no retry, no inference). Test coverage: - New contract test "uses NDJSON output contract so evidence can carry pipes safely" guards the prompt format and asserts the pipe-delimited shape stays gone. - New contract test "emits schema-valid line numbers for file-level findings" guards the line=1 + file_level=true convention, asserts the old "0 means file-level" wording is gone, and reads the findings schema to verify minimum=1 still holds (so a future schema relaxation triggers a deliberate test update rather than a silent drift). Both tests cite their originating Codex finding inline so future regressions get a clear pointer to PR #356. bun test tests/review-skill-contract.test.ts: 30 pass (was 28) bun run release:validate: clean --- .../agents/ce-codex-reviewer.agent.md | 28 ++++++++----- tests/review-skill-contract.test.ts | 40 +++++++++++++++++++ 2 files changed, 57 insertions(+), 11 deletions(-) diff --git a/plugins/compound-engineering/agents/ce-codex-reviewer.agent.md b/plugins/compound-engineering/agents/ce-codex-reviewer.agent.md index edfac5ab8..61db744e1 100644 --- a/plugins/compound-engineering/agents/ce-codex-reviewer.agent.md +++ b/plugins/compound-engineering/agents/ce-codex-reviewer.agent.md @@ -45,13 +45,19 @@ If the diff is empty (no changes to review), return the empty findings JSON with Invoke codex in non-interactive mode. Prefer the project's configured default model — do not pin `-m` or `-c reasoning` flags here so the user's `~/.codex/config.toml` settings apply. ```bash -codex exec --sandbox read-only --skip-git-repo-check "Review the unified diff in $DIFF_FILE for correctness, security, reliability, and contract issues. Output one finding per issue in this exact format on separate lines: +codex exec --sandbox read-only --skip-git-repo-check "Review the unified diff in $DIFF_FILE for correctness, security, reliability, and contract issues. Output one JSON object per line (NDJSON), with these fields per object: -SEVERITY|FILE|LINE|TITLE|EVIDENCE + {\"severity\": \"P0|P1|P2|P3\", \"file\": \"path/from/diff\", \"line\": , \"file_level\": , \"title\": \"short imperative sentence\", \"evidence\": \"specific code snippet that supports the finding\"} -Where SEVERITY is P0/P1/P2/P3, FILE is the path from the diff, LINE is the line number (or 0 if file-level), TITLE is a short imperative sentence, EVIDENCE is the specific code snippet that supports the finding. If you find no issues, output exactly: NO_FINDINGS. +Rules for each field: +- severity: exactly one of \"P0\", \"P1\", \"P2\", \"P3\". +- file: a path from the diff's Changed files list. No quoting, no expansion. +- line: a positive integer (1 or higher). For file-level findings (no specific line), use 1 and set file_level to true. +- file_level: true when the issue is whole-file and no specific line applies; false when the line number is precise. +- title: a short imperative sentence describing the issue. +- evidence: a string with the exact code snippet, quote, or line content that supports the finding. May contain any characters including pipes, JSON-escaped per JSON rules. -Do not output prose. Do not summarize. Do not output anything other than the pipe-delimited lines or NO_FINDINGS." +Output one JSON object per line. Do not wrap them in an array. Do not output prose, headers, or commentary. If you find no issues, output exactly the literal token: NO_FINDINGS" ``` Capture stdout. Clean up the tempfile: `rm -f "$DIFF_FILE"`. @@ -60,15 +66,15 @@ If codex exits non-zero, return the empty findings JSON with `residual_risks: [" ## Step 4: Translate codex output into findings -Parse the pipe-delimited lines from codex's stdout. Skip any line that does not have exactly five `|`-separated fields. Skip the literal `NO_FINDINGS` token. +Parse codex's stdout line-by-line as NDJSON. Skip blank lines. Skip the literal `NO_FINDINGS` token. For each remaining line, attempt to JSON-parse it; if parsing fails, skip the line (do not retry, do not infer). -For each parsed line, build a finding object that conforms to the findings schema: +For each successfully-parsed object, build a finding object that conforms to the findings schema: -- **`title`**: the TITLE field from codex. -- **`severity`**: the SEVERITY field, one of `"P0"`, `"P1"`, `"P2"`, `"P3"`. Reject anything else; if codex emitted a different vocabulary, drop the line silently. -- **`file`**: the FILE field. Verify the path appears in the diff's `Changed files` list from ``; drop the finding if not. -- **`line`**: the LINE field as an integer (0 means file-level). -- **`evidence`**: an array containing the EVIDENCE field as one element. Always wrap in an array — a bare string is a schema violation. +- **`title`**: the `title` field from the codex output. +- **`severity`**: the `severity` field, one of `"P0"`, `"P1"`, `"P2"`, `"P3"`. Reject anything else; if codex emitted a different vocabulary, drop the object silently. +- **`file`**: the `file` field. Verify the path appears in the diff's `Changed files` list from ``; drop the finding if not. +- **`line`**: the `line` field as an integer. The findings schema requires `line >= 1`, so reject any object with `line < 1` outright. When codex set `file_level: true`, line will already be `1` per the contract above; the agent does not re-default `0` to `1` because the contract refuses `0`. The `file_level` signal is preserved in `evidence` (see below) so synthesis and downstream surfaces can distinguish "line 1 was the issue" from "this is a file-level concern." +- **`evidence`**: an array. Always include the `evidence` string from codex as the first element. When `file_level: true`, prepend an additional element: `"file-level finding (no specific line applies)"`. The schema requires evidence to be a non-empty array of strings; never emit a bare string. - **`why_it_matters`**: write 2-4 sentences explaining the observable consequence of the issue, grounded in the EVIDENCE codex provided. Lead with what a user, caller, or operator experiences. If you cannot articulate a concrete consequence from the codex output, drop the finding — the schema's why_it_matters bar is non-negotiable. - **`autofix_class`**: default `"manual"`. Cross-model findings carry interpretive uncertainty; the orchestrator's synthesis re-classifies during the merge step. - **`owner`**: default `"downstream-resolver"`. diff --git a/tests/review-skill-contract.test.ts b/tests/review-skill-contract.test.ts index 01ca6e27a..248719eb0 100644 --- a/tests/review-skill-contract.test.ts +++ b/tests/review-skill-contract.test.ts @@ -760,6 +760,46 @@ describe("codex-reviewer contract", () => { expect(content).toContain("codex-reviewer skipped: already running inside Codex sandbox") expect(content).toContain("codex-reviewer skipped: codex CLI not installed") }) + + test("uses NDJSON output contract so evidence can carry pipes safely", async () => { + const content = await readRepoFile( + "plugins/compound-engineering/agents/ce-codex-reviewer.agent.md", + ) + // Addresses Codex P2 finding on commit a8a5ed2 (April 2026): pipe-delimited + // output dropped any row that wasn't exactly five `|`-separated fields, + // so EVIDENCE containing a literal `|` (bitwise OR, shell pipes, markdown + // tables) silently lost real findings. NDJSON is robust to embedded pipes. + expect(content).toContain("NDJSON") + expect(content).toMatch(/JSON-?parse/i) + // Pipe-delimited contract must be gone from the agent's prose. + expect(content).not.toMatch(/SEVERITY\|FILE\|LINE\|TITLE\|EVIDENCE/) + expect(content).not.toMatch(/exactly five `\|`-separated fields/) + }) + + test("emits schema-valid line numbers for file-level findings", async () => { + const content = await readRepoFile( + "plugins/compound-engineering/agents/ce-codex-reviewer.agent.md", + ) + // Addresses Codex P1 finding on commit a8a5ed2 (April 2026): the findings + // schema requires line >= 1, so emitting line=0 for file-level findings + // would silently drop them at the merge validator. Contract says: line=1 + // with file_level=true, evidence array prepends the file-level annotation. + expect(content).toContain("file_level") + expect(content).toMatch(/positive integer/i) + expect(content).toMatch(/file-level finding \(no specific line applies\)/) + // Negative check: the old "0 means file-level" wording must be gone. + expect(content).not.toMatch(/0 means file-level/) + expect(content).not.toMatch(/or 0 if file-level/) + + // Also assert the schema constraint is what we think it is — if upstream + // ever loosens line to allow 0, this test should be updated to match. + const schemaJson = await readRepoFile( + "plugins/compound-engineering/skills/ce-code-review/references/findings-schema.json", + ) + const schema = JSON.parse(schemaJson) + const lineMin = schema.properties?.findings?.items?.properties?.line?.minimum + expect(lineMin).toBe(1) + }) }) describe("testing-reviewer contract", () => {