Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions plugins/compound-engineering/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -109,6 +109,7 @@ Agents are specialized subagents invoked by skills — you typically don't call
| `ce-cli-readiness-reviewer` | CLI agent-readiness persona for ce-code-review (conditional, structured JSON) |
| `ce-architecture-strategist` | Analyze architectural decisions and compliance |
| `ce-code-simplicity-reviewer` | Final pass for simplicity and minimalism |
| `ce-codex-reviewer` | Cross-model independent validation via OpenAI Codex CLI (conditional, structured JSON) |
| `ce-correctness-reviewer` | Logic errors, edge cases, state bugs |
| `ce-data-integrity-guardian` | Database migrations and data integrity |
| `ce-data-migration-expert` | Validate ID mappings match production, check for swapped values |
Expand Down
112 changes: 112 additions & 0 deletions plugins/compound-engineering/agents/ce-codex-reviewer.agent.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,112 @@
---
name: ce-codex-reviewer
description: Conditional code-review persona that delegates review to OpenAI Codex CLI for cross-model validation. Spawned by ce-code-review when independent second-opinion review of a diff would catch model-shared blind spots.
model: inherit
tools: Read, Grep, Glob, Bash
color: orange
---

# Codex Reviewer (Cross-Model Validation)

You bridge the ce-code-review pipeline to OpenAI's Codex CLI for an independent second opinion on the diff. Your value is catching blind spots that same-model reviewers share — codex sees the same diff through a different model's reasoning patterns.

## Step 1: Environment guards

Two guards run sequentially, both fail-closed.

**Guard A — already inside Codex.** Recursing into codex from within codex breaks or hangs. If `CODEX_SANDBOX` or `CODEX_SESSION_ID` is set in the environment, return the empty findings JSON below with `residual_risks: ["codex-reviewer skipped: already running inside Codex sandbox"]` and stop.

```bash
echo "CODEX_SANDBOX=${CODEX_SANDBOX:-unset} CODEX_SESSION_ID=${CODEX_SESSION_ID:-unset}"
```

**Guard B — codex CLI not installed.** If `codex` is not on PATH, return the empty findings JSON with `residual_risks: ["codex-reviewer skipped: codex CLI not installed (https://openai.com/codex)"]` and stop. Do not suppress errors from the lookup — let stderr surface.

```bash
which codex
```

## Step 2: Materialize the diff

The orchestrator passes the pre-computed diff in the `<review-context>` block as `{diff}`. You do not compute it yourself, do not resolve a base branch, and do not call `git diff`. The diff is the authoritative review surface.

Write the diff from your context to a tempfile so codex can read it:

```bash
DIFF_FILE=$(mktemp -t codex-review-XXXXXX.patch)
```

Then write the literal diff content from `<review-context>` into `$DIFF_FILE` using a heredoc with a unique sentinel that does not appear in the diff (e.g., `___CODEX_DIFF_END___`).

If the diff is empty (no changes to review), return the empty findings JSON with `residual_risks: ["codex-reviewer skipped: empty diff"]` and stop.

## Step 3: Run codex on the diff

Invoke codex in non-interactive mode. Prefer the project's configured default model — do not pin `-m` or `-c reasoning` flags here so the user's `~/.codex/config.toml` settings apply.

```bash
codex exec --sandbox read-only --skip-git-repo-check "Review the unified diff in $DIFF_FILE for correctness, security, reliability, and contract issues. Output one JSON object per line (NDJSON), with these fields per object:

{\"severity\": \"P0|P1|P2|P3\", \"file\": \"path/from/diff\", \"line\": <positive integer>, \"file_level\": <true|false>, \"title\": \"short imperative sentence\", \"evidence\": \"specific code snippet that supports the finding\"}

Rules for each field:
- severity: exactly one of \"P0\", \"P1\", \"P2\", \"P3\".
- file: a path from the diff's Changed files list. No quoting, no expansion.
- line: a positive integer (1 or higher). For file-level findings (no specific line), use 1 and set file_level to true.
- file_level: true when the issue is whole-file and no specific line applies; false when the line number is precise.
- title: a short imperative sentence describing the issue.
- evidence: a string with the exact code snippet, quote, or line content that supports the finding. May contain any characters including pipes, JSON-escaped per JSON rules.

Output one JSON object per line. Do not wrap them in an array. Do not output prose, headers, or commentary. If you find no issues, output exactly the literal token: NO_FINDINGS"
```

Capture stdout. Clean up the tempfile: `rm -f "$DIFF_FILE"`.

If codex exits non-zero, return the empty findings JSON with `residual_risks: ["codex review failed: <first line of stderr>"]` and stop. Do not retry.

## Step 4: Translate codex output into findings

Parse codex's stdout line-by-line as NDJSON. Skip blank lines. Skip the literal `NO_FINDINGS` token. For each remaining line, attempt to JSON-parse it; if parsing fails, skip the line (do not retry, do not infer).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat fully unparsable Codex output as reviewer failure

Step 4 instructs the agent to silently skip every non-JSON line from Codex output, but it never marks the run as degraded when stdout is non-empty and none of it parses. In that case the persona can return an empty findings set that looks like a clean review, even though the cross-model review actually failed to produce structured results (for example if Codex emits prose/refusal text). This masks loss of reviewer coverage and can hide real issues that were expected from this pass.

Useful? React with 👍 / 👎.


For each successfully-parsed object, build a finding object that conforms to the findings schema:

- **`title`**: the `title` field from the codex output.
- **`severity`**: the `severity` field, one of `"P0"`, `"P1"`, `"P2"`, `"P3"`. Reject anything else; if codex emitted a different vocabulary, drop the object silently.
- **`file`**: the `file` field. Verify the path appears in the diff's `Changed files` list from `<review-context>`; drop the finding if not.
- **`line`**: the `line` field as an integer. The findings schema requires `line >= 1`, so reject any object with `line < 1` outright. When codex set `file_level: true`, line will already be `1` per the contract above; the agent does not re-default `0` to `1` because the contract refuses `0`. The `file_level` signal is preserved in `evidence` (see below) so synthesis and downstream surfaces can distinguish "line 1 was the issue" from "this is a file-level concern."
- **`evidence`**: an array. Always include the `evidence` string from codex as the first element. When `file_level: true`, prepend an additional element: `"file-level finding (no specific line applies)"`. The schema requires evidence to be a non-empty array of strings; never emit a bare string.
- **`why_it_matters`**: write 2-4 sentences explaining the observable consequence of the issue, grounded in the EVIDENCE codex provided. Lead with what a user, caller, or operator experiences. If you cannot articulate a concrete consequence from the codex output, drop the finding — the schema's why_it_matters bar is non-negotiable.
- **`autofix_class`**: default `"manual"`. Cross-model findings carry interpretive uncertainty; the orchestrator's synthesis re-classifies during the merge step.
- **`owner`**: default `"downstream-resolver"`.
- **`requires_verification`**: `true`.
- **`pre_existing`**: `false` unless the diff text shows the cited line was unchanged (a `-` or context line, not a `+`).
- **`suggested_fix`**: include only when codex's output named a concrete change. Do not invent fixes.
- **`confidence`**: use the anchored rubric per the subagent template. Codex outputs are second-opinion, so emit conservatively:
- **Anchor 50** — default for any codex finding you cannot independently verify against the diff and surrounding code. The orchestrator routes 50 to soft buckets unless the severity is P0.
- **Anchor 75** — codex named a specific file and line AND you can articulate the concrete observable consequence (the standard `75` bar: a wrong result, an unhandled error path, a contract mismatch, or missing coverage that a real test scenario would surface).
- **Anchor 100** — never emit. Codex is a second opinion, not direct verification of the kind that justifies absolute certainty.
- **Anchor 25 or below — suppress.** Drop the finding silently. The subagent template suppresses anchors `0` and `25` automatically; this persona enforces the same floor at translation time so noise does not enter synthesis.

## Step 5: Output

Return the findings as JSON matching the contract in the subagent template. Honor the artifact-file write if a Run ID is present: write the full analysis to `/tmp/compound-engineering/ce-code-review/{run_id}/ce-codex-reviewer.json` and return the compact form to the parent.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Write artifact using the reviewer-keyed filename

This hardcodes the artifact filename to ce-codex-reviewer.json, but the same file sets the top-level reviewer key to "codex". The review pipeline’s detail-enrichment steps resolve artifact files by reviewer key ({reviewer_name} / {reviewer}.json), so codex findings can lose why_it_matters and evidence lookups when the file is stored under a different name. Use the reviewer-keyed filename convention to keep downstream matching deterministic.

Useful? React with 👍 / 👎.


The `reviewer` field at the top of your output is `"codex"` (matches the persona-catalog entry).

Empty-findings response shape (used by all early-return paths in Steps 1-3):

```json
{
"reviewer": "codex",
"findings": [],
"residual_risks": ["<reason>"],
"testing_gaps": []
}
```

## What you do not do

- **Do not resolve a base branch or recompute the diff.** The orchestrator's diff is authoritative. Calling `git diff`, `git symbolic-ref`, or `git rev-parse` here was the failure mode flagged by Codex on PR #356 (P1 finding "Stop assuming ce:review passes base branch context"); the new subagent template provides the diff directly so the question does not arise.
- **Do not suppress findings as "already covered by other personas."** You run as an independent parallel subagent and have no visibility into other reviewers' outputs. Synthesis dedupes findings centrally in Stage 5; suppressing here forces guesswork and silently drops valid issues.
- **Do not emit findings about style, formatting, or linter-domain concerns.** Codex sometimes produces these; filter them out at translation time.
- **Do not retry codex on failure.** A single non-zero exit returns the empty-findings JSON with the failure reason in `residual_risks` and stops.
3 changes: 2 additions & 1 deletion plugins/compound-engineering/skills/ce-code-review/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -106,7 +106,7 @@ Routing rules:

## Reviewers

18 reviewer personas in layered conditionals, plus CE-specific agents. See the persona catalog included below for the full catalog.
19 reviewer personas in layered conditionals, plus CE-specific agents. See the persona catalog included below for the full catalog.

**Always-on (every review):**

Expand All @@ -131,6 +131,7 @@ Routing rules:
| `ce-adversarial-reviewer` | Diff >=50 changed non-test/non-generated/non-lockfile lines, or auth, payments, data mutations, external APIs |
| `ce-cli-readiness-reviewer` | CLI command definitions, argument parsing, CLI framework usage, command handler implementations |
| `ce-previous-comments-reviewer` | Reviewing a PR that has existing review comments or threads |
| `ce-codex-reviewer` | Cross-model independent validation when the diff is non-trivial AND the user-level Codex CLI is available on PATH (>=25 changed non-test lines, or correctness-sensitive domains: auth, payments, data, parsing, concurrency, external API contracts) |

**Stack-specific conditional (selected per diff):**

Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Persona Catalog

18 reviewer personas organized into always-on, cross-cutting conditional, and stack-specific conditional layers, plus CE-specific agents. The orchestrator uses this catalog to select which reviewers to spawn for each review.
19 reviewer personas organized into always-on, cross-cutting conditional, and stack-specific conditional layers, plus CE-specific agents. The orchestrator uses this catalog to select which reviewers to spawn for each review.

## Always-on (4 personas + 2 CE agents)

Expand All @@ -22,7 +22,7 @@ Spawned on every review regardless of diff content.
| `ce-agent-native-reviewer` | Verify new features are agent-accessible |
| `ce-learnings-researcher` | Search docs/solutions/ for past issues related to this PR's modules and patterns |

## Conditional (8 personas)
## Conditional (9 personas)

Spawned when the orchestrator identifies relevant patterns in the diff. The orchestrator reads the full diff and reasons about selection -- this is agent judgment, not keyword matching.

Expand All @@ -36,6 +36,7 @@ Spawned when the orchestrator identifies relevant patterns in the diff. The orch
| `adversarial` | `ce-adversarial-reviewer` | Diff has >=50 changed non-test, non-generated, non-lockfile lines, OR touches auth, payments, data mutations, external API integrations, or other high-risk domains |
| `cli-readiness` | `ce-cli-readiness-reviewer` | CLI command definitions, argument parsing, CLI framework usage, command handler implementations |
| `previous-comments` | `ce-previous-comments-reviewer` | **PR-only.** Reviewing a PR that has existing review comments or review threads from prior review rounds. Skip entirely when no PR metadata was gathered in Stage 1. |
| `codex` | `ce-codex-reviewer` | Cross-model independent validation when the diff is non-trivial (>= 25 changed non-test/non-generated lines, or touches correctness-sensitive domains: auth, payments, data mutations, parsing, concurrency, external API contracts) AND the user-level Codex CLI is available on PATH. Codex sees the same diff through a different model's reasoning and catches blind spots same-model reviewers share. Skip silently when codex is not installed (the agent handles its own environment guard). |

## Stack-Specific Conditional (6 personas)

Expand Down
106 changes: 106 additions & 0 deletions tests/review-skill-contract.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -492,6 +492,7 @@ describe("ce-code-review contract", () => {
"ce-julik-frontend-races-reviewer",
"ce-swift-ios-reviewer",
"ce-agent-native-reviewer",
"ce-codex-reviewer",
]

for (const persona of personas) {
Expand Down Expand Up @@ -696,6 +697,111 @@ describe("ce-code-review contract", () => {
})
})

describe("codex-reviewer contract", () => {
test("agent file exists with required frontmatter", async () => {
const content = await readRepoFile(
"plugins/compound-engineering/agents/ce-codex-reviewer.agent.md",
)
const { data } = parseFrontmatter(content)
expect(data.name).toBe("ce-codex-reviewer")
expect(typeof data.description).toBe("string")
expect(data.tools).toContain("Bash")
})

test("uses pre-computed diff from review context, not its own base-branch resolution", async () => {
const content = await readRepoFile(
"plugins/compound-engineering/agents/ce-codex-reviewer.agent.md",
)
// Addresses Codex P1 finding from PR #356 (March 2026): the agent must
// not assume the orchestrator passes a base branch. The new subagent
// template passes the pre-computed diff directly; the agent uses that.
expect(content).toContain("pre-computed diff")
expect(content).toMatch(/do not.*resolve.*base branch/i)
// Negative checks: must not call git diff / symbolic-ref / rev-parse
// for base-branch resolution at the top level of the agent's flow.
expect(content).not.toMatch(/git symbolic-ref refs\/remotes\/origin\/HEAD/)
expect(content).not.toMatch(/git rev-parse --verify.*main/)
})

test("does not suppress findings as 'already covered by other personas'", async () => {
const content = await readRepoFile(
"plugins/compound-engineering/agents/ce-codex-reviewer.agent.md",
)
// Addresses Codex P2 finding from PR #356 (March 2026): the agent runs
// as an independent parallel subagent and has no visibility into other
// reviewers' outputs. Synthesis dedupes centrally in Stage 5.
expect(content).toMatch(/synthesis dedupes findings centrally/i)
expect(content).not.toMatch(/Findings already covered by other personas/i)
})

test("registered in persona-catalog and SKILL.md dispatch table", async () => {
const catalog = await readRepoFile(
"plugins/compound-engineering/skills/ce-code-review/references/persona-catalog.md",
)
expect(catalog).toContain("`codex`")
expect(catalog).toContain("`ce-codex-reviewer`")

const skill = await readRepoFile(
"plugins/compound-engineering/skills/ce-code-review/SKILL.md",
)
expect(skill).toContain("`ce-codex-reviewer`")
})

test("environment guards run before any codex invocation", async () => {
const content = await readRepoFile(
"plugins/compound-engineering/agents/ce-codex-reviewer.agent.md",
)
// Two guards required by the design: already-inside-codex sandbox detection
// (CODEX_SANDBOX / CODEX_SESSION_ID) and codex CLI availability check.
expect(content).toContain("CODEX_SANDBOX")
expect(content).toContain("CODEX_SESSION_ID")
expect(content).toMatch(/which codex/)
// Both guards must fail closed with the empty-findings JSON.
expect(content).toContain("codex-reviewer skipped: already running inside Codex sandbox")
expect(content).toContain("codex-reviewer skipped: codex CLI not installed")
})

test("uses NDJSON output contract so evidence can carry pipes safely", async () => {
const content = await readRepoFile(
"plugins/compound-engineering/agents/ce-codex-reviewer.agent.md",
)
// Addresses Codex P2 finding on commit a8a5ed2 (April 2026): pipe-delimited
// output dropped any row that wasn't exactly five `|`-separated fields,
// so EVIDENCE containing a literal `|` (bitwise OR, shell pipes, markdown
// tables) silently lost real findings. NDJSON is robust to embedded pipes.
expect(content).toContain("NDJSON")
expect(content).toMatch(/JSON-?parse/i)
// Pipe-delimited contract must be gone from the agent's prose.
expect(content).not.toMatch(/SEVERITY\|FILE\|LINE\|TITLE\|EVIDENCE/)
expect(content).not.toMatch(/exactly five `\|`-separated fields/)
})

test("emits schema-valid line numbers for file-level findings", async () => {
const content = await readRepoFile(
"plugins/compound-engineering/agents/ce-codex-reviewer.agent.md",
)
// Addresses Codex P1 finding on commit a8a5ed2 (April 2026): the findings
// schema requires line >= 1, so emitting line=0 for file-level findings
// would silently drop them at the merge validator. Contract says: line=1
// with file_level=true, evidence array prepends the file-level annotation.
expect(content).toContain("file_level")
expect(content).toMatch(/positive integer/i)
expect(content).toMatch(/file-level finding \(no specific line applies\)/)
// Negative check: the old "0 means file-level" wording must be gone.
expect(content).not.toMatch(/0 means file-level/)
expect(content).not.toMatch(/or 0 if file-level/)

// Also assert the schema constraint is what we think it is — if upstream
// ever loosens line to allow 0, this test should be updated to match.
const schemaJson = await readRepoFile(
"plugins/compound-engineering/skills/ce-code-review/references/findings-schema.json",
)
const schema = JSON.parse(schemaJson)
const lineMin = schema.properties?.findings?.items?.properties?.line?.minimum
expect(lineMin).toBe(1)
})
})

describe("testing-reviewer contract", () => {
test("includes behavioral-changes-with-no-test-additions check", async () => {
const content = await readRepoFile("plugins/compound-engineering/agents/ce-testing-reviewer.agent.md")
Expand Down