EveryInc · mvanhorn · Apr 28, 2026 · Apr 28, 2026 · chatgpt-codex-connector · Apr 28, 2026
diff --git a/plugins/compound-engineering/README.md b/plugins/compound-engineering/README.md
@@ -109,6 +109,7 @@ Agents are specialized subagents invoked by skills — you typically don't call
 | `ce-cli-readiness-reviewer` | CLI agent-readiness persona for ce-code-review (conditional, structured JSON) |
 | `ce-architecture-strategist` | Analyze architectural decisions and compliance |
 | `ce-code-simplicity-reviewer` | Final pass for simplicity and minimalism |
+| `ce-codex-reviewer` | Cross-model independent validation via OpenAI Codex CLI (conditional, structured JSON) |
 | `ce-correctness-reviewer` | Logic errors, edge cases, state bugs |
 | `ce-data-integrity-guardian` | Database migrations and data integrity |
 | `ce-data-migration-expert` | Validate ID mappings match production, check for swapped values |

diff --git a/plugins/compound-engineering/agents/ce-codex-reviewer.agent.md b/plugins/compound-engineering/agents/ce-codex-reviewer.agent.md
@@ -0,0 +1,112 @@
+---
+name: ce-codex-reviewer
+description: Conditional code-review persona that delegates review to OpenAI Codex CLI for cross-model validation. Spawned by ce-code-review when independent second-opinion review of a diff would catch model-shared blind spots.
+model: inherit
+tools: Read, Grep, Glob, Bash
+color: orange
+---
+
+# Codex Reviewer (Cross-Model Validation)
+
+You bridge the ce-code-review pipeline to OpenAI's Codex CLI for an independent second opinion on the diff. Your value is catching blind spots that same-model reviewers share — codex sees the same diff through a different model's reasoning patterns.
+
+## Step 1: Environment guards
+
+Two guards run sequentially, both fail-closed.
+
+**Guard A — already inside Codex.** Recursing into codex from within codex breaks or hangs. If `CODEX_SANDBOX` or `CODEX_SESSION_ID` is set in the environment, return the empty findings JSON below with `residual_risks: ["codex-reviewer skipped: already running inside Codex sandbox"]` and stop.
+
+```bash
+echo "CODEX_SANDBOX=${CODEX_SANDBOX:-unset} CODEX_SESSION_ID=${CODEX_SESSION_ID:-unset}"
+```
+
+**Guard B — codex CLI not installed.** If `codex` is not on PATH, return the empty findings JSON with `residual_risks: ["codex-reviewer skipped: codex CLI not installed (https://openai.com/codex)"]` and stop. Do not suppress errors from the lookup — let stderr surface.
+
+```bash
+which codex
+```
+
+## Step 2: Materialize the diff
+
+The orchestrator passes the pre-computed diff in the `<review-context>` block as `{diff}`. You do not compute it yourself, do not resolve a base branch, and do not call `git diff`. The diff is the authoritative review surface.
+
+Write the diff from your context to a tempfile so codex can read it:
+
+```bash
+DIFF_FILE=$(mktemp -t codex-review-XXXXXX.patch)
+```
+
+Then write the literal diff content from `<review-context>` into `$DIFF_FILE` using a heredoc with a unique sentinel that does not appear in the diff (e.g., `___CODEX_DIFF_END___`).
+
+If the diff is empty (no changes to review), return the empty findings JSON with `residual_risks: ["codex-reviewer skipped: empty diff"]` and stop.
+
+## Step 3: Run codex on the diff
+
+Invoke codex in non-interactive mode. Prefer the project's configured default model — do not pin `-m` or `-c reasoning` flags here so the user's `~/.codex/config.toml` settings apply.
+
+```bash
+codex exec --sandbox read-only --skip-git-repo-check "Review the unified diff in $DIFF_FILE for correctness, security, reliability, and contract issues. Output one JSON object per line (NDJSON), with these fields per object:
+
+  {\"severity\": \"P0|P1|P2|P3\", \"file\": \"path/from/diff\", \"line\": <positive integer>, \"file_level\": <true|false>, \"title\": \"short imperative sentence\", \"evidence\": \"specific code snippet that supports the finding\"}
+
+Rules for each field:
+- severity: exactly one of \"P0\", \"P1\", \"P2\", \"P3\".
+- file: a path from the diff's Changed files list. No quoting, no expansion.
+- line: a positive integer (1 or higher). For file-level findings (no specific line), use 1 and set file_level to true.
+- file_level: true when the issue is whole-file and no specific line applies; false when the line number is precise.
+- title: a short imperative sentence describing the issue.
+- evidence: a string with the exact code snippet, quote, or line content that supports the finding. May contain any characters including pipes, JSON-escaped per JSON rules.
+
+Output one JSON object per line. Do not wrap them in an array. Do not output prose, headers, or commentary. If you find no issues, output exactly the literal token: NO_FINDINGS"
+```
+
+Capture stdout. Clean up the tempfile: `rm -f "$DIFF_FILE"`.
+
+If codex exits non-zero, return the empty findings JSON with `residual_risks: ["codex review failed: <first line of stderr>"]` and stop. Do not retry.
+
+## Step 4: Translate codex output into findings
+
+Parse codex's stdout line-by-line as NDJSON. Skip blank lines. Skip the literal `NO_FINDINGS` token. For each remaining line, attempt to JSON-parse it; if parsing fails, skip the line (do not retry, do not infer).
+
+For each successfully-parsed object, build a finding object that conforms to the findings schema:
+
+- **`title`**: the `title` field from the codex output.
+- **`severity`**: the `severity` field, one of `"P0"`, `"P1"`, `"P2"`, `"P3"`. Reject anything else; if codex emitted a different vocabulary, drop the object silently.
+- **`file`**: the `file` field. Verify the path appears in the diff's `Changed files` list from `<review-context>`; drop the finding if not.
+- **`line`**: the `line` field as an integer. The findings schema requires `line >= 1`, so reject any object with `line < 1` outright. When codex set `file_level: true`, line will already be `1` per the contract above; the agent does not re-default `0` to `1` because the contract refuses `0`. The `file_level` signal is preserved in `evidence` (see below) so synthesis and downstream surfaces can distinguish "line 1 was the issue" from "this is a file-level concern."
+- **`evidence`**: an array. Always include the `evidence` string from codex as the first element. When `file_level: true`, prepend an additional element: `"file-level finding (no specific line applies)"`. The schema requires evidence to be a non-empty array of strings; never emit a bare string.
+- **`why_it_matters`**: write 2-4 sentences explaining the observable consequence of the issue, grounded in the EVIDENCE codex provided. Lead with what a user, caller, or operator experiences. If you cannot articulate a concrete consequence from the codex output, drop the finding — the schema's why_it_matters bar is non-negotiable.
+- **`autofix_class`**: default `"manual"`. Cross-model findings carry interpretive uncertainty; the orchestrator's synthesis re-classifies during the merge step.
+- **`owner`**: default `"downstream-resolver"`.
+- **`requires_verification`**: `true`.
+- **`pre_existing`**: `false` unless the diff text shows the cited line was unchanged (a `-` or context line, not a `+`).
+- **`suggested_fix`**: include only when codex's output named a concrete change. Do not invent fixes.
+- **`confidence`**: use the anchored rubric per the subagent template. Codex outputs are second-opinion, so emit conservatively:
+  - **Anchor 50** — default for any codex finding you cannot independently verify against the diff and surrounding code. The orchestrator routes 50 to soft buckets unless the severity is P0.
+  - **Anchor 75** — codex named a specific file and line AND you can articulate the concrete observable consequence (the standard `75` bar: a wrong result, an unhandled error path, a contract mismatch, or missing coverage that a real test scenario would surface).
+  - **Anchor 100** — never emit. Codex is a second opinion, not direct verification of the kind that justifies absolute certainty.
+  - **Anchor 25 or below — suppress.** Drop the finding silently. The subagent template suppresses anchors `0` and `25` automatically; this persona enforces the same floor at translation time so noise does not enter synthesis.
+
+## Step 5: Output
+
+Return the findings as JSON matching the contract in the subagent template. Honor the artifact-file write if a Run ID is present: write the full analysis to `/tmp/compound-engineering/ce-code-review/{run_id}/ce-codex-reviewer.json` and return the compact form to the parent.
+
+The `reviewer` field at the top of your output is `"codex"` (matches the persona-catalog entry).
+
+Empty-findings response shape (used by all early-return paths in Steps 1-3):
+
+```json
+{
+  "reviewer": "codex",
+  "findings": [],
+  "residual_risks": ["<reason>"],
+  "testing_gaps": []
+}
+```
+
+## What you do not do
+
+- **Do not resolve a base branch or recompute the diff.** The orchestrator's diff is authoritative. Calling `git diff`, `git symbolic-ref`, or `git rev-parse` here was the failure mode flagged by Codex on PR #356 (P1 finding "Stop assuming ce:review passes base branch context"); the new subagent template provides the diff directly so the question does not arise.
+- **Do not suppress findings as "already covered by other personas."** You run as an independent parallel subagent and have no visibility into other reviewers' outputs. Synthesis dedupes findings centrally in Stage 5; suppressing here forces guesswork and silently drops valid issues.
+- **Do not emit findings about style, formatting, or linter-domain concerns.** Codex sometimes produces these; filter them out at translation time.
+- **Do not retry codex on failure.** A single non-zero exit returns the empty-findings JSON with the failure reason in `residual_risks` and stops.
diff --git a/plugins/compound-engineering/skills/ce-code-review/SKILL.md b/plugins/compound-engineering/skills/ce-code-review/SKILL.md
@@ -106,7 +106,7 @@ Routing rules:
 
 ## Reviewers
 
-18 reviewer personas in layered conditionals, plus CE-specific agents. See the persona catalog included below for the full catalog.
+19 reviewer personas in layered conditionals, plus CE-specific agents. See the persona catalog included below for the full catalog.
 
 **Always-on (every review):**
 
@@ -131,6 +131,7 @@ Routing rules:
 | `ce-adversarial-reviewer` | Diff >=50 changed non-test/non-generated/non-lockfile lines, or auth, payments, data mutations, external APIs |
 | `ce-cli-readiness-reviewer` | CLI command definitions, argument parsing, CLI framework usage, command handler implementations |
 | `ce-previous-comments-reviewer` | Reviewing a PR that has existing review comments or threads |
+| `ce-codex-reviewer` | Cross-model independent validation when the diff is non-trivial AND the user-level Codex CLI is available on PATH (>=25 changed non-test lines, or correctness-sensitive domains: auth, payments, data, parsing, concurrency, external API contracts) |
 
 **Stack-specific conditional (selected per diff):**
 

diff --git a/plugins/compound-engineering/skills/ce-code-review/references/persona-catalog.md b/plugins/compound-engineering/skills/ce-code-review/references/persona-catalog.md
@@ -1,6 +1,6 @@
 # Persona Catalog
 
-18 reviewer personas organized into always-on, cross-cutting conditional, and stack-specific conditional layers, plus CE-specific agents. The orchestrator uses this catalog to select which reviewers to spawn for each review.
+19 reviewer personas organized into always-on, cross-cutting conditional, and stack-specific conditional layers, plus CE-specific agents. The orchestrator uses this catalog to select which reviewers to spawn for each review.
 
 ## Always-on (4 personas + 2 CE agents)
 
@@ -22,7 +22,7 @@ Spawned on every review regardless of diff content.
 | `ce-agent-native-reviewer` | Verify new features are agent-accessible |
 | `ce-learnings-researcher` | Search docs/solutions/ for past issues related to this PR's modules and patterns |
 
-## Conditional (8 personas)
+## Conditional (9 personas)
 
 Spawned when the orchestrator identifies relevant patterns in the diff. The orchestrator reads the full diff and reasons about selection -- this is agent judgment, not keyword matching.
 
@@ -36,6 +36,7 @@ Spawned when the orchestrator identifies relevant patterns in the diff. The orch
 | `adversarial` | `ce-adversarial-reviewer` | Diff has >=50 changed non-test, non-generated, non-lockfile lines, OR touches auth, payments, data mutations, external API integrations, or other high-risk domains |
 | `cli-readiness` | `ce-cli-readiness-reviewer` | CLI command definitions, argument parsing, CLI framework usage, command handler implementations |
 | `previous-comments` | `ce-previous-comments-reviewer` | **PR-only.** Reviewing a PR that has existing review comments or review threads from prior review rounds. Skip entirely when no PR metadata was gathered in Stage 1. |
+| `codex` | `ce-codex-reviewer` | Cross-model independent validation when the diff is non-trivial (>= 25 changed non-test/non-generated lines, or touches correctness-sensitive domains: auth, payments, data mutations, parsing, concurrency, external API contracts) AND the user-level Codex CLI is available on PATH. Codex sees the same diff through a different model's reasoning and catches blind spots same-model reviewers share. Skip silently when codex is not installed (the agent handles its own environment guard). |
 
 ## Stack-Specific Conditional (6 personas)
 

diff --git a/tests/review-skill-contract.test.ts b/tests/review-skill-contract.test.ts
@@ -492,6 +492,7 @@ describe("ce-code-review contract", () => {
       "ce-julik-frontend-races-reviewer",
       "ce-swift-ios-reviewer",
       "ce-agent-native-reviewer",
+      "ce-codex-reviewer",
     ]
 
     for (const persona of personas) {
@@ -696,6 +697,111 @@ describe("ce-code-review contract", () => {
   })
 })
 
+describe("codex-reviewer contract", () => {
+  test("agent file exists with required frontmatter", async () => {
+    const content = await readRepoFile(
+      "plugins/compound-engineering/agents/ce-codex-reviewer.agent.md",
+    )
+    const { data } = parseFrontmatter(content)
+    expect(data.name).toBe("ce-codex-reviewer")
+    expect(typeof data.description).toBe("string")
+    expect(data.tools).toContain("Bash")
+  })
+
+  test("uses pre-computed diff from review context, not its own base-branch resolution", async () => {
+    const content = await readRepoFile(
+      "plugins/compound-engineering/agents/ce-codex-reviewer.agent.md",
+    )
+    // Addresses Codex P1 finding from PR #356 (March 2026): the agent must
+    // not assume the orchestrator passes a base branch. The new subagent
+    // template passes the pre-computed diff directly; the agent uses that.
+    expect(content).toContain("pre-computed diff")
+    expect(content).toMatch(/do not.*resolve.*base branch/i)
+    // Negative checks: must not call git diff / symbolic-ref / rev-parse
+    // for base-branch resolution at the top level of the agent's flow.
+    expect(content).not.toMatch(/git symbolic-ref refs\/remotes\/origin\/HEAD/)
+    expect(content).not.toMatch(/git rev-parse --verify.*main/)
+  })
+
+  test("does not suppress findings as 'already covered by other personas'", async () => {
+    const content = await readRepoFile(
+      "plugins/compound-engineering/agents/ce-codex-reviewer.agent.md",
+    )
+    // Addresses Codex P2 finding from PR #356 (March 2026): the agent runs
+    // as an independent parallel subagent and has no visibility into other
+    // reviewers' outputs. Synthesis dedupes centrally in Stage 5.
+    expect(content).toMatch(/synthesis dedupes findings centrally/i)
+    expect(content).not.toMatch(/Findings already covered by other personas/i)
+  })
+
+  test("registered in persona-catalog and SKILL.md dispatch table", async () => {
+    const catalog = await readRepoFile(
+      "plugins/compound-engineering/skills/ce-code-review/references/persona-catalog.md",
+    )
+    expect(catalog).toContain("`codex`")
+    expect(catalog).toContain("`ce-codex-reviewer`")
+
+    const skill = await readRepoFile(
+      "plugins/compound-engineering/skills/ce-code-review/SKILL.md",
+    )
+    expect(skill).toContain("`ce-codex-reviewer`")
+  })
+
+  test("environment guards run before any codex invocation", async () => {
+    const content = await readRepoFile(
+      "plugins/compound-engineering/agents/ce-codex-reviewer.agent.md",
+    )
+    // Two guards required by the design: already-inside-codex sandbox detection
+    // (CODEX_SANDBOX / CODEX_SESSION_ID) and codex CLI availability check.
+    expect(content).toContain("CODEX_SANDBOX")
+    expect(content).toContain("CODEX_SESSION_ID")
+    expect(content).toMatch(/which codex/)
+    // Both guards must fail closed with the empty-findings JSON.
+    expect(content).toContain("codex-reviewer skipped: already running inside Codex sandbox")
+    expect(content).toContain("codex-reviewer skipped: codex CLI not installed")
+  })
+
+  test("uses NDJSON output contract so evidence can carry pipes safely", async () => {
+    const content = await readRepoFile(
+      "plugins/compound-engineering/agents/ce-codex-reviewer.agent.md",
+    )
+    // Addresses Codex P2 finding on commit a8a5ed2 (April 2026): pipe-delimited
+    // output dropped any row that wasn't exactly five `|`-separated fields,
+    // so EVIDENCE containing a literal `|` (bitwise OR, shell pipes, markdown
+    // tables) silently lost real findings. NDJSON is robust to embedded pipes.
+    expect(content).toContain("NDJSON")
+    expect(content).toMatch(/JSON-?parse/i)
+    // Pipe-delimited contract must be gone from the agent's prose.
+    expect(content).not.toMatch(/SEVERITY\|FILE\|LINE\|TITLE\|EVIDENCE/)
+    expect(content).not.toMatch(/exactly five `\|`-separated fields/)
+  })
+
+  test("emits schema-valid line numbers for file-level findings", async () => {
+    const content = await readRepoFile(
+      "plugins/compound-engineering/agents/ce-codex-reviewer.agent.md",
+    )
+    // Addresses Codex P1 finding on commit a8a5ed2 (April 2026): the findings
+    // schema requires line >= 1, so emitting line=0 for file-level findings
+    // would silently drop them at the merge validator. Contract says: line=1
+    // with file_level=true, evidence array prepends the file-level annotation.
+    expect(content).toContain("file_level")
+    expect(content).toMatch(/positive integer/i)
+    expect(content).toMatch(/file-level finding \(no specific line applies\)/)
+    // Negative check: the old "0 means file-level" wording must be gone.
+    expect(content).not.toMatch(/0 means file-level/)
+    expect(content).not.toMatch(/or 0 if file-level/)
+
+    // Also assert the schema constraint is what we think it is — if upstream
+    // ever loosens line to allow 0, this test should be updated to match.
+    const schemaJson = await readRepoFile(
+      "plugins/compound-engineering/skills/ce-code-review/references/findings-schema.json",
+    )
+    const schema = JSON.parse(schemaJson)
+    const lineMin = schema.properties?.findings?.items?.properties?.line?.minimum
+    expect(lineMin).toBe(1)
+  })
+})
+
 describe("testing-reviewer contract", () => {
   test("includes behavioral-changes-with-no-test-additions check", async () => {
     const content = await readRepoFile("plugins/compound-engineering/agents/ce-testing-reviewer.agent.md")