Merged
6 changes: 3 additions & 3 deletions plugins/compound-engineering/skills/ce-code-review/SKILL.md
@@ -67,7 +67,7 @@ Sequence:
- **Skip all user questions.** Never pause for approval or clarification once scope has been established.
- **Apply only `safe_auto -> review-fixer` findings.** Leave `gated_auto`, `manual`, `human`, and `release` work unresolved.
- **Write a run artifact** under `/tmp/compound-engineering/ce-code-review/<run-id>/` summarizing findings, applied fixes, residual actionable work, and advisory outputs. Orchestrators read this artifact to route residual `downstream-resolver` findings; the skill itself does not file tickets or prompt the user in autofix.
- **Emit a compact Residual Actionable Work summary in the autofix return** listing each residual `downstream-resolver` finding with severity, file:line, title, and autofix_class. Include the run-artifact path. Callers read this summary directly without parsing the artifact. When no residuals exist, state `Residual actionable work: none.` explicitly.
- **Emit a compact Residual Actionable Work summary in the autofix return** listing each residual `downstream-resolver` finding with its stable `#`, severity, file:line, title, and autofix_class. Structure the summary as two separate contiguous sections: applied `safe_auto` fixes first, then residual non-auto findings. Within the residual section, reuse each finding's stable `#` from Stage 5 -- never renumber. Include the run-artifact path. Callers read this summary directly without parsing the artifact. When no residuals exist, state `Residual actionable work: none.` explicitly.
- **Never commit, push, or create a PR** from autofix mode. Parent workflows own those decisions.

### Report-only mode rules
@@ -550,7 +550,7 @@ Demotion is intentionally narrow. The conservative scope (testing/maintainabilit
- in-skill fixer queue: only `safe_auto -> review-fixer`
- residual actionable queue: unresolved `gated_auto` or `manual` findings whose owner is `downstream-resolver`
- report-only queue: `advisory` findings plus anything owned by `human` or `release`
9. **Sort.** Order by severity (P0 first) -> anchor (descending) -> file path -> line number.
9. **Sort and number.** Order by severity (P0 first) -> anchor (descending) -> file path -> line number, then assign monotonically increasing `#` values across the full primary finding set in that sorted order. Do not restart numbering inside each severity table or autofix/routing bucket. If later sections repeat a finding (for example Residual Actionable Work after `safe_auto` fixes are applied), reuse the same stable `#` so users -- and downstream skills like `ce-resolve-pr-feedback` -- can reference findings by `#` after the autofix loop rewrites the report. Renumbering after autofix invalidates any prior reference: copied snippets, follow-up prompts citing `#3`, or tickets filed against an earlier render.
10. **Collect coverage data.** Union residual_risks and testing_gaps across reviewers.
11. **Preserve CE agent artifacts.** Keep the learnings, agent-native, schema-drift, and deployment-verification outputs alongside the merged finding set. Do not drop unstructured agent output just because it does not match the persona JSON schema.

@@ -600,7 +600,7 @@ When Stage 5b does not run, the merged finding set from Stage 5 flows through to
Assemble the final report using **pipe-delimited markdown tables for findings** from the review output template included below. The table format is mandatory for finding rows in interactive mode — do not render findings as freeform text blocks or horizontal-rule-separated prose. Other report sections (Applied Fixes, Learnings, Coverage, etc.) use bullet lists and the `---` separator before the verdict, as shown in the template.

1. **Header.** Scope, intent, mode, reviewer team with per-conditional justifications.
2. **Findings.** Rendered as pipe-delimited tables grouped by severity (`### P0 -- Critical`, `### P1 -- High`, `### P2 -- Moderate`, `### P3 -- Low`). Each finding row shows `#`, file, issue, reviewer(s), confidence, and synthesized route. Omit empty severity levels. Never render findings as freeform text blocks or numbered lists.
2. **Findings.** Rendered as pipe-delimited tables grouped by severity (`### P0 -- Critical`, `### P1 -- High`, `### P2 -- Moderate`, `### P3 -- Low`). Each finding row shows `#`, file, issue, reviewer(s), confidence, and synthesized route. Omit empty severity levels. Never render findings as freeform text blocks or numbered lists. Finding numbers come from the stable assignment in Stage 5 -- never re-derive them per severity table.
3. **Requirements Completeness.** Include only when a plan was found in Stage 2b. For each requirement (R1, R2, etc.) and implementation unit in the plan, report whether corresponding work appears in the diff. Use a simple checklist: met / not addressed / partially addressed. Routing depends on `plan_source`:
- **`explicit`** (caller-provided or PR body): Flag unaddressed requirements as P1 findings with `autofix_class: manual`, `owner: downstream-resolver`. These enter the residual actionable queue.
- **`inferred`** (auto-discovered): Flag unaddressed requirements as P3 findings with `autofix_class: advisory`, `owner: human`. These stay in the report only — no autonomous follow-up. An inferred plan match is a hint, not a contract.
@@ -51,7 +51,7 @@ Use this **exact format** when presenting synthesized review findings. Findings
| # | File | Issue | Route | Next Step |
|---|------|-------|-------|-----------|
| 1 | `orders_controller.rb:42` | Ownership check missing on export lookup | `gated_auto -> downstream-resolver` | Defer via tracker (requires explicit approval before behavior change) |
| 2 | `export_service.rb:91` | Pagination contract needs a broader API decision | `manual -> downstream-resolver` | Defer via tracker with contract and client impact details |
| 3 | `export_service.rb:91` | Pagination contract needs a broader API decision | `manual -> downstream-resolver` | Defer via tracker with contract and client impact details |

### Pre-existing Issues

@@ -117,6 +117,7 @@ This fails because: no pipe-delimited tables, no severity-grouped `###` headers,

- **Pipe-delimited markdown tables** for findings -- never ASCII box-drawing characters or per-finding horizontal-rule separators between entries (the report-level `---` before the verdict is still required)
- **Severity-grouped sections** -- `### P0 -- Critical`, `### P1 -- High`, `### P2 -- Moderate`, `### P3 -- Low`. Omit empty severity levels.
- **Stable sequential finding numbers** -- assign finding numbers once after sorting, continue them across severity sections, and reuse those same numbers when findings are repeated in Residual Actionable Work. Do not restart at `1` for each severity or route bucket.

[P2] Keep residual example IDs aligned with stable numbering

The new rule requires reusing stable finding numbers, but the example in this template still renumbers Residual Actionable Work entries (1, 2) instead of preserving original IDs from the findings table (1, 3 when #2 was auto-fixed). Because this file is used as the canonical output pattern, the contradiction can cause the reviewer output to keep remapping IDs, which breaks downstream correlation for residual work even though Stage 5 now defines stable numbering.
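
A minimal TypeScript sketch of the behavior this comment asks the example to show: number once after sorting, then filter for residual work without renumbering. The `Finding` shape, the helper names, and the sample data are assumptions made for illustration; the skill states these rules as prose rather than code.

```typescript
// Hypothetical finding shape; field names are assumptions for illustration.
type Severity = "P0" | "P1" | "P2" | "P3";

interface Finding {
  severity: Severity;
  anchor: number; // confidence anchor: 50, 75, or 100
  file: string;
  line: number;
  route: string; // e.g. "safe_auto -> review-fixer"
}

interface NumberedFinding extends Finding {
  id: number; // the stable `#`
}

const severityRank: Record<Severity, number> = { P0: 0, P1: 1, P2: 2, P3: 3 };

// Sort by severity (P0 first) -> anchor (descending) -> file path -> line,
// then assign `#` once across the whole sorted set.
function assignStableNumbers(findings: Finding[]): NumberedFinding[] {
  const sorted = [...findings].sort(
    (a, b) =>
      severityRank[a.severity] - severityRank[b.severity] ||
      b.anchor - a.anchor ||
      a.file.localeCompare(b.file) ||
      a.line - b.line,
  );
  return sorted.map((f, i) => ({ ...f, id: i + 1 }));
}

// Residual work is a filter over the numbered set; ids are carried, never recomputed.
function residualWork(numbered: NumberedFinding[]): NumberedFinding[] {
  return numbered.filter((f) => f.route.endsWith("downstream-resolver"));
}

// Made-up sample data: the middle finding is auto-fixed, so it drops out of the residual list.
const sample: Finding[] = [
  { severity: "P2", anchor: 75, file: "export_service.rb", line: 91, route: "manual -> downstream-resolver" },
  { severity: "P1", anchor: 75, file: "export_service.rb", line: 87, route: "safe_auto -> review-fixer" },
  { severity: "P1", anchor: 100, file: "orders_controller.rb", line: 42, route: "gated_auto -> downstream-resolver" },
];

const residualIds = residualWork(assignStableNumbers(sample)).map((f) => f.id);
console.log(residualIds); // [ 1, 3 ]
```

Because the residual list is derived by filtering rather than by re-sorting and counting again, the gap left by the auto-fixed `#2` stays visible, which is what keeps earlier references to `#3` valid.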


Collaborator


Show the rule in the example tables, don't just declare it as a bullet.

This template already teaches by example -- good rendering above (~L70-100) and bad rendering (~L100-114). The new rule appears only as a declarative bullet here. LLMs rendering the report lean on the example more than the bullet.

Two concrete updates to the example section above:

  1. Update the existing P1 / P2 sample tables so the P2 section starts at #3 (not #1), continuing from the P1 section's two findings -- demonstrating the cross-severity continuation.
  2. Add a small Residual Actionable Work mini-table below that re-cites one of those # values, demonstrating the post-autofix reuse.

Without an example, the rule is easy to read past.
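
The cross-severity continuation requested above can be sketched as a small renderer. This is an illustration only: the `Row` shape and `renderSeverityTables` are hypothetical names, and the columns are trimmed to `#`, File, and Issue (the real template also carries reviewer, confidence, and route).

```typescript
// Hypothetical row shape; `id` is the stable `#` assigned once after sorting.
interface Row {
  id: number;
  severity: "P0" | "P1" | "P2" | "P3";
  file: string;
  issue: string;
}

const headings = {
  P0: "### P0 -- Critical",
  P1: "### P1 -- High",
  P2: "### P2 -- Moderate",
  P3: "### P3 -- Low",
} as const;

// Group rows by severity but print each row's existing id,
// so a P2 table continues from wherever the P1 table stopped.
function renderSeverityTables(rows: Row[]): string {
  const sections: string[] = [];
  for (const severity of ["P0", "P1", "P2", "P3"] as const) {
    const group = rows.filter((r) => r.severity === severity);
    if (group.length === 0) continue; // omit empty severity levels
    const body = group.map((r) => `| ${r.id} | \`${r.file}\` | ${r.issue} |`);
    sections.push(
      [headings[severity], "", "| # | File | Issue |", "|---|------|-------|", ...body].join("\n"),
    );
  }
  return sections.join("\n\n");
}
```

The renderer never counts rows itself; it only prints the `id` it is given, so there is no place for a per-table reset to creep in.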

- **Always include file:line location** for code review issues
- **Reviewer column** shows which persona(s) flagged the issue. Multiple reviewers = cross-reviewer agreement.
- **Confidence column** shows the finding's anchor as an integer (`50`, `75`, or `100`). Never render as a float.
31 changes: 31 additions & 0 deletions tests/fixtures/ce-code-review-stable-numbering.md
@@ -0,0 +1,31 @@
## Code Review Results

**Scope:** merge-base with main -> working tree
**Intent:** Demonstrate stable finding numbering
**Mode:** autofix

**Reviewers:** correctness, testing, maintainability

### P1 -- High

| # | File | Issue | Reviewer | Confidence | Route |
|---|------|-------|----------|------------|-------|
| 1 | `export_service.rb:87` | Loads all orders into memory | performance | 100 | `safe_auto -> review-fixer` |
| 2 | `export_service.rb:91` | Missing pagination contract | api-contract | 75 | `manual -> downstream-resolver` |

### P2 -- Moderate

| # | File | Issue | Reviewer | Confidence | Route |
|---|------|-------|----------|------------|-------|
| 3 | `export_service.rb:45` | Missing error handling | correctness | 75 | `gated_auto -> downstream-resolver` |

### Applied Fixes

- `safe_auto`: Applied bounded export loading fix for #1.

### Residual Actionable Work

| # | File | Issue | Route | Next Step |
|---|------|-------|-------|-----------|
| 2 | `export_service.rb:91` | Missing pagination contract | `manual -> downstream-resolver` | Defer via tracker with API contract context |
| 3 | `export_service.rb:45` | Missing error handling | `gated_auto -> downstream-resolver` | Defer via tracker pending behavior approval |
38 changes: 38 additions & 0 deletions tests/review-skill-contract.test.ts
@@ -731,8 +731,46 @@ describe("ce-code-review contract", () => {
  test("ce-code-review autofix emits a residual-work summary in-chat, not only in the artifact", async () => {
    const content = await readRepoFile("plugins/compound-engineering/skills/ce-code-review/SKILL.md")
    expect(content).toMatch(/Emit a compact Residual Actionable Work summary/)
    expect(content).toContain("with its stable `#`, severity, file:line, title, and autofix_class")
    expect(content).toContain("Structure the summary as two separate contiguous sections")
    expect(content).toContain("applied `safe_auto` fixes first, then residual non-auto findings")
    expect(content).toContain("reuse each finding's stable `#` from Stage 5 -- never renumber")
    expect(content).toContain("Residual actionable work: none.")
  })

  test("ce-code-review uses stable sequential finding numbers across grouped output", async () => {
    const content = await readRepoFile("plugins/compound-engineering/skills/ce-code-review/SKILL.md")
    const template = await readRepoFile(
      "plugins/compound-engineering/skills/ce-code-review/references/review-output-template.md",
    )
    const fixture = await readRepoFile("tests/fixtures/ce-code-review-stable-numbering.md")

    const stage5 = content.split("### Stage 5b:")[0].split("### Stage 5:")[1]
    expect(stage5).toMatch(/Sort and number/)
    expect(stage5).toMatch(/Do not restart numbering inside each severity table or autofix\/routing bucket/)
    expect(stage5).toMatch(/reuse the same stable `#`/)
    expect(stage5).toMatch(/ce-resolve-pr-feedback/)

    const stage6 = content.split("### Headless output format")[0].split("### Stage 6: Synthesize and present")[1]
    expect(stage6).toContain("Finding numbers come from the stable assignment in Stage 5")
    expect(stage6).toContain("never re-derive them per severity table")
    expect(template).toContain("Stable sequential finding numbers")
    expect(template).toContain("reuse those same numbers when findings are repeated in Residual Actionable Work")

    const primaryFindingIds = Array.from(
      fixture.matchAll(/^\| (\d+) \| `[^`]+` \| .* \| .* \| \d+ \| `.*` \|$/gm),
      ([, id]) => Number(id),
    )
    expect(primaryFindingIds).toEqual([1, 2, 3])

    const residualSection = fixture.split("### Residual Actionable Work")[1]
    const residualIds = Array.from(
      residualSection.matchAll(/^\| (\d+) \| `[^`]+` \| .* \| `.*` \| .* \|$/gm),
      ([, id]) => Number(id),
    )
    expect(residualIds).toEqual([2, 3])
    expect(residualIds.every((id) => primaryFindingIds.includes(id))).toBe(true)
  })
Comment on lines +741 to +773
Collaborator


Assertions are too loose -- tighten location anchoring and consider one behavioral check.

Two issues with the current test:

  1. expect(content).toContain("Assign stable finding numbers once") passes even if someone moves the rule into a stale comment, deletes the surrounding step, or adds contradictory guidance elsewhere. The test confirms the rule is written, not that it lives in the right place.
  2. It checks prose presence, not behavior. Compare to ce-plan's Phase 6 checklist (SKILL.md:802) which asserts structural invariants ("U-IDs are unique within the plan and follow the stability rule").

Two concrete improvements -- either is worth doing; both is better:

(a) Anchor the prose assertions to their location by slicing the file around the Stage 5 / Stage 5b headings so the test fails loudly if the rule is moved out of context:

const stage5 = content.split("### Stage 5b:")[0].split("### Stage 5:")[1]
expect(stage5).toMatch(/Assign stable finding numbers once/)
expect(stage5).toMatch(/reuse the same stable `#`/)

(b) Add at least one behavioral check. Even if the rendering itself lives in skill prose (LLM-driven), you can add a fixture in tests/fixtures/ showing a correctly-rendered multi-severity report and assert:

  • numbers across ### P0, ### P1, ### P2, ### P3 tables are strictly monotonic (never reset to 1)
  • a finding repeated in a Residual Actionable Work section retains the same #

A fixture-based assertion exercises a concrete artifact rather than a string in skill prose, and would catch the actual failure mode the rule is trying to prevent.
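
One possible shape for the behavioral check in (b), written as a standalone TypeScript helper rather than as test-runner code. The function names and the returned message strings are hypothetical; the sketch assumes finding rows start with `| <number> |` and that the residual section begins at a `### Residual Actionable Work` heading, as in the report format discussed here.

```typescript
// Collect the leading `#` cell from every table data row in a markdown chunk.
// Header rows ("| # | ...") and separator rows ("|---|...") do not match.
function tableIds(markdown: string): number[] {
  return Array.from(markdown.matchAll(/^\| (\d+) \|/gm), (m) => Number(m[1]));
}

function isStrictlyIncreasing(ids: number[]): boolean {
  let previous = -Infinity;
  for (const id of ids) {
    if (id <= previous) return false; // a reset to 1 or a repeat fails here
    previous = id;
  }
  return true;
}

// Returns a list of violations; an empty list means the report passes.
function checkStableNumbering(report: string): string[] {
  const parts = report.split("### Residual Actionable Work");
  const primaryIds = tableIds(parts[0] ?? "");
  const residualIds = tableIds(parts[1] ?? "");
  const problems: string[] = [];
  if (!isStrictlyIncreasing(primaryIds)) {
    problems.push("primary numbering resets or repeats");
  }
  for (const id of residualIds) {
    if (!primaryIds.includes(id)) {
      problems.push(`residual #${id} not in primary findings`);
    }
  }
  return problems;
}
```

This only checks that a residual `#` exists among the primary findings. A stricter variant could also compare the file or issue cell to confirm that the number still points at the same finding.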

})

describe("testing-reviewer contract", () => {