fix(ce-code-review): keep finding numbers stable#754
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 107fca76d2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
|
||
| - **Pipe-delimited markdown tables** for findings -- never ASCII box-drawing characters or per-finding horizontal-rule separators between entries (the report-level `---` before the verdict is still required) | ||
| - **Severity-grouped sections** -- `### P0 -- Critical`, `### P1 -- High`, `### P2 -- Moderate`, `### P3 -- Low`. Omit empty severity levels. | ||
| - **Stable sequential finding numbers** -- assign finding numbers once after sorting, continue them across severity sections, and reuse those same numbers when findings are repeated in Residual Actionable Work. Do not restart at `1` for each severity or route bucket. |
There was a problem hiding this comment.
Keep residual example IDs aligned with stable numbering
The new rule requires reusing stable finding numbers, but the example in this template still renumbers Residual Actionable Work entries (1, 2) instead of preserving original IDs from the findings table (1, 3 when #2 was auto-fixed). Because this file is used as the canonical output pattern, the contradiction can cause the reviewer output to keep remapping IDs, which breaks downstream correlation for residual work even though Stage 5 now defines stable numbering.
Useful? React with 👍 / 👎.
tmchow
left a comment
There was a problem hiding this comment.
Thanks for tackling this -- the core idea (assign once, reuse downstream) is right. Comparing against how ce-plan and ce-brainstorm handle stable IDs (R-IDs, U-IDs, etc.) surfaced a few places this could be tightened. Inline comments below; the priority items are the two on SKILL.md line 554 (step renumbering / fold into step 9) and the test loosening on the new contract test. Everything else is polish.
For reference, the durable-ID pattern those skills use lives in plugins/compound-engineering/skills/ce-plan/SKILL.md:401 and plugins/compound-engineering/skills/ce-plan/references/deepening-workflow.md:230 -- worth a skim if you haven't already, since the language patterns there transfer well.
| - residual actionable queue: unresolved `gated_auto` or `manual` findings whose owner is `downstream-resolver` | ||
| - report-only queue: `advisory` findings plus anything owned by `human` or `release` | ||
| 9. **Sort.** Order by severity (P0 first) -> anchor (descending) -> file path -> line number. | ||
| 9b. **Assign stable finding numbers once.** After sorting, assign monotonically increasing `#` values across the full primary finding set. Do not restart numbering inside each severity table or autofix/routing bucket. If later sections repeat a finding (for example Residual Actionable Work after `safe_auto` fixes are applied), reuse the same stable `#` so users can copy the non-auto-fixable work without remapping IDs. |
There was a problem hiding this comment.
Drop the 9b. workaround -- renumber, or fold into step 9.
Adding 9b. between existing steps 9 and 10 looks like a workaround to avoid renumbering 10. Collect coverage data and 11. Preserve CE agent artifacts. But these are private pipeline step numbers, not external references -- there's no ce-work analog citing them. The stability discipline ce-plan teaches applies to user-facing R/U IDs, not internal step numbers.
Pick one:
-
Preferred: renumber to
10. **Assign stable finding numbers once.**and bump the existing 10/11 to 11/12. -
Even better: fold into existing step 9. Sorting and ID assignment are conceptually one operation -- the stable
#is the post-sort position. Rewrite step 9 as:9. **Sort and number.** Order by severity (P0 first) -> anchor (descending) -> file path -> line number, then assign monotonically increasing # values across the full primary finding set in that sorted order.
The folded version is clearer because it makes the dependency explicit: numbering follows from the sort key, so they belong together.
| - residual actionable queue: unresolved `gated_auto` or `manual` findings whose owner is `downstream-resolver` | ||
| - report-only queue: `advisory` findings plus anything owned by `human` or `release` | ||
| 9. **Sort.** Order by severity (P0 first) -> anchor (descending) -> file path -> line number. | ||
| 9b. **Assign stable finding numbers once.** After sorting, assign monotonically increasing `#` values across the full primary finding set. Do not restart numbering inside each severity table or autofix/routing bucket. If later sections repeat a finding (for example Residual Actionable Work after `safe_auto` fixes are applied), reuse the same stable `#` so users can copy the non-auto-fixable work without remapping IDs. |
There was a problem hiding this comment.
Name the failure mode in prose.
Current text ends with "so users can copy the non-auto-fixable work without remapping IDs." That's generic. Compare to ce-plan's analogous rule (plugins/compound-engineering/skills/ce-plan/SKILL.md:401 and the deepening-vector callout in references/deepening-workflow.md:230,244), which names a specific consumer ("breaks ce-work blocker and verification references"). Suggested replacement:
| 9b. **Assign stable finding numbers once.** After sorting, assign monotonically increasing `#` values across the full primary finding set. Do not restart numbering inside each severity table or autofix/routing bucket. If later sections repeat a finding (for example Residual Actionable Work after `safe_auto` fixes are applied), reuse the same stable `#` so users can copy the non-auto-fixable work without remapping IDs. | |
| 9b. **Assign stable finding numbers once.** After sorting, assign monotonically increasing `#` values across the full primary finding set. Do not restart numbering inside each severity table or autofix/routing bucket. If later sections repeat a finding (for example Residual Actionable Work after `safe_auto` fixes are applied), reuse the same stable `#` so users -- and downstream skills like `ce-resolve-pr-feedback` -- can reference findings by `#` after the autofix loop rewrites the report. Renumbering after autofix invalidates any prior reference: copied snippets, follow-up prompts citing `#3`, or tickets filed against an earlier render. |
Naming the concrete failure mode lets the next reader judge edge cases the rule didn't anticipate.
|
|
||
| 1. **Header.** Scope, intent, mode, reviewer team with per-conditional justifications. | ||
| 2. **Findings.** Rendered as pipe-delimited tables grouped by severity (`### P0 -- Critical`, `### P1 -- High`, `### P2 -- Moderate`, `### P3 -- Low`). Each finding row shows `#`, file, issue, reviewer(s), confidence, and synthesized route. Omit empty severity levels. Never render findings as freeform text blocks or numbered lists. | ||
| 2. **Findings.** Rendered as pipe-delimited tables grouped by severity (`### P0 -- Critical`, `### P1 -- High`, `### P2 -- Moderate`, `### P3 -- Low`). Each finding row shows `#`, file, issue, reviewer(s), confidence, and synthesized route. Omit empty severity levels. Never render findings as freeform text blocks or numbered lists. Finding numbers must come from Stage 5's stable numbering and continue sequentially across severity sections; never restart at `1` for a new severity header. |
There was a problem hiding this comment.
Collapse the duplicated rule -- point back to Stage 5 instead of restating.
This sentence restates the rule already established in step 9b: "Finding numbers must come from Stage 5's stable numbering and continue sequentially across severity sections; never restart at 1 for a new severity header."
ce-plan keeps its stability rule in one canonical paragraph (SKILL.md:401) and references it from elsewhere -- so future edits have one home, not three (autofix bullet at L70, Stage 5 step, this Stage 6 rendering bullet). Suggested replacement:
| 2. **Findings.** Rendered as pipe-delimited tables grouped by severity (`### P0 -- Critical`, `### P1 -- High`, `### P2 -- Moderate`, `### P3 -- Low`). Each finding row shows `#`, file, issue, reviewer(s), confidence, and synthesized route. Omit empty severity levels. Never render findings as freeform text blocks or numbered lists. Finding numbers must come from Stage 5's stable numbering and continue sequentially across severity sections; never restart at `1` for a new severity header. | |
| 2. **Findings.** Rendered as pipe-delimited tables grouped by severity (`### P0 -- Critical`, `### P1 -- High`, `### P2 -- Moderate`, `### P3 -- Low`). Each finding row shows `#`, file, issue, reviewer(s), confidence, and synthesized route. Omit empty severity levels. Never render findings as freeform text blocks or numbered lists. Finding numbers come from the stable assignment in Stage 5 -- never re-derive them per severity table. |
| - **Apply only `safe_auto -> review-fixer` findings.** Leave `gated_auto`, `manual`, `human`, and `release` work unresolved. | ||
| - **Write a run artifact** under `/tmp/compound-engineering/ce-code-review/<run-id>/` summarizing findings, applied fixes, residual actionable work, and advisory outputs. Orchestrators read this artifact to route residual `downstream-resolver` findings; the skill itself does not file tickets or prompt the user in autofix. | ||
| - **Emit a compact Residual Actionable Work summary in the autofix return** listing each residual `downstream-resolver` finding with severity, file:line, title, and autofix_class. Include the run-artifact path. Callers read this summary directly without parsing the artifact. When no residuals exist, state `Residual actionable work: none.` explicitly. | ||
| - **Emit a compact Residual Actionable Work summary in the autofix return** listing each residual `downstream-resolver` finding with its stable `#`, severity, file:line, title, and autofix_class. Keep applied `safe_auto` fixes and residual non-auto findings in separate contiguous sections, and reuse the stable finding numbers from Stage 5 rather than renumbering residuals. Include the run-artifact path. Callers read this summary directly without parsing the artifact. When no residuals exist, state `Residual actionable work: none.` explicitly. |
There was a problem hiding this comment.
Minor: split the two unrelated rules into separate sentences.
The new clause mixes section structure ("keep applied fixes and residuals in separate contiguous sections") with ID reuse ("reuse stable finding numbers") into one sentence. Splitting reads more cleanly:
"Structure the summary as two separate contiguous sections: applied
safe_autofixes first, then residual non-auto findings. Within the residual section, reuse each finding's stable#from Stage 5 -- never renumber."
|
|
||
| - **Pipe-delimited markdown tables** for findings -- never ASCII box-drawing characters or per-finding horizontal-rule separators between entries (the report-level `---` before the verdict is still required) | ||
| - **Severity-grouped sections** -- `### P0 -- Critical`, `### P1 -- High`, `### P2 -- Moderate`, `### P3 -- Low`. Omit empty severity levels. | ||
| - **Stable sequential finding numbers** -- assign finding numbers once after sorting, continue them across severity sections, and reuse those same numbers when findings are repeated in Residual Actionable Work. Do not restart at `1` for each severity or route bucket. |
There was a problem hiding this comment.
Show the rule in the example tables, don't just declare it as a bullet.
This template already teaches by example -- good rendering above (~L70-100) and bad rendering (~L100-114). The new rule appears only as a declarative bullet here. LLMs rendering the report lean on the example more than the bullet.
Two concrete updates to the example section above:
- Update the existing P1 / P2 sample tables so the P2 section starts at
#3(not#1), continuing from the P1 section's two findings -- demonstrating the cross-severity continuation. - Add a small Residual Actionable Work mini-table below that re-cites one of those
#values, demonstrating the post-autofix reuse.
Without an example, the rule is easy to read past.
| test("ce-code-review uses stable sequential finding numbers across grouped output", async () => { | ||
| const content = await readRepoFile("plugins/compound-engineering/skills/ce-code-review/SKILL.md") | ||
| const template = await readRepoFile( | ||
| "plugins/compound-engineering/skills/ce-code-review/references/review-output-template.md", | ||
| ) | ||
|
|
||
| expect(content).toContain("Assign stable finding numbers once") | ||
| expect(content).toContain("Do not restart numbering inside each severity table or autofix/routing bucket") | ||
| expect(content).toContain("reuse the same stable `#`") | ||
| expect(content).toContain("never restart at `1` for a new severity header") | ||
| expect(template).toContain("Stable sequential finding numbers") | ||
| expect(template).toContain("reuse those same numbers when findings are repeated in Residual Actionable Work") | ||
| }) |
There was a problem hiding this comment.
Assertions are too loose -- tighten location anchoring and consider one behavioral check.
Two issues with the current test:
expect(content).toContain("Assign stable finding numbers once")passes even if someone moves the rule into a stale comment, deletes the surrounding step, or adds contradictory guidance elsewhere. The test confirms the rule is written, not that it lives in the right place.- It checks prose presence, not behavior. Compare to
ce-plan's Phase 6 checklist (SKILL.md:802) which asserts structural invariants ("U-IDs are unique within the plan and follow the stability rule").
Two concrete improvements -- either is worth doing; both is better:
(a) Anchor the prose assertions to their location by slicing the file around the Stage 5 / Stage 5b headings so the test fails loudly if the rule is moved out of context:
const stage5 = content.split("### Stage 5b:")[0].split("### Stage 5:")[1]
expect(stage5).toMatch(/Assign stable finding numbers once/)
expect(stage5).toMatch(/reuse the same stable `#`/)(b) Add at least one behavioral check. Even if the rendering itself lives in skill prose (LLM-driven), you can add a fixture in tests/fixtures/ showing a correctly-rendered multi-severity report and assert:
- numbers across
### P0,### P1,### P2,### P3tables are strictly monotonic (never reset to 1) - a finding repeated in a Residual Actionable Work section retains the same
#
A fixture-based assertion exercises a concrete artifact rather than a string in skill prose, and would catch the actual failure mode the rule is trying to prevent.
tmchow
left a comment
There was a problem hiding this comment.
feedback to address posted
Summary
Fixes #743.
Test plan
bun test tests/review-skill-contract.test.ts tests/pipeline-review-contract.test.tsbun run release:validategit diff --check