
fix(ce-code-review): tighten autofix_class rubric for safe_auto/gated_auto boundary #695

Merged

tmchow merged 3 commits into main from tmchow/safe-auto-calibration on Apr 26, 2026
Conversation

@tmchow (Collaborator) commented on Apr 26, 2026

Closes #686.

TL;DR

Issue #686 hypothesized that personas under-classify findings as safe_auto, making ce-work's headless auto-apply weaker than it could be. The hypothesis turned out to be approximately wrong: a 60-trial synthetic-fixture eval shows the post-#685 baseline already classifies textbook mechanical cases as safe_auto, and six of the nine fixture shapes classify identically under the baseline and tightened rubrics.

The actual win the eval surfaced is variance reduction on ambiguous cases, not a safe_auto-rate increase. Headline: orphan code without an explicit "no callers" annotation. Baseline produced 3 different classifications across 4 trials on the same input (essentially random); tightened pins it deterministically to gated_auto.

What changed

| File | Lines | Description |
| --- | --- | --- |
| subagent-template.md | +14 / −6 | Decision-guide expansion: one-sentence symmetry-of-error framing, operational test for safe_auto, four "boundary cases that often feel risky but are still safe_auto" examples, anti-default guard for gated_auto parallel to the existing one for advisory. |
| findings-schema.json | +1 / −1 | autofix_class field description rewritten to mirror the operational test from the subagent template. |
| tests/review-skill-contract.test.ts | +28 / −0 | New test asserting the rubric's new structural elements (boundary cases, operational-test phrasing, anti-default guards). |
| docs/solutions/skill-design/safe-auto-rubric-calibration-2026-04-25.md | +127 / −0 | Calibration writeup: eval methodology, 60-trial results matrix, the methodological lesson (calibrations should be evaluated for variance reduction first, rate-shift second), reproducibility notes. |

Eval results — 60 trials, 9 fixture shapes

| Fixture | Baseline | Tightened | Verdict |
| --- | --- | --- | --- |
| F1 — internal nil guard | 3/3 safe_auto | 3/3 safe_auto | identical |
| F1b — cart min_by semantic bug | 3/3 safe_auto | 3/3 safe_auto | identical |
| F2 — off-by-one + parallel pattern | 3/3 safe_auto | 3/3 safe_auto | identical |
| F3 — dead code w/ "no callers" comment | 3/3 safe_auto | 3/3 safe_auto | identical |
| F4 — local helper extract | 2/3 safe_auto, 1/3 advisory | 3/3 safe_auto | tightened reduces variance |
| F3b — orphan code, no explicit comment | manual / safe_auto / gated_auto / safe_auto (4 trials, 3 classes) | 7/7 gated_auto | tightened dramatically reduces variance |
| F4b — cross-file Rails service extract | 4/4 safe_auto | 6/7 gated_auto, 1/7 advisory | stable disagreement, both defensible |
| F5 — missing test for new public method | 3/3 safe_auto | 3/3 safe_auto | identical |
| F6 — admin auth gate (negative control) | 1/1 gated_auto | 1/1 gated_auto | identical (correctly stable) |
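
For anyone re-running the numbers, here is a minimal sketch of the variance tally behind the Verdict column. The `Trial` record shape and `tallyVariance` helper are assumptions for illustration, not the eval harness's actual API:

```ts
// Illustrative trial record; the real harness's output shape may differ.
type Trial = { fixture: string; rubric: "baseline" | "tightened"; autofixClass: string };

// Count classifications per (fixture, rubric) cell. A cell with one distinct
// class is stable; more than one means the rubric wording is ambiguous there.
function tallyVariance(trials: Trial[]): Map<string, Map<string, number>> {
  const cells = new Map<string, Map<string, number>>();
  for (const t of trials) {
    const key = `${t.fixture}/${t.rubric}`;
    const dist = cells.get(key) ?? new Map<string, number>();
    dist.set(t.autofixClass, (dist.get(t.autofixClass) ?? 0) + 1);
    cells.set(key, dist);
  }
  return cells;
}

// e.g. F3b/baseline  → { manual: 1, safe_auto: 2, gated_auto: 1 } (3 classes)
//      F3b/tightened → { gated_auto: 7 } (1 class) — the variance win above
```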

The F4b trade-off

Cross-file extraction of two service objects with identical bodies. Baseline picks safe_auto (matching the rubric's "extracting a duplicated helper" example). Tightened picks gated_auto (matching "naming/placement requires a design conversation"; Rails service-layering placement is a real architectural call). Both are internally consistent.

Tightened picks the conservative reading, meaning ce-work's headless flow will flag cross-file extraction for user review instead of auto-applying it. For careful operators that's the right call; for autonomous bulk-refactor flows it's modestly more friction. Documented as a known trade-off in the calibration writeup.

What this PR doesn't do

The eval is single-persona on synthetic fixtures. Real reviews run multiple personas through synthesis with conservative tie-breaks; synthesis-layer effects could amplify or dampen what the persona-side eval shows. If a safe_auto underclassification incident recurs on a real branch (the original "8 findings to tickets" story), that's evidence for another iteration.

Test plan

  • bun test — 910/910 pass (1 new contract test, 12 new assertions)
  • bun run release:validate — clean
  • Confirmed: the new rubric language is referenced in tests so future drift fails the contract test
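
The shipped contract test isn't reproduced here, but a sketch of the shape such a drift guard can take (the probed phrasings are paraphrased from this PR's description, not the verbatim rubric text):

```ts
import { describe, expect, it } from "bun:test";
import { readFileSync } from "node:fs";

// Illustrative sketch only: the real test is tests/review-skill-contract.test.ts,
// and the exact strings it pins are the shipped rubric's, not these.
const template = readFileSync("subagent-template.md", "utf8");

describe("autofix_class rubric contract (sketch)", () => {
  it("keeps the safe_auto operational test", () => {
    expect(template).toContain("one-sentence fix");
  });

  it("keeps the gated_auto anti-default guard", () => {
    expect(template).toContain("do not default to gated_auto");
  });
});
```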

References

  • fix(ce-code-review): replace LFG with best-judgment auto-resolve (#685) — the suggested_fix push this builds on
  • docs/solutions/skill-design/safe-auto-rubric-calibration-2026-04-25.md — full calibration writeup
  • docs/solutions/skill-design/confidence-anchored-scoring-2026-04-21.md — the anchored confidence rubric this shares stylistic conventions with

🤖 Generated with Claude Code

tmchow and others added 2 commits April 25, 2026 20:07
…_auto boundary

Issue #686 hypothesized that personas under-classify findings as `safe_auto`,
making `ce-work`'s headless auto-apply weaker than it could be. A 60-trial
synthetic-fixture eval (workspace at /tmp/safe-auto-eval/) found the
hypothesis approximately wrong: post-#685 personas already classify textbook
mechanical cases (nil guards, off-by-ones with parallel patterns, explicit
dead code, local helper extraction, missing tests) as `safe_auto`. 6 of 9
fixture shapes show identical classification across baseline and tightened
rubric.

What the rubric tightening actually does is reduce VARIANCE on cases where
the previous wording was genuinely ambiguous. The headline win is on orphan
code without explicit "no callers" annotation: baseline rubric produced
manual / safe_auto / gated_auto across 4 trials on the same input
(essentially random); tightened rubric pins it deterministically to
gated_auto by giving the persona a clearer test ("the surrounding refactor
obviously displaces it" requires positive signal, which absence-of-comment
fixtures lack).
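
Paraphrasing the elements listed under "Two files changed" below (a reconstruction for illustration, not the verbatim shipped wording), the tightened guide reads roughly like:

```
Operational test for safe_auto: the fix is one sentence, has no "depends on"
clauses, and changes no contract, permission, signature, or module boundary.

Dead-code deletion is safe_auto only on positive signal: an explicit
"no callers" annotation, or a surrounding refactor that obviously displaces
the code. Absence of a comment is not evidence; classify it gated_auto.

Do not default to gated_auto when the operational test passes (parallel to
the existing "do not default to advisory" guard).
```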

The trade-off: cross-file Rails service extraction goes from baseline
`safe_auto` (4/4) to tightened `gated_auto` (6/7). Both classifications are
internally defensible — the baseline's matches the rubric's "extracting a
duplicated helper" example; the tightened catches "Rails service-layering
placement is a design conversation." The tightened picks the more
conservative reading, matching what a careful operator would want before
auto-applying a cross-file architectural extraction.

Net effect: variance reduction on ambiguous fixtures, no movement on
textbook ones, one stable defensible disagreement on cross-file extraction.

Two files changed:

- subagent-template.md autofix_class decision guide (~138-160): added
  one-sentence symmetry-of-error framing, an operational test for
  `safe_auto` (one-sentence fix, no "depends on" clauses, no contract /
  permission / signature / module-boundary change), four "boundary cases
  that often feel risky but are still safe_auto" examples (nil guards,
  off-by-ones, dead code, helper extraction with the cross-file
  discriminator), and a "do not default to gated_auto" anti-pattern guard
  parallel to the existing "do not default to advisory" guard.

- findings-schema.json autofix_class field description: replaced terse
  "Reviewer's conservative recommendation" with operational summary
  mirroring the subagent-template wording.
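
As a hypothetical sketch of the result (field structure and enum order are illustrative; only the description string is what this commit actually rewrites):

```json
{
  "autofix_class": {
    "type": "string",
    "enum": ["safe_auto", "gated_auto", "advisory", "manual"],
    "description": "safe_auto only when the fix passes the operational test: one sentence, no 'depends on' clauses, no contract / permission / signature / module-boundary change. Otherwise gated_auto (mechanical but human-gated), advisory, or manual."
  }
}
```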

Tests: 910/910 pass (1 new test added with 12 assertions for the new rubric
language). release:validate clean.

Calibration writeup: docs/solutions/skill-design/safe-auto-rubric-calibration-2026-04-25.md
documents the eval methodology, results, and the methodological lesson
(rubric calibrations should be evaluated for variance reduction first,
classification-rate-shift second).

Closes #686

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bration writeup

/ce-compound surfaced two methodological lessons from this calibration that
generalize beyond safe_auto: (1) measure variance reduction, not just
classification-rate-shift, when evaluating a persona-rubric prompt change;
(2) a synthetic-fixture eval harness with N>=3 trials per cell is the right
tier between "ship and watch" and "stare at the diff."
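
A sketch of that middle tier, with runPersona as a stand-in for the real persona invocation (the actual harness lived in /tmp/safe-auto-eval/ and is not checked in):

```ts
// Hypothetical harness loop, not the actual eval code.
const rubrics = ["baseline", "tightened"] as const;
const fixtures = ["F1", "F1b", "F2", "F3", "F3b", "F4", "F4b", "F5", "F6"];
const TRIALS = 3; // N >= 3 per cell

async function runPersona(rubric: string, fixture: string): Promise<string> {
  // Stub: invoke the persona prompt with the rubric variant against the
  // fixture diff and parse out autofix_class.
  throw new Error(`not implemented: ${rubric}/${fixture}`);
}

for (const rubric of rubrics) {
  for (const fixture of fixtures) {
    const classes = new Set<string>();
    for (let i = 0; i < TRIALS; i++) {
      classes.add(await runPersona(rubric, fixture));
    }
    // More than one distinct class in a cell means the wording is ambiguous
    // for that fixture shape; run more trials there before concluding anything.
    console.log(`${rubric} ${fixture}: ${[...classes].join(" / ")}`);
  }
}
```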

The Related Docs Finder scored both candidate new docs at 5/5 HIGH overlap
with this writeup, so per /ce-compound's overlap rule the right move is to
fold the content in here rather than create new files that would inevitably
drift apart.

Restructured the writeup to:
- Promote "Why this writeup matters more than the prompt change" to a named
  "Methodological lesson 1: variance reduction beats classification-rate-shift"
  section with the full N=1-misleads argument, three-tier evidence hierarchy,
  worked F3b example showing how three N=1 reads can produce three
  contradictory stories on the same fixture, and practical N>=3 rules.
- Promote "Eval reproducibility" to a named "Methodological lesson 2:
  validating persona-rubric prompt changes before shipping" section with the
  workspace pattern, persona-runner contract, fixture-matrix taxonomy
  (textbook positive / textbook negative / negative control / ambiguous
  boundary / stable-disagreement candidate), and step-by-step apply guide
  generalized beyond safe_auto.
- Add cross-reference to ce-doc-review-calibration-patterns-2026-04-19.md's
  "Reviewer variance is inherent" section as the precedent in this repo for
  the variance-as-noise warning.
- Add `last_updated: 2026-04-25` and additional tags (eval-methodology,
  variance) per the /ce-compound update convention.

Tests: 22/22 ce-code-review contract pass. release:validate clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0011b0548a


Comment thread on docs/solutions/skill-design/safe-auto-rubric-calibration-2026-04-25.md (outdated)
…omment truncation

Codex review on PR #695 caught that `related_pr: PR #685 ...` in the
calibration writeup parses as just `'PR'` — YAML treats unquoted ' #' as
a comment delimiter, so everything from the `#` onward was silently
dropped. Any tooling that indexes or renders `related_pr` would lose
the linkage.

Fix: quote the value. Repo-wide sweep for the same pattern in docs/
frontmatter found one other instance with the same risk (and an
additional unquoted-colon issue compounding it):
docs/plans/2026-04-16-001-fix-ce-polish-beta-detection-gaps-plan.md.
Quoted that one too while in here.

Verified: both files now parse correctly via yaml.safe_load. Re-sweep
across docs/ frontmatter shows zero remaining ' #' instances.
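
For illustration, the failure mode and the fix (the trailing text after #685 in the real frontmatter is elided here):

```yaml
# Before: in a plain scalar, YAML treats " #" as the start of a comment,
# so everything from "#685" onward is dropped and the value parses as "PR".
related_pr: PR #685

# After: quoting preserves the full string.
related_pr: "PR #685"
```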

Tests: 910/910 pass. release:validate clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tmchow merged commit ad9577e into main on Apr 26, 2026
2 checks passed
github-actions (Bot) mentioned this pull request on Apr 26, 2026
michaelvolz pushed a commit to michaelvolz/compound-engineering-plugin-windows-version that referenced this pull request Apr 28, 2026
…_auto boundary (EveryInc#695)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Calibrate safe_auto vs gated_auto boundary in ce-code-review persona output
