fix(ce-code-review): tighten autofix_class rubric for safe_auto/gated_auto boundary #695
Merged
Conversation
fix(ce-code-review): tighten autofix_class rubric for safe_auto/gated_auto boundary

Issue #686 hypothesized that personas under-classify findings as `safe_auto`, making `ce-work`'s headless auto-apply weaker than it could be. A 60-trial synthetic-fixture eval (workspace at /tmp/safe-auto-eval/) found the hypothesis approximately wrong: post-#685 personas already classify textbook mechanical cases (nil guards, off-by-ones with parallel patterns, explicit dead code, local helper extraction, missing tests) as `safe_auto`. 6 of 9 fixture shapes show identical classification across baseline and tightened rubric.

What the rubric tightening actually does is reduce *variance* on cases where the previous wording was genuinely ambiguous. The headline win is on orphan code without an explicit "no callers" annotation: the baseline rubric produced manual / safe_auto / gated_auto across 4 trials on the same input (essentially random); the tightened rubric pins it deterministically to gated_auto by giving the persona a clearer test ("the surrounding refactor obviously displaces it" requires positive signal, which absence-of-comment fixtures lack).

The trade-off: cross-file Rails service extraction goes from baseline `safe_auto` (4/4) to tightened `gated_auto` (6/7). Both classifications are internally defensible: the baseline's matches the rubric's "extracting a duplicated helper" example; the tightened one catches "Rails service-layering placement is a design conversation." The tightened rubric picks the more conservative reading, matching what a careful operator would want before auto-applying a cross-file architectural extraction.

Net effect: variance reduction on ambiguous fixtures, no movement on textbook ones, one stable defensible disagreement on cross-file extraction.

Two files changed:

- subagent-template.md autofix_class decision guide (~138-160): added a one-sentence symmetry-of-error framing, an operational test for `safe_auto` (one-sentence fix, no "depends on" clauses, no contract / permission / signature / module-boundary change; sketched after this message), four "boundary cases that often feel risky but are still safe_auto" examples (nil guards, off-by-ones, dead code, helper extraction with the cross-file discriminator), and a "do not default to gated_auto" anti-pattern guard parallel to the existing "do not default to advisory" guard.
- findings-schema.json autofix_class field description: replaced the terse "Reviewer's conservative recommendation" with an operational summary mirroring the subagent-template wording.

Tests: 910/910 pass (1 new test added with 12 assertions for the new rubric language). release:validate clean.

Calibration writeup: docs/solutions/skill-design/safe-auto-rubric-calibration-2026-04-25.md documents the eval methodology, results, and the methodological lesson (rubric calibrations should be evaluated for variance reduction first, classification-rate-shift second).

Closes #686

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
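A minimal TypeScript sketch of the operational test, with hypothetical names (the shipped rubric is prose in subagent-template.md, not code):

```typescript
// Hedged sketch: the rubric's operational test for safe_auto as a predicate.
// Finding, fixSummary, touchesContract, and crossesModuleBoundary are all
// hypothetical names introduced for illustration.
interface Finding {
  fixSummary: string;            // one-sentence description of the proposed fix
  touchesContract: boolean;      // contract / permission / signature change
  crossesModuleBoundary: boolean;
}

function passesSafeAutoTest(f: Finding): boolean {
  const oneSentence = !f.fixSummary.includes(". "); // crude single-sentence check
  const noDependsOn = !/\bdepends on\b/i.test(f.fixSummary);
  return oneSentence && noDependsOn && !f.touchesContract && !f.crossesModuleBoundary;
}

// A textbook nil guard passes; a cross-file service extraction would not.
passesSafeAutoTest({
  fixSummary: "Add a nil guard before dereferencing the profile.",
  touchesContract: false,
  crossesModuleBoundary: false,
}); // => true
```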
…bration writeup

/ce-compound surfaced two methodological lessons from this calibration that generalize beyond safe_auto: (1) measure variance reduction, not just classification-rate-shift, when evaluating a persona-rubric prompt change; (2) a synthetic-fixture eval harness with N>=3 trials per cell is the right tier between "ship and watch" and "stare at the diff" (sketched after this message).

The Related Docs Finder scored both candidate new docs at 5/5 HIGH overlap with this writeup, so per /ce-compound's overlap rule the right move is to fold the content in here rather than create new files that would inevitably drift apart.

Restructured the writeup to:

- Promote "Why this writeup matters more than the prompt change" to a named "Methodological lesson 1: variance reduction beats classification-rate-shift" section with the full N=1-misleads argument, three-tier evidence hierarchy, worked F3b example showing how three N=1 reads can produce three contradictory stories on the same fixture, and practical N>=3 rules.
- Promote "Eval reproducibility" to a named "Methodological lesson 2: validating persona-rubric prompt changes before shipping" section with the workspace pattern, persona-runner contract, fixture-matrix taxonomy (textbook positive / textbook negative / negative control / ambiguous boundary / stable-disagreement candidate), and step-by-step apply guide generalized beyond safe_auto.
- Add cross-reference to ce-doc-review-calibration-patterns-2026-04-19.md's "Reviewer variance is inherent" section as the precedent in this repo for the variance-as-noise warning.
- Add `last_updated: 2026-04-25` and additional tags (eval-methodology, variance) per the /ce-compound update convention.

Tests: 22/22 ce-code-review contract pass. release:validate clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
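The N>=3 harness tier reduces to a small loop. A sketch under assumed names (`runPersona` stands in for the real persona-runner contract):

```typescript
// Hedged sketch: run one fixture N>=3 times per rubric variant and report the
// spread of classifications. Variance reduction means the map shrinks toward
// a single entry; a rate shift means the modal class changes.
type AutofixClass = "safe_auto" | "gated_auto" | "manual" | "advisory";

async function classificationSpread(
  runPersona: (fixture: string, rubric: string) => Promise<AutofixClass>,
  fixture: string,
  rubric: string,
  trials = 4,
): Promise<Map<AutofixClass, number>> {
  const counts = new Map<AutofixClass, number>();
  for (let i = 0; i < trials; i++) {
    const cls = await runPersona(fixture, rubric);
    counts.set(cls, (counts.get(cls) ?? 0) + 1);
  }
  return counts; // size 1 => deterministic; size 3 across 4 trials => essentially random
}
```

Comparing `counts.size` between baseline and tightened rubric measures variance reduction; comparing the modal class measures classification-rate shift.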
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0011b0548a
…omment truncation

Codex review on PR #695 caught that `related_pr: PR #685 ...` in the calibration writeup parses as just `'PR'`: YAML treats an unquoted ` #` as a comment delimiter, so everything from the `#` onward was silently dropped. Any tooling that indexes or renders `related_pr` would lose the linkage. Fix: quote the value (minimal repro after this message).

A repo-wide sweep for the same pattern in docs/ frontmatter found one other instance with the same risk (and an additional unquoted-colon issue compounding it): docs/plans/2026-04-16-001-fix-ce-polish-beta-detection-gaps-plan.md. Quoted that one too while in here.

Verified: both files now parse correctly via yaml.safe_load. A re-sweep across docs/ frontmatter shows zero remaining ` #` instances.

Tests: 910/910 pass. release:validate clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
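For anyone who has not hit this YAML footgun before, a minimal repro (shown with the npm `yaml` package; the commit's own verification used Python's yaml.safe_load):

```typescript
import { parse } from "yaml";

// Unquoted ' #' starts a YAML comment, so the value is silently truncated.
parse("related_pr: PR #685 ...");
// => { related_pr: "PR" }

// Quoting the scalar preserves the full value.
parse('related_pr: "PR #685 ..."');
// => { related_pr: "PR #685 ..." }
```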
michaelvolz pushed a commit to michaelvolz/compound-engineering-plugin-windows-version that referenced this pull request on Apr 28, 2026:
fix(ce-code-review): tighten autofix_class rubric for safe_auto/gated_auto boundary (EveryInc#695) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #686.
TL;DR
Issue #686 hypothesized that personas under-classify findings as `safe_auto`, making `ce-work`'s headless auto-apply weaker than it could be. The hypothesis turned out to be approximately wrong: a 60-trial synthetic-fixture eval shows the post-#685 baseline already classifies textbook mechanical cases as `safe_auto`. 6 of 9 fixture shapes are identical across baseline and tightened rubric.

The actual win the eval surfaced is variance reduction on ambiguous cases, not a `safe_auto`-rate increase. Headline: orphan code without an explicit "no callers" annotation. Baseline produced 3 different classifications across 4 trials on the same input (essentially random); tightened pins it deterministically to `gated_auto`.

What changed
- `subagent-template.md`: operational test for `safe_auto`, four "boundary cases that often feel risky but are still safe_auto" examples, anti-default guard for `gated_auto` parallel to the existing one for `advisory`.
- `findings-schema.json`: `autofix_class` field description rewritten to mirror the operational test from the subagent template (see the sketch after this list).
- `tests/review-skill-contract.test.ts`: one new contract test with 12 assertions for the new rubric language.
- `docs/solutions/skill-design/safe-auto-rubric-calibration-2026-04-25.md`: full calibration writeup.
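A hedged sketch of the updated schema field, written as a TypeScript const for illustration (the enum values are the classes named in this PR; the description text is paraphrased, not quoted from findings-schema.json):

```typescript
// Hypothetical shape only; see findings-schema.json for the shipped wording.
const autofixClassField = {
  type: "string",
  enum: ["safe_auto", "gated_auto", "manual", "advisory"],
  description:
    "safe_auto only when the fix is one sentence, has no 'depends on' " +
    "clauses, and changes no contract, permission, signature, or module " +
    "boundary; do not default to gated_auto.",
} as const;
```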
Eval results: 60 trials, 9 fixture shapes

[Results table: baseline vs. tightened classification for each of the 9 fixture shapes, including a `min_by` semantic-bug fixture; full numbers in the calibration writeup.]

The F4b trade-off
Cross-file extraction of two service objects with identical bodies. Baseline picks `safe_auto` (matches the "extracting a duplicated helper" example). Tightened picks `gated_auto` (matches "naming/placement requires a design conversation"; Rails service-layering placement is a real architectural call). Both are internally consistent.

Tightened picks the conservative reading, meaning `ce-work`'s headless mode will flag cross-file extraction for user review instead of auto-applying it. For careful operators that's the right call; for autonomous bulk-refactor flows it's modestly more friction. Documented as a known trade-off in the calibration writeup.
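The cross-file discriminator behind that choice fits in a couple of lines (hypothetical names, illustration only):

```typescript
// Hedged sketch: same-file helper extraction can stay safe_auto; an extraction
// touching more than one file (e.g. Rails service-layer placement) is a design
// conversation and drops to gated_auto.
function extractionClass(filesTouched: ReadonlySet<string>): "safe_auto" | "gated_auto" {
  return filesTouched.size > 1 ? "gated_auto" : "safe_auto";
}

extractionClass(new Set(["app/models/order.rb"]));                    // => "safe_auto"
extractionClass(new Set(["app/services/a.rb", "app/services/b.rb"])); // => "gated_auto"
```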
What this PR doesn't do
The eval is single-persona on synthetic fixtures. Real reviews run multiple personas through synthesis with conservative tie-breaks; synthesis-layer effects could amplify or dampen what the persona-side eval shows. If a `safe_auto` underclassification incident recurs on a real branch (the original "8 findings to tickets" story), that's evidence for another iteration.
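As context for the synthesis caveat, one way a conservative tie-break could look (hypothetical sketch; the ordering of classes by conservativeness is an assumption, and the real synthesis layer is not part of this PR):

```typescript
// Hedged sketch: when personas disagree, take the most conservative class.
type AutofixClass = "safe_auto" | "gated_auto" | "manual" | "advisory";
const order: AutofixClass[] = ["safe_auto", "gated_auto", "manual", "advisory"];

function synthesize(votes: AutofixClass[]): AutofixClass {
  return votes.reduce((worst, v) =>
    order.indexOf(v) > order.indexOf(worst) ? v : worst
  );
}

synthesize(["safe_auto", "safe_auto", "gated_auto"]); // => "gated_auto"
```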
Test plan

- `bun test`: 910/910 pass (1 new contract test, 12 new assertions)
- `bun run release:validate`: clean

References
- fix(ce-code-review): replace LFG with best-judgment auto-resolve (the suggested_fix push this builds on)
- docs/solutions/skill-design/safe-auto-rubric-calibration-2026-04-25.md: full calibration writeup
- docs/solutions/skill-design/confidence-anchored-scoring-2026-04-21.md: the anchored confidence rubric this shares stylistic conventions with

🤖 Generated with Claude Code