
fix(ce-code-review): tighten autofix_class rubric for safe_auto/gated_auto boundary #695

Merged

tmchow merged 3 commits into main from tmchow/safe-auto-calibration on Apr 26, 2026
Conversation

@tmchow (Collaborator) commented on Apr 26, 2026

Closes #686.

TL;DR

Issue #686 hypothesized that personas under-classify findings as safe_auto, making ce-work's headless auto-apply weaker than it could be. The hypothesis turned out to be approximately wrong: a 60-trial synthetic-fixture eval shows the post-#685 baseline already classifies textbook mechanical cases as safe_auto, and six of the nine fixture shapes classify identically under the baseline and tightened rubrics.

The actual win the eval surfaced is variance reduction on ambiguous cases, not a safe_auto-rate increase. Headline: orphan code without an explicit "no callers" annotation. Baseline produced 3 different classifications across 4 trials on the same input (essentially random); tightened pins it deterministically to gated_auto.

What changed

| File | Lines | Description |
| --- | --- | --- |
| subagent-template.md | +14 / −6 | Decision-guide expansion: one-sentence symmetry-of-error framing, operational test for safe_auto, four "boundary cases that often feel risky but are still safe_auto" examples, anti-default guard for gated_auto parallel to the existing one for advisory. |
| findings-schema.json | +1 / −1 | autofix_class field description rewritten to mirror the operational test from the subagent template. |
| tests/review-skill-contract.test.ts | +28 / −0 | New test asserting the rubric's new structural elements (boundary cases, operational-test phrasing, anti-default guards). |
| docs/solutions/skill-design/safe-auto-rubric-calibration-2026-04-25.md | +127 / −0 | Calibration writeup: eval methodology, 60-trial results matrix, the methodological lesson (calibrations should be evaluated for variance reduction first, rate-shift second), reproducibility notes. |

Eval results — 60 trials, 9 fixture shapes

| Fixture | Baseline | Tightened | Verdict |
| --- | --- | --- | --- |
| F1 — internal nil guard | 3/3 safe_auto | 3/3 safe_auto | identical |
| F1b — cart min_by semantic bug | 3/3 safe_auto | 3/3 safe_auto | identical |
| F2 — off-by-one + parallel pattern | 3/3 safe_auto | 3/3 safe_auto | identical |
| F3 — dead code w/ "no callers" comment | 3/3 safe_auto | 3/3 safe_auto | identical |
| F4 — local helper extract | 2/3 safe_auto, 1/3 advisory | 3/3 safe_auto | tightened reduces variance |
| F3b — orphan code, no explicit comment | manual / safe_auto / gated_auto / safe_auto (4 trials, 3 classes) | 7/7 gated_auto | tightened dramatically reduces variance |
| F4b — cross-file Rails service extract | 4/4 safe_auto | 6/7 gated_auto, 1/7 advisory | stable disagreement, both defensible |
| F5 — missing test for new public method | 3/3 safe_auto | 3/3 safe_auto | identical |
| F6 — admin auth gate (negative control) | 1/1 gated_auto | 1/1 gated_auto | identical (correctly stable) |
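
For anyone re-running the numbers, here is a minimal sketch of the variance tally behind the Verdict column. The `Trial` record shape and `tallyVariance` helper are assumptions for illustration, not the eval harness's actual API:

```ts
// Illustrative trial record; the real harness's output shape may differ.
type Trial = { fixture: string; rubric: "baseline" | "tightened"; autofixClass: string };

// Count classifications per (fixture, rubric) cell. A cell with one distinct
// class is stable; more than one means the rubric wording is ambiguous there.
function tallyVariance(trials: Trial[]): Map<string, Map<string, number>> {
  const cells = new Map<string, Map<string, number>>();
  for (const t of trials) {
    const key = `${t.fixture}/${t.rubric}`;
    const dist = cells.get(key) ?? new Map<string, number>();
    dist.set(t.autofixClass, (dist.get(t.autofixClass) ?? 0) + 1);
    cells.set(key, dist);
  }
  return cells;
}

// e.g. F3b/baseline  → { manual: 1, safe_auto: 2, gated_auto: 1 } (3 classes)
//      F3b/tightened → { gated_auto: 7 } (1 class) — the variance win above
```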

The F4b trade-off

Cross-file extraction of two service objects with identical bodies. Baseline picks safe_auto (matching the rubric's "extracting a duplicated helper" example). Tightened picks gated_auto (matching "naming/placement requires a design conversation"; Rails service-layering placement is a real architectural call). Both are internally consistent.

Tightened picks the conservative reading, meaning ce-work's headless flow will flag cross-file extraction for user review instead of auto-applying it. For careful operators that's the right call; for autonomous bulk-refactor flows it's modestly more friction. Documented as a known trade-off in the calibration writeup.

What this PR doesn't do

The eval is single-persona on synthetic fixtures. Real reviews run multiple personas through synthesis with conservative tie-breaks; synthesis-layer effects could amplify or dampen what the persona-side eval shows. If a safe_auto underclassification incident recurs on a real branch (the original "8 findings to tickets" story), that's evidence for another iteration.

Test plan

  • bun test — 910/910 pass (1 new contract test, 12 new assertions)
  • bun run release:validate — clean
  • Confirmed: the new rubric language is referenced in tests so future drift fails the contract test
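
The shipped contract test isn't reproduced here, but a sketch of the shape such a drift guard can take (the probed phrasings are paraphrased from this PR's description, not the verbatim rubric text):

```ts
import { describe, expect, it } from "bun:test";
import { readFileSync } from "node:fs";

// Illustrative sketch only: the real test is tests/review-skill-contract.test.ts,
// and the exact strings it pins are the shipped rubric's, not these.
const template = readFileSync("subagent-template.md", "utf8");

describe("autofix_class rubric contract (sketch)", () => {
  it("keeps the safe_auto operational test", () => {
    expect(template).toContain("one-sentence fix");
  });

  it("keeps the gated_auto anti-default guard", () => {
    expect(template).toContain("do not default to gated_auto");
  });
});
```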

References

  • fix(ce-code-review): replace LFG with best-judgment auto-resolve (#685) — the suggested_fix push this builds on
  • docs/solutions/skill-design/safe-auto-rubric-calibration-2026-04-25.md — full calibration writeup
  • docs/solutions/skill-design/confidence-anchored-scoring-2026-04-21.md — the anchored confidence rubric this shares stylistic conventions with

🤖 Generated with Claude Code

tmchow and others added 2 commits April 25, 2026 20:07
…_auto boundary

Issue #686 hypothesized that personas under-classify findings as `safe_auto`,
making `ce-work`'s headless auto-apply weaker than it could be. A 60-trial
synthetic-fixture eval (workspace at /tmp/safe-auto-eval/) found the
hypothesis approximately wrong: post-#685 personas already classify textbook
mechanical cases (nil guards, off-by-ones with parallel patterns, explicit
dead code, local helper extraction, missing tests) as `safe_auto`. 6 of 9
fixture shapes show identical classification across baseline and tightened
rubric.

What the rubric tightening actually does is reduce VARIANCE on cases where
the previous wording was genuinely ambiguous. The headline win is on orphan
code without explicit "no callers" annotation: baseline rubric produced
manual / safe_auto / gated_auto across 4 trials on the same input
(essentially random); tightened rubric pins it deterministically to
gated_auto by giving the persona a clearer test ("the surrounding refactor
obviously displaces it" requires positive signal, which absence-of-comment
fixtures lack).
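
Paraphrasing the elements listed under "Two files changed" below (a reconstruction for illustration, not the verbatim shipped wording), the tightened guide reads roughly like:

```
Operational test for safe_auto: the fix is one sentence, has no "depends on"
clauses, and changes no contract, permission, signature, or module boundary.

Dead-code deletion is safe_auto only on positive signal: an explicit
"no callers" annotation, or a surrounding refactor that obviously displaces
the code. Absence of a comment is not evidence; classify it gated_auto.

Do not default to gated_auto when the operational test passes (parallel to
the existing "do not default to advisory" guard).
```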

The trade-off: cross-file Rails service extraction goes from baseline
`safe_auto` (4/4) to tightened `gated_auto` (6/7). Both classifications are
internally defensible — the baseline's matches the rubric's "extracting a
duplicated helper" example; the tightened catches "Rails service-layering
placement is a design conversation." The tightened picks the more
conservative reading, matching what a careful operator would want before
auto-applying a cross-file architectural extraction.

Net effect: variance reduction on ambiguous fixtures, no movement on
textbook ones, one stable defensible disagreement on cross-file extraction.

Two files changed:

- subagent-template.md autofix_class decision guide (~138-160): added
  one-sentence symmetry-of-error framing, an operational test for
  `safe_auto` (one-sentence fix, no "depends on" clauses, no contract /
  permission / signature / module-boundary change), four "boundary cases
  that often feel risky but are still safe_auto" examples (nil guards,
  off-by-ones, dead code, helper extraction with the cross-file
  discriminator), and a "do not default to gated_auto" anti-pattern guard
  parallel to the existing "do not default to advisory" guard.

- findings-schema.json autofix_class field description: replaced terse
  "Reviewer's conservative recommendation" with operational summary
  mirroring the subagent-template wording.
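
As a hypothetical sketch of the result (field structure and enum order are illustrative; only the description string is what this commit actually rewrites):

```json
{
  "autofix_class": {
    "type": "string",
    "enum": ["safe_auto", "gated_auto", "advisory", "manual"],
    "description": "safe_auto only when the fix passes the operational test: one sentence, no 'depends on' clauses, no contract / permission / signature / module-boundary change. Otherwise gated_auto (mechanical but human-gated), advisory, or manual."
  }
}
```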

Tests: 910/910 pass (1 new test added with 12 assertions for the new rubric
language). release:validate clean.

Calibration writeup: docs/solutions/skill-design/safe-auto-rubric-calibration-2026-04-25.md
documents the eval methodology, results, and the methodological lesson
(rubric calibrations should be evaluated for variance reduction first,
classification-rate-shift second).

Closes #686

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bration writeup

/ce-compound surfaced two methodological lessons from this calibration that
generalize beyond safe_auto: (1) measure variance reduction, not just
classification-rate-shift, when evaluating a persona-rubric prompt change;
(2) a synthetic-fixture eval harness with N>=3 trials per cell is the right
tier between "ship and watch" and "stare at the diff."
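
A sketch of that middle tier, with runPersona as a stand-in for the real persona invocation (the actual harness lived in /tmp/safe-auto-eval/ and is not checked in):

```ts
// Hypothetical harness loop, not the actual eval code.
const rubrics = ["baseline", "tightened"] as const;
const fixtures = ["F1", "F1b", "F2", "F3", "F3b", "F4", "F4b", "F5", "F6"];
const TRIALS = 3; // N >= 3 per cell

async function runPersona(rubric: string, fixture: string): Promise<string> {
  // Stub: invoke the persona prompt with the rubric variant against the
  // fixture diff and parse out autofix_class.
  throw new Error(`not implemented: ${rubric}/${fixture}`);
}

for (const rubric of rubrics) {
  for (const fixture of fixtures) {
    const classes = new Set<string>();
    for (let i = 0; i < TRIALS; i++) {
      classes.add(await runPersona(rubric, fixture));
    }
    // More than one distinct class in a cell means the wording is ambiguous
    // for that fixture shape; run more trials there before concluding anything.
    console.log(`${rubric} ${fixture}: ${[...classes].join(" / ")}`);
  }
}
```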

The Related Docs Finder scored both candidate new docs at 5/5 HIGH overlap
with this writeup, so per /ce-compound's overlap rule the right move is to
fold the content in here rather than create new files that would inevitably
drift apart.

Restructured the writeup to:
- Promote "Why this writeup matters more than the prompt change" to a named
  "Methodological lesson 1: variance reduction beats classification-rate-shift"
  section with the full N=1-misleads argument, three-tier evidence hierarchy,
  worked F3b example showing how three N=1 reads can produce three
  contradictory stories on the same fixture, and practical N>=3 rules.
- Promote "Eval reproducibility" to a named "Methodological lesson 2:
  validating persona-rubric prompt changes before shipping" section with the
  workspace pattern, persona-runner contract, fixture-matrix taxonomy
  (textbook positive / textbook negative / negative control / ambiguous
  boundary / stable-disagreement candidate), and step-by-step apply guide
  generalized beyond safe_auto.
- Add cross-reference to ce-doc-review-calibration-patterns-2026-04-19.md's
  "Reviewer variance is inherent" section as the precedent in this repo for
  the variance-as-noise warning.
- Add `last_updated: 2026-04-25` and additional tags (eval-methodology,
  variance) per the /ce-compound update convention.

Tests: 22/22 ce-code-review contract pass. release:validate clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0011b0548a


Comment thread on docs/solutions/skill-design/safe-auto-rubric-calibration-2026-04-25.md (outdated)
…omment truncation

Codex review on PR #695 caught that `related_pr: PR #685 ...` in the
calibration writeup parses as just `'PR'` — YAML treats unquoted ' #' as
a comment delimiter, so everything from the `#` onward was silently
dropped. Any tooling that indexes or renders `related_pr` would lose
the linkage.

Fix: quote the value. Repo-wide sweep for the same pattern in docs/
frontmatter found one other instance with the same risk (and an
additional unquoted-colon issue compounding it):
docs/plans/2026-04-16-001-fix-ce-polish-beta-detection-gaps-plan.md.
Quoted that one too while in here.

Verified: both files now parse correctly via yaml.safe_load. Re-sweep
across docs/ frontmatter shows zero remaining ' #' instances.
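
For illustration, the failure mode and the fix (the trailing text after #685 in the real frontmatter is elided here):

```yaml
# Before: in a plain scalar, YAML treats " #" as the start of a comment,
# so everything from "#685" onward is dropped and the value parses as "PR".
related_pr: PR #685

# After: quoting preserves the full string.
related_pr: "PR #685"
```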

Tests: 910/910 pass. release:validate clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tmchow merged commit ad9577e into main on Apr 26, 2026
2 checks passed
github-actions (Bot) mentioned this pull request on Apr 26, 2026
michaelvolz pushed a commit to michaelvolz/compound-engineering-plugin-windows-version that referenced this pull request Apr 28, 2026
…_auto boundary (EveryInc#695)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Calibrate safe_auto vs gated_auto boundary in ce-code-review persona output
