fix(ce-code-review): replace LFG with best-judgment auto-resolve#685
fix(ce-code-review): replace LFG with best-judgment auto-resolve#685
Conversation
LFG previously routed too many findings to ticket-filing when the agent could propose concrete fixes from review context. Stage 5b validation and bulk-preview approval ran before any action, requiring upfront research the user already opted out of by picking LFG. Push personas to commit to a `suggested_fix` whenever any defensible code change is reachable from the diff -- imperfect information is not grounds for omission. `manual` findings with `suggested_fix` now recommend Apply (was Defer). Drop Stage 5b and bulk-preview on the LFG path; dispatch the fixer immediately on the full pending set. Items that fail to apply or lack a proposed fix surface in a `failed` bucket with a one-line reason; one post-run question (file tickets / walk through / ignore) handles them only when the bucket is non-empty, fired before the completion report so it reflects final state. Walk-through option A inherits the action-derivation change automatically -- per-finding recommendations flip from Defer to Apply when the persona proposed a fix. `LFG the rest` from inside the walk-through dispatches one fixer pass on the union of accumulated Apply set and remaining undecided findings, preserving the "one fixer, consistent tree" contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f1e36e990b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Address PR #685 review feedback: - Step 3 fixer queue contract previously defined behavior for gated_auto/manual WITH suggested_fix and manual WITHOUT, but left gated_auto WITHOUT suggested_fix undefined. Extend the no-fix branch to cover both gated_auto and manual so they route to failed instead of being silently skipped, preserving the apply-or-fail contract. - walkthrough.md still documented LFG-the-rest -> Proceed/Cancel and end-of-loop dispatch semantics from the removed bulk-preview path. Reconcile: scope end-of-walk-through dispatch to the run-to-completion path and explicitly point to the LFG-the-rest bullet for that path's single-dispatch semantics. Remove stale Proceed/Cancel framings and the bulk-preview reference in the no-sink adaptation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 530560a3eb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…ollision The interactive routing option B was labeled "LFG" in user-facing prose and "the LFG path" in skill internals. That name collides with the /lfg macro skill in this repo, creating confusion when an agent reads the ce-code-review skill while also having /lfg in scope. Rename: - User-facing menu (option B): "LFG. Apply the agent's best-judgment action per finding" -> "Auto-resolve with best judgment -- apply per-finding fixes the agent can defend, surface the rest" - User-facing per-finding option D: "LFG the rest" -> "Auto-resolve with best judgment on the rest" - Internal naming: "LFG path" / "LFG routing (option B)" -> "best-judgment path" / "best-judgment routing (option B)" - Routing tables, Step 3 fixer queue prose, completion-report bucket list, and other internal references updated to match - Test assertions updated against the new labels The user-facing label leads with "Auto-resolve" so the action is clear on scan; "with best judgment" qualifies how decisions are made. The internal name uses "best-judgment" (not "auto-resolve") to avoid lexical collision with the existing mode:autofix flag, since the two do different things and live at different structural levels -- autofix is a top-level mode while best-judgment is a routing option within interactive mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 33e00c4047
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…dering Address PR #685 review feedback: - walkthrough.md: add a "No suggested_fix (Apply suppressed)" adaptation so the entry-point menu omits Apply for findings without a concrete suggested_fix. Stage 5 step 6b already maps these to a Defer recommendation; this matches the suppression applied during the post-run "Walk through these one at a time" re-entry path so the same handling applies regardless of which entry path the user came in through. - SKILL.md Step 3: extend the homogeneous queue contract with a defensive backstop. If a no-suggested_fix finding ever slips into the walk-through Apply set, the fixer routes it to failed with reason "no fix proposed by reviewer", mirroring the heterogeneous queue's apply-or-fail handling. Autofix and headless callers are unaffected (they only ever process safe_auto items). - SKILL.md Step 3: update the "Best-judgment path is single-pass" paragraph to reference Step 2 option B's two-branch ordering (failed empty -> emit report directly; non-empty -> question first, then execute, then report). Removes a contradiction the prior reconciliation commit (530560a) didn't catch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…re-reason taxonomy
Two polish fixes surfaced by interactive eval runs of the best-judgment path:
- Sink-omission stem text ("tracker sink", "on this platform") was being copied
verbatim from spec `e.g.` examples into user-facing output. The phrasing is
jargon-y and incorrectly implies a per-agent-platform constraint when the
missing piece is per-project. Replaced the three `e.g.` strings (SKILL.md
options C and post-run failure handler, walkthrough.md no-sink adaptation)
with description-not-prescription guidance that names what to convey
(no issue tracker configured for this checkout) and explicitly bans the
bad phrasing so the model doesn't regenerate it.
- Failure-reason taxonomy in Step 3 was specified as four canonical strings,
but real runs produced richer, finding-specific reasons (e.g., "needs intent
confirmation; was the field narrowing deliberate, or do clients still need
the full payload?") that read better in the post-run question framing.
Reframed each canonical string as "default phrasing when nothing more
specific applies" and added a worked example showing the level of
context-specific detail to prefer. Routing categories stay fixed.
Tests: 908/908 pass. Contract assertions still match the canonical phrases
(now serving as defaults).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f20954e2ea
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| ### Adaptations | ||
|
|
||
| - **No `suggested_fix` (Apply suppressed):** when the finding has no concrete `suggested_fix` (`gated_auto` or `manual` with `suggested_fix == null`), option A (`Apply`) is **omitted from the menu**. Stage 5 step 6b already maps these to a `Defer` recommendation, so the `(recommended)` marker lands on a still-visible option. The menu shows three options: `Defer` / `Skip` / `Auto-resolve with best judgment on the rest` (and reduces to `Skip` / `Auto-resolve with best judgment on the rest` when combined with the no-sink adaptation). When this combines with the advisory variant, the same suppression is moot because option A is already replaced with `Acknowledge`. This rule mirrors the suppression applied during `SKILL.md` Step 2 Interactive option B's post-run `Walk through these one at a time` re-entry, so the same handling applies regardless of which entry path the user came in through. |
There was a problem hiding this comment.
Handle no-suggested_fix in N=1/no-sink adaptation
The new No suggested_fix (Apply suppressed) rule says Apply should be omitted, but the menu logic still has a separate N=1 + no sink case that keeps Apply / Skip. When a single remaining finding has no suggested_fix and no tracker sink is available (a reachable post-run failed-item scenario), these adaptations conflict and can re-enable Apply for an item the fixer cannot safely execute, violating the intended apply-or-fail guard.
Useful? React with 👍 / 👎.
| - **Heterogeneous queue (best-judgment path — interactive option B and walk-through's `Auto-resolve with best judgment on the rest`):** the queue mixes `gated_auto`, `manual`, and `advisory` findings. Each item carries: `autofix_class`, `severity`, `file:line`, `title`, `suggested_fix` (may be null), `why_it_matters`, and `evidence`. The fixer routes each item to one of four buckets — the routing categories are fixed; the failure *reason string* should be specific enough that the post-run question's framing (`N findings could not be auto-resolved...`) reads meaningfully to the user. Use the category's default phrasing below when nothing more specific applies; prefer richer, finding-specific reasons that capture *why this particular item didn't land* (e.g., `needs intent confirmation; was the field narrowing deliberate, or do clients still need the full payload?` is more useful than the generic default). | ||
| - **`safe_auto` / `gated_auto` / `manual` with `suggested_fix`:** light evidence-match check (verify the cited code at `file:line` still resembles the persona's evidence — concretely: at least one identifier or distinctive token from the evidence appears at the cited location, and the line has not been deleted). If the check passes, attempt to apply the fix. On clean apply, route to `applied`. On fix-application failure (line moved, conflicting edit, syntax issue), route to `failed` with a concrete reason — default phrasing `fix did not apply cleanly: <error>` when no richer description fits. |
There was a problem hiding this comment.
Define fallback when heterogeneous queue lacks evidence payload
This queue contract now requires why_it_matters and evidence on every best-judgment item and makes routing depend on an evidence-match check, but the overall review flow still allows persona artifact writes to fail and proceed with compact returns only. In that documented write-failure path, Option B has no guaranteed evidence source, so the fixer behavior is undefined (it cannot reliably perform the required evidence gate). Add an explicit fallback for missing artifact detail to keep best-judgment routing deterministic.
Useful? React with 👍 / 👎.
…omment truncation Codex review on PR #695 caught that `related_pr: PR #685 ...` in the calibration writeup parses as just `'PR'` — YAML treats unquoted ' #' as a comment delimiter, so everything from the `#` onward was silently dropped. Any tooling that indexes or renders `related_pr` would lose the linkage. Fix: quote the value. Repo-wide sweep for the same pattern in docs/ frontmatter found one other instance with the same risk (and an additional unquoted-colon issue compounding it): docs/plans/2026-04-16-001-fix-ce-polish-beta-detection-gaps-plan.md. Quoted that one too while in here. Verified: both files now parse correctly via yaml.safe_load. Re-sweep across docs/ frontmatter shows zero remaining ' #' instances. Tests: 910/910 pass. release:validate clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ryInc#685) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
/ce-code-reviewhad a routing option (formerly "LFG") meant for users who wanted the agent to just go — but it filed tickets for findings the agent could fix. On a 13-finding branch, 8 ended up in the file-tickets bucket; when prompted, the agent could propose concrete fixes for all 8 without external context. This PR makes "go fix things" actually do that, drops the approval gates that ran before any action, and renames the path to "Auto-resolve with best judgment" so the label matches what it does (and stops colliding with the/lfgmacro skill in this repo).What changed
Three structural shifts work together:
Personas commit to a fix when they can defend one. The subagent template now treats
suggested_fixas the authoritative signal for "the agent can act on this." Personas propose a defensible fix whenever one is reachable from review context (parallel patterns, framework conventions, the cited code itself). Imperfect information is not grounds for omission — propose the most defensible default and name the assumption.manualfindings withoutsuggested_fixare reserved for genuinely handoff-only cases (no defensible code change, or purely organizational actions like legal sign-off).Option B applies fixes directly; no preview gate. Stage 5b validator pre-pass and bulk-preview approval no longer run on this path. The fixer dispatches immediately on the full pending action set (
gated_auto+manual+advisory) and returns a{applied, failed, advisory}partition. The fixer's apply/fail outcome is the validation. Items that fail get a one-line reason (fix did not apply cleanly,evidence no longer matches code at <file:line>,no fix proposed by reviewer). One post-run question fires only whenfailedis non-empty (file tickets / walk through / ignore), fired before the completion report so the report reflects final state.Renamed: LFG → best-judgment auto-resolve. User-facing label:
Auto-resolve with best judgment — apply per-finding fixes the agent can defend, surface the rest. Walk-through option D mirrors it:Auto-resolve with best judgment on the rest. Internal naming uses "best-judgment path" (not "auto-resolve path") to avoid lexical collision withmode:autofix— the two live at different structural levels (autofix is a top-level mode; best-judgment is a routing option within interactive mode), and naming them similarly would invite model confusion in routing tables and prose.Other paths are unchanged: walk-through option A inherits the action-derivation rule automatically (per-finding recommendations flip from Defer to Apply when the persona proposed a fix); option C (file tickets) keeps Stage 5b and bulk-preview because filing produces durable external state worth previewing; autofix and headless retain their existing single-pass
safe_autocontracts.Why this matters
On uncommitted local edits — where this path runs — reverting an applied edit is cheaper than closing 8 GitHub issues that should never have been filed. The previous defer-bias inverted that cost calculus by treating ticket-filing as the safer default.
Calibration
Whether personas actually populate
suggested_fixmore often than before is empirical and won't be answered until real review runs land. The new template wording is strong and explicit about the failure mode, but if runs show under-compliance,references/subagent-template.mdis where to tune. The richer payload also surfaces a separate question about whethersafe_autoitself is undersized — tracked separately in #686.Test plan
tests/review-skill-contract.test.tsupdated with assertions for the new flow contract: no Stage 5b on the best-judgment path, no bulk-preview gate on best-judgment, post-run failure-handling question structure, heterogeneous fixer queue, per-class failure reason taxonomy. Assertions match against the new labels (Auto-resolve with best judgment,best-judgment routing (option B),(B)), not the oldLFGstrings.bun run release:validateclean.References
docs/plans/2026-04-25-001-fix-ce-code-review-lfg-defer-bias-plan.mdce-doc-review's LFG terminology)docs/plans/2026-04-21-002-refactor-ce-code-review-precision-and-validation-plan.mddocs/plans/2026-04-17-002-feat-ce-review-interactive-judgment-plan.md