refactor: slim expert-reviewer prompt — remove redundant instructions#656
refactor: slim expert-reviewer prompt — remove redundant instructions#656
Conversation
3da7c24 to
d092245
Compare
Multi-Model Code Review — PR #656 (merged)Reviewers: 3 independent reviewers, 3 review rounds Post-Merge Verificationinstruction-drift workflow triggered twice after merge:
Root cause: This is a gh-aw agent compliance issue — the agent understands the task and executes the scripts correctly, but doesn't follow through on calling the All review findings from 3 rounds remain resolved. The workflow infrastructure is correct — only the noop compliance needs iteration. |
|
/review |
|
❌ Expert Code Review failed. Please review the logs for details. |
|
/review |
|
✅ Expert Code Review completed successfully! |
Design-Level Findings (outside diff hunks)These findings affect code outside the changed diff hunks and cannot be posted as inline comments. 🟢 MINOR — Architecture security table contradicts new min-integrity guidance (3/3 reviewers after follow-up) File: The security boundaries table still says "Requires explicit min-integrity configuration" in the "What it doesn't do" column. This directly contradicts the new guidance added 10 lines later: "Do NOT set min-integrity explicitly." An agent reading this table would receive conflicting instructions. Fix: Update the table cell to: "Automatic lockdown may not match manual policy; blocked-users/trusted-users still require explicit config" 🟢 MINOR — Cap-3 disputed-findings rule only in File:
|
There was a problem hiding this comment.
Expert Code Review — PR #656
Methodology: 3 independent reviewers with adversarial consensus. Disputed findings (flagged by only 1/3) were escalated to the other 2 reviewers for validation. All findings below achieved 2/3+ agreement.
Findings Summary
| # | Severity | Consensus | Finding | File |
|---|---|---|---|---|
| 1 | 🟡 MODERATE | 3/3 | Concurrency groups don't match — /review won't cancel auto-review |
review-on-open.agent.md:17 |
| 2 | 🟡 MODERATE | 2/3 | Sub-agent recursion guard incomplete in standalone agent | expert-reviewer.agent.md:37 |
| 3 | 🟡 MODERATE | 3/3 ✱ | Disputed-finding cap has no severity ordering | review-shared.md:88 |
| 4 | 🟢 MINOR | 3/3 ✱ | Architecture security table contradicts min-integrity guidance | architecture.md:86 (outside diff) |
| 5 | 🟢 MINOR | 3/3 ✱ | Cap-3 rule divergence between agent file and shared config | expert-reviewer.agent.md § 3 (outside diff) |
✱ = initially flagged by 1/3, confirmed by follow-up reviewers
What Looks Good
- ✅ min-integrity removal is well-justified — compiler v0.62.2 bug is real, runtime
determine-automatic-lockdownis the standard pattern - ✅ Draft PR guard (
if: github.event.pull_request.draft == false) is correct in both source and compiled lock file - ✅ Sub-agent recursion prevention in
review-shared.mdis strong (explicit "do NOT dispatch sub-agents or use the task tool" guard) - ✅ Token optimizations (no pre-reading, cap disputed findings, 2-model follow-ups) are reasonable cost-reduction measures
- ✅ Noop guidance fix in instruction-drift workflow addresses a real failure mode
- ✅ Prompt slimming removes genuinely redundant instructions without losing review dimensions
- ✅ Lock files for
review-on-openmatch the source changes (draft guard, concurrency group both present)
Assessment
No 🔴 CRITICAL issues. The PR makes well-motivated improvements to the review workflow system. The three 🟡 MODERATE findings are all low-effort fixes that should be addressed before or shortly after merge:
- Concurrency mismatch (Finding 1) — the stated design goal (shared group) is not achieved. Add
concurrency:toreview.agent.mdand recompile. - Recursion guard (Finding 2) — one-line fix to
expert-reviewer.agent.mdline 37. - Severity ordering (Finding 3) — add prioritization to the cap rule.
CI & Test Coverage
This PR modifies only documentation/workflow files (.md, .yml, .ps1). No application code or tests are affected. No CI test failures expected.
Generated by Expert Code Review for issue #656
| contents: read | ||
| pull-requests: read | ||
|
|
||
| # Intentional: shared group with review.agent.md — /review cancels in-progress auto-review. |
There was a problem hiding this comment.
🟡 MODERATE — Concurrency groups don't match between review workflows (Flagged by: 3/3 reviewers)
The comment says "shared group with review.agent.md — /review cancels in-progress auto-review" but review.agent.md has no concurrency: block. The compiled lock files prove the mismatch:
review-on-open.agent.lock.yml:46→group: review-$\{\{ ... }}review.agent.lock.yml:51→group: "gh-aw-$\{\{ github.workflow }}-$\{\{ ... }}"
These groups never collide, so /review and auto-review run in parallel — posting duplicate reviews.
Fix: Add a matching concurrency: block to review.agent.md and recompile:
concurrency:
group: "review-$\{\{ github.event.issue.number || github.event.pull_request.number || github.run_id }}"
cancel-in-progress: false| > Read `.github/copilot-instructions.md` for project conventions. | ||
| > Read `.github/copilot-instructions.md` for project conventions and architecture. | ||
| > | ||
| > For each finding: file path, line number (within a `@@` diff hunk — mark "outside diff" if not), severity (🔴 CRITICAL, 🟡 MODERATE, 🟢 MINOR), concrete failing scenario, and fix suggestion. Return findings as text — do NOT call safe-output tools. |
There was a problem hiding this comment.
🟡 MODERATE — Sub-agent recursion guard incomplete (Flagged by: 2/3 reviewers)
This line says "do NOT call safe-output tools" but does not prohibit the task tool. The review-shared.md prompt (line 75) has the full guard: "do NOT dispatch sub-agents or use the task tool — act as an individual reviewer only." But when expert-reviewer.agent.md is invoked standalone (not via review workflows), sub-agents receive only this incomplete guard and could recursively spawn further sub-agents.
Fix: Append to this line:
Return findings as text — do NOT call safe-output tools, do NOT dispatch sub-agents or use the task tool.
| 3. **Only 1/3 flagged** → dispatch **exactly 2** follow-up sub-agents (the other 2 models that didn't flag it) asking: "Reviewer X found this issue: [finding]. Do you agree or disagree? Explain why." | ||
| - If 2+ now agree → include | ||
| - If still 1/3 → discard (note as "discarded — single reviewer only") | ||
| - **Cap at 3 disputed findings** — if more than 3 findings are 1/3, discard the rest without follow-up to preserve token budget for posting. |
There was a problem hiding this comment.
🟡 MODERATE — Disputed-finding cap has no severity ordering (Flagged by: 3/3 reviewers after follow-up)
The cap discards excess 1/3 findings "without follow-up" but specifies no ordering. If 4+ findings are disputed and only 3 get follow-up slots, the selection is arbitrary — a 🔴 CRITICAL finding could be silently dropped while 🟢 MINOR ones consume the budget.
Fix: Add severity prioritization:
- **Cap at 3 disputed findings** (prioritized by severity: 🔴 > 🟡 > 🟢) — if more than 3 findings
are 1/3, discard the lowest-severity remainder without follow-up. Note discarded findings in the
review body.
…abs learnings Bundled changes from 3 review rounds (8 findings, all resolved): Expert reviewer: - Slimmed prompt (72→62 lines) — removed redundant copilot-instructions refs - Restored PolyPilot identity + copilot-instructions read in sub-agent prompt - Added anti-recursion guard: 'Do NOT dispatch sub-agents' - Restored REQUEST_CHANGES rationale min-integrity fix: - Removed explicit min-integrity: approved from review-shared.md (compiler v0.62.2 crashes MCP Gateway — missing repos field) - Updated all skill docs, instructions, and security scanner - Rely on runtime determine-automatic-lockdown instead maui-labs learnings: - Draft PR guard on review-on-open (if: draft == false) - Concurrency groups matching between review.agent.md and review-on-open - cancel-in-progress: false on both (slash_command safety) Other: - CLI Commands section in SKILL.md (gh aw trial/run/audit) - Strengthened noop guidance in instruction-drift workflow - Fixed defense table stale min-integrity text Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
eb44acc to
13a5369
Compare
Based on analysis of 3 actual review runs (PRs #619, #639, #635), several prompt instructions are redundant and add tokens without improving output quality.
What was removed:
What was kept:
72 lines → 60 lines. Same review quality — the evidence from real runs shows the agent finds domain issues from reading code, not from hint bullets.