test: label-reset probe (DO NOT MERGE) by danielmeppiel · Pull Request #1025 · microsoft/apm

danielmeppiel · 2026-04-28T20:29:10Z

Probe PR to validate pr-panel-label-reset.yml. Will be closed after the test. Base = refactor/review-panel-fanout so the workflow file is loaded from the feature branch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

danielmeppiel · 2026-04-28T20:37:39Z

Cleanup: probe done. Reset workflow validated -- panel-rejected was stripped within ~13s of synchronize. See PR #1022 description for full evidence.

Probe PR #1025 (closed). Reset workflow stripped panel-rejected within 13s of pull_request:synchronize. panel-approved path is identical code in the same loop -- proven by parity. Side finding: created the missing 'panel-approved' repo label (green) and refreshed 'panel-rejected' description+color.
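The reset behavior described above (both verdict labels stripped in one loop on every push) can be illustrated with a small sketch. This is not the contents of pr-panel-label-reset.yml, which is a plain GitHub Actions workflow; it is a hypothetical Python rendering of the same logic. Only the two label names come from this PR; the function names are assumptions made for illustration.

```python
from typing import Iterable, List

# Label names are taken from the PR text; everything else is illustrative.
VERDICT_LABELS = ("panel-approved", "panel-rejected")


def stale_verdict_labels(current_labels: Iterable[str]) -> List[str]:
    """Return the verdict labels currently on the PR, in VERDICT_LABELS order."""
    present = set(current_labels)
    # One loop covers both labels, so the approved and rejected paths
    # go through identical code.
    return [label for label in VERDICT_LABELS if label in present]


def labels_after_push(current_labels: Iterable[str]) -> List[str]:
    """Return the labels left on the PR once both verdict labels are stripped."""
    return [label for label in current_labels if label not in VERDICT_LABELS]


if __name__ == "__main__":
    labels = ["bug", "panel-rejected"]
    print(stale_verdict_labels(labels))
    print(labels_after_push(labels))
```

Because both labels are handled by the same loop, exercising the panel-rejected path also exercises the code that would remove panel-approved, which is the parity argument made in the note above.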

…ion (#1022)

* refactor(review-panel): true fan-out + binary verdict + label automation

Refactor apm-review-panel from PANEL-IN-ONE-CONTEXT anti-pattern (architectural-patterns.md A1) to true A1 PANEL realized via B1 FAN-OUT + SYNTHESIZER. Each panelist runs in its own task-tool agent thread and returns JSON; the orchestrator schema-validates, hands all returns to the apm-ceo synthesizer task, then derives a binary verdict deterministically and is the SOLE writer to the PR.

What changes for users:
- Verdict is now binary: APPROVE or REJECT. No 'approve with reservations'. The schema makes that structurally impossible.
- Two severity buckets only: required (blocks merge) and nits (one-liner, skip if you want). No third bucket accumulates debt.
- Auto-label panel-approved or panel-rejected on every panel run.
- Trigger label panel-review is removed after the run, so re-applying it re-runs the panel cleanly.
- New companion workflow pr-panel-label-reset.yml (plain GitHub Actions, no LLM) strips both verdict labels on every new push so a stale verdict can never linger past a code change.
- Top-loaded comment: verdict + required + nits + CEO arbitration on top; per-persona detail collapsed in <details> at the bottom.
- Comment cap drops from 7 to 2 (one CEO comment + one safety overflow).

What changes architecturally:
- Each persona .agent.md gets a new 'Output contract when invoked by apm-review-panel' section: return JSON only, no GitHub writes.
- Two new JSON schemas (panelist-return-schema.json, ceo-return-schema.json) define the cross-thread contract.
- Single-writer interlock: only the orchestrator touches safe-outputs.
- S4 schema gate: malformed panelist returns trigger a re-spawn.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(apm-review-panel): add deterministic verdict harness

Genesis Step 8 evals gate, deterministic slice. Validates the parts of the panel that do NOT require an LLM:
- JSON schema validation (S4 gate) for panelist + CEO returns
- Verdict computation per orchestrator rule (APPROVE iff sum(required)==0)

Five cases: clean-pr APPROVE, rejected-pr REJECT, plus three negative cases (missing-nits, unknown-persona, disposition-leak) that confirm `additionalProperties: false` and `required` constraints reject malformed shapes before they reach the verdict gate.

Run:
    uv run --with jsonschema python3 \
      .apm/skills/apm-review-panel/evals/run-verdict-harness.py

Result: ALL PASS.

Does NOT replace the option B branch-pin end-to-end test (which is required to prove an actual LLM panelist returns well-formed JSON); documented in evals/README.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(apm-review-panel): record trigger eval run 2026-04-28

LLM dispatcher self-evaluation against trigger-evals.json (16 queries, 60/40 train/val split). Result: 16/16 correct.

Validation split:
- should-trigger: 3/3 = 1.00 (gate >= 0.5: PASS)
- should-NOT: 3/3 = 1.00 (gate >= 0.5: PASS)

Caveat recorded in the result file: this is a single-LLM judgment; canonical evals would average over multiple dispatcher models. Real LLM judgment, not hand-waving.

Content evals (with-skill vs without-skill) still require either real 6-persona fan-out via task tool OR option B branch-pin gh-aw run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore(TEMP): pin panel to refactor/review-panel-fanout for e2e eval

DO NOT MERGE WITHOUT REVERTING. Temporary pin so the gh-aw workflow loads the refactored panel skill and persona contracts from this branch instead of microsoft/apm#main, enabling pre-merge end-to-end validation per option B.

To revert: change packages back to microsoft/apm#main, recompile, push.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(apm-review-panel): record e2e eval result + revert temp pin

End-to-end gh-aw run of the refactored panel against PR #931 succeeded on first try with all 10 acceptance criteria passing:
- Verdict header literal: '## APM Review Panel Verdict: REJECT'
- Top-loaded order: verdict -> required (3) -> nits (4) -> CEO -> per-persona
- All 6 personas spawned and reported back
- auth-expert correctly inactive with cited reason
- Verdict deterministic from required[] count (3 -> REJECT)
- panel-rejected label applied
- Comment count: 1 (single-writer interlock held)
- Per-persona detail in collapsed <details> block
- Python Architect class diagram passed through extras
- All 7 gh-aw jobs SUCCESS

Bonus: panel surfaced a real regression in PR #931 that prior reviewers missed (proposed docstring silently drops gemini routing). Three independent panelists converged with zero dissent.

Workflow run: github.com/microsoft/apm/actions/runs/25069734881
Full result: .apm/skills/apm-review-panel/evals/results-e2e-pr931-2026-04-28.md

Reverts the temporary microsoft/apm#refactor/review-panel-fanout pin back to microsoft/apm#main. Recompiled lock file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(apm-review-panel): record label-reset workflow validation

Probe PR #1025 (closed). Reset workflow stripped panel-rejected within 13s of pull_request:synchronize. panel-approved path is identical code in the same loop -- proven by parity. Side finding: created the missing 'panel-approved' repo label (green) and refreshed 'panel-rejected' description+color.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
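The orchestrator rule quoted in the commit message (APPROVE iff sum(required)==0) can be sketched in a few lines. This is a minimal illustration, not the code in run-verdict-harness.py. The `required` and `nits` keys follow the severity buckets named above; the `persona` key, the sample findings, and the function name are assumptions, and the sketch presumes the returns have already passed schema validation.

```python
from typing import Any, List, Mapping


def compute_verdict(panelist_returns: List[Mapping[str, Any]]) -> str:
    """Derive the binary verdict from already-validated panelist returns.

    APPROVE only when no panelist reported a required finding; nits never
    affect the verdict.
    """
    total_required = sum(len(ret["required"]) for ret in panelist_returns)
    return "APPROVE" if total_required == 0 else "REJECT"


if __name__ == "__main__":
    # Hypothetical returns; "persona" is an assumed field name.
    clean = [
        {"persona": "auth-expert", "required": [], "nits": []},
        {"persona": "python-architect", "required": [], "nits": ["rename helper"]},
    ]
    blocked = [
        {"persona": "python-architect", "required": ["fix routing"], "nits": []},
    ]
    print(compute_verdict(clean))
    print(compute_verdict(blocked))
```

Because the verdict depends only on the count of required findings, it stays deterministic regardless of how the synthesizer words its summary.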

test: probe commit 1 for label-reset workflow validation

bbeaf62

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

danielmeppiel added the panel-rejected label ("Apm-review-panel verdict: REJECT. Removed automatically on next push.") Apr 28, 2026

test: probe 2 for label reset

4117edd

github-actions Bot removed the panel-rejected label ("Apm-review-panel verdict: REJECT. Removed automatically on next push.") Apr 28, 2026

danielmeppiel closed this Apr 28, 2026

danielmeppiel deleted the test/label-reset-probe branch April 28, 2026 20:37

danielmeppiel mentioned this pull request Apr 28, 2026

refactor(review-panel): true fan-out + binary verdict + label automation #1022

Merged

danielmeppiel wants to merge 2 commits into refactor/review-panel-fanout from test/label-reset-probe
