test: label-reset probe (DO NOT MERGE) #1025
Closed
danielmeppiel wants to merge 2 commits into refactor/review-panel-fanout from
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Collaborator
Author
Cleanup: probe done. Reset workflow validated -- panel-rejected was stripped within ~13s of synchronize. See PR #1022 description for full evidence.
danielmeppiel added a commit that referenced this pull request on Apr 28, 2026
Probe PR #1025 (closed). Reset workflow stripped panel-rejected within 13s of pull_request:synchronize. panel-approved path is identical code in the same loop -- proven by parity. Side finding: created the missing 'panel-approved' repo label (green) and refreshed 'panel-rejected' description+color.
danielmeppiel added a commit that referenced this pull request on Apr 28, 2026
…ion (#1022)

* refactor(review-panel): true fan-out + binary verdict + label automation

Refactor apm-review-panel from PANEL-IN-ONE-CONTEXT anti-pattern (architectural-patterns.md A1) to true A1 PANEL realized via B1 FAN-OUT + SYNTHESIZER. Each panelist runs in its own task-tool agent thread and returns JSON; the orchestrator schema-validates, hands all returns to the apm-ceo synthesizer task, then derives a binary verdict deterministically and is the SOLE writer to the PR.

What changes for users:
- Verdict is now binary: APPROVE or REJECT. No 'approve with reservations'. The schema makes that structurally impossible.
- Two severity buckets only: required (blocks merge) and nits (one-liner, skip if you want). No third bucket accumulates debt.
- Auto-label panel-approved or panel-rejected on every panel run.
- Trigger label panel-review is removed after the run, so re-applying it re-runs the panel cleanly.
- New companion workflow pr-panel-label-reset.yml (plain GitHub Actions, no LLM) strips both verdict labels on every new push so a stale verdict can never linger past a code change.
- Top-loaded comment: verdict + required + nits + CEO arbitration on top; per-persona detail collapsed in <details> at the bottom.
- Comment cap drops from 7 to 2 (one CEO comment + one safety overflow).

What changes architecturally:
- Each persona .agent.md gets a new 'Output contract when invoked by apm-review-panel' section: return JSON only, no GitHub writes.
- Two new JSON schemas (panelist-return-schema.json, ceo-return-schema.json) define the cross-thread contract.
- Single-writer interlock: only the orchestrator touches safe-outputs.
- S4 schema gate: malformed panelist returns trigger a re-spawn.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(apm-review-panel): add deterministic verdict harness

Genesis Step 8 evals gate, deterministic slice. Validates the parts of the panel that do NOT require an LLM:
- JSON schema validation (S4 gate) for panelist + CEO returns
- Verdict computation per orchestrator rule (APPROVE iff sum(required)==0)

Five cases: clean-pr APPROVE, rejected-pr REJECT, plus three negative cases (missing-nits, unknown-persona, disposition-leak) that confirm `additionalProperties: false` and `required` constraints reject malformed shapes before they reach the verdict gate.

Run:
    uv run --with jsonschema python3 \
      .apm/skills/apm-review-panel/evals/run-verdict-harness.py

Result: ALL PASS. Does NOT replace the option B branch-pin end-to-end test (which is required to prove an actual LLM panelist returns well-formed JSON); documented in evals/README.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(apm-review-panel): record trigger eval run 2026-04-28

LLM dispatcher self-evaluation against trigger-evals.json (16 queries, 60/40 train/val split). Result: 16/16 correct.

Validation split:
- should-trigger: 3/3 = 1.00 (gate >= 0.5: PASS)
- should-NOT: 3/3 = 1.00 (gate >= 0.5: PASS)

Caveat recorded in the result file: this is a single-LLM judgment; canonical evals would average over multiple dispatcher models. Real LLM judgment, not hand-waving. Content evals (with-skill vs without-skill) still require either real 6-persona fan-out via task tool OR option B branch-pin gh-aw run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore(TEMP): pin panel to refactor/review-panel-fanout for e2e eval

DO NOT MERGE WITHOUT REVERTING. Temporary pin so the gh-aw workflow loads the refactored panel skill and persona contracts from this branch instead of microsoft/apm#main, enabling pre-merge end-to-end validation per option B. To revert: change packages back to microsoft/apm#main, recompile, push.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(apm-review-panel): record e2e eval result + revert temp pin

End-to-end gh-aw run of the refactored panel against PR #931 succeeded on first try with all 10 acceptance criteria passing:
- Verdict header literal: '## APM Review Panel Verdict: REJECT'
- Top-loaded order: verdict -> required (3) -> nits (4) -> CEO -> per-persona
- All 6 personas spawned and reported back
- auth-expert correctly inactive with cited reason
- Verdict deterministic from required[] count (3 -> REJECT)
- panel-rejected label applied
- Comment count: 1 (single-writer interlock held)
- Per-persona detail in collapsed <details> block
- Python Architect class diagram passed through extras
- All 7 gh-aw jobs SUCCESS

Bonus: panel surfaced a real regression in PR #931 that prior reviewers missed (proposed docstring silently drops gemini routing). Three independent panelists converged with zero dissent.

Workflow run: github.com/microsoft/apm/actions/runs/25069734881
Full result: .apm/skills/apm-review-panel/evals/results-e2e-pr931-2026-04-28.md

Reverts the temporary microsoft/apm#refactor/review-panel-fanout pin back to microsoft/apm#main. Recompiled lock file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(apm-review-panel): record label-reset workflow validation

Probe PR #1025 (closed). Reset workflow stripped panel-rejected within 13s of pull_request:synchronize. panel-approved path is identical code in the same loop -- proven by parity. Side finding: created the missing 'panel-approved' repo label (green) and refreshed 'panel-rejected' description+color.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
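For readers who want to see the verdict rule from the commit message in concrete form (APPROVE iff sum(required)==0, with malformed returns rejected before the verdict gate), here is a minimal Python sketch. It is not the actual run-verdict-harness.py, whose contents are not shown here. The `persona` key, the function names, and the hand-rolled shape check standing in for real JSON Schema validation are all assumptions made for illustration; only `required` and `nits` are named in the source.

```python
from typing import Any

# Assumption: a panelist return carries exactly these keys. Only `required`
# and `nits` appear in the commit message; `persona` is hypothetical.
ALLOWED_KEYS = {"persona", "required", "nits"}


def validate_return(ret: dict[str, Any]) -> None:
    """Reject shapes a strict schema would reject (missing or extra keys)."""
    keys = set(ret)
    missing = ALLOWED_KEYS - keys   # mimics a `required` constraint
    extra = keys - ALLOWED_KEYS     # mimics `additionalProperties: false`
    if missing or extra:
        raise ValueError(
            f"malformed return: missing={sorted(missing)} extra={sorted(extra)}"
        )
    for key in ("required", "nits"):
        if not isinstance(ret[key], list):
            raise ValueError(f"{key} must be a list")


def compute_verdict(returns: list[dict[str, Any]]) -> str:
    """Return APPROVE iff the total count of required findings is zero."""
    for ret in returns:
        validate_return(ret)  # shape check runs before the verdict gate
    total_required = sum(len(ret["required"]) for ret in returns)
    return "APPROVE" if total_required == 0 else "REJECT"
```

Because the verdict depends only on the count of `required` items, nits never affect it, which matches the two-bucket design described above.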
Probe PR to validate pr-panel-label-reset.yml. Will be closed after the test. Base = refactor/review-panel-fanout so the workflow file is loaded from the feature branch.
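The decision this probe exercises can be sketched in a few lines of Python. The real pr-panel-label-reset.yml is a plain GitHub Actions workflow whose contents are not reproduced on this page, so the function below is only an illustration of the behavior described in the conversation: on a `synchronize` event, both verdict labels are stripped if present. The function name and signature are hypothetical.

```python
# Label names are taken from the PR discussion; everything else is assumed.
VERDICT_LABELS = ("panel-approved", "panel-rejected")


def labels_to_strip(current_labels: set[str], event_action: str) -> list[str]:
    """Return the verdict labels to remove for a pull_request event action."""
    if event_action != "synchronize":
        # Only a new push to the PR branch invalidates a prior verdict.
        return []
    # Same loop handles both labels, so the two paths share identical logic.
    return [label for label in VERDICT_LABELS if label in current_labels]
```

For example, `labels_to_strip({"panel-rejected", "bug"}, "synchronize")` yields `["panel-rejected"]` and leaves unrelated labels alone.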