test: label-reset probe (DO NOT MERGE)#1025

Closed
danielmeppiel wants to merge 2 commits into refactor/review-panel-fanout from test/label-reset-probe

Conversation

@danielmeppiel
Collaborator

Probe PR to validate pr-panel-label-reset.yml. Will be closed after the test. Base = refactor/review-panel-fanout so the workflow file is loaded from the feature branch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
danielmeppiel added the panel-rejected label ("Apm-review-panel verdict: REJECT. Removed automatically on next push.") on Apr 28, 2026
github-actions bot removed the panel-rejected label on Apr 28, 2026
@danielmeppiel
Collaborator Author

Cleanup: probe done. Reset workflow validated -- panel-rejected was stripped within ~13s of synchronize. See PR #1022 description for full evidence.

danielmeppiel deleted the test/label-reset-probe branch on April 28, 2026 20:37
danielmeppiel added a commit that referenced this pull request Apr 28, 2026
Probe PR #1025 (closed). Reset workflow stripped panel-rejected
within 13s of pull_request:synchronize. The panel-approved path runs
identical code in the same loop, so it is covered by parity rather
than exercised directly.

Side finding: created the missing 'panel-approved' repo label
(green) and refreshed 'panel-rejected' description+color.
danielmeppiel added a commit that referenced this pull request Apr 28, 2026
…ion (#1022)

* refactor(review-panel): true fan-out + binary verdict + label automation

Refactor apm-review-panel from PANEL-IN-ONE-CONTEXT anti-pattern
(architectural-patterns.md A1) to true A1 PANEL realized via B1
FAN-OUT + SYNTHESIZER. Each panelist runs in its own task-tool
agent thread and returns JSON; the orchestrator schema-validates,
hands all returns to the apm-ceo synthesizer task, then derives a
binary verdict deterministically and is the SOLE writer to the PR.

What changes for users:
- Verdict is now binary: APPROVE or REJECT. No 'approve with
  reservations'. The schema makes that structurally impossible.
- Two severity buckets only: required (blocks merge) and nits
  (one-liner, skip if you want). No third bucket accumulates debt.
- Auto-label panel-approved or panel-rejected on every panel run.
- Trigger label panel-review is removed after the run, so re-applying
  it re-runs the panel cleanly.
- New companion workflow pr-panel-label-reset.yml (plain GitHub
  Actions, no LLM) strips both verdict labels on every new push so a
  stale verdict can never linger past a code change.
- Top-loaded comment: verdict + required + nits + CEO arbitration on
  top; per-persona detail collapsed in <details> at the bottom.
- Comment cap drops from 7 to 2 (one CEO comment + one safety
  overflow).
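
The label-reset behaviour in the list above can be sketched as a pure function. This is only an illustration in Python: the actual pr-panel-label-reset.yml is a plain GitHub Actions workflow whose contents are not shown here, and the function name `labels_to_strip` is hypothetical.

```python
# The two verdict labels named in this PR.
VERDICT_LABELS = ("panel-approved", "panel-rejected")


def labels_to_strip(current_labels):
    """Return the verdict labels present on the PR that a new push should remove.

    Both labels go through the same loop, so neither verdict can
    outlive a code change.
    """
    return [label for label in VERDICT_LABELS if label in current_labels]
```

For example, a PR labelled `bug` and `panel-rejected` would have only `panel-rejected` stripped; a PR carrying neither verdict label is left untouched.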

What changes architecturally:
- Each persona .agent.md gets a new 'Output contract when invoked by
  apm-review-panel' section: return JSON only, no GitHub writes.
- Two new JSON schemas (panelist-return-schema.json,
  ceo-return-schema.json) define the cross-thread contract.
- Single-writer interlock: only the orchestrator touches
  safe-outputs.
- S4 schema gate: malformed panelist returns trigger a re-spawn.
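
A minimal sketch of the fan-out and schema gate described above, under stated assumptions: `spawn` is a hypothetical callable standing in for the task-tool agent thread, the three top-level keys are assumed rather than copied from panelist-return-schema.json, and the retry limit is illustrative.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Assumed top-level keys; the real contract lives in panelist-return-schema.json.
EXPECTED_KEYS = {"persona", "required", "nits"}


def parse_return(raw):
    """Stand-in for the S4 gate: parse JSON and require exactly the expected keys."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return None
    if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
        return None
    return data


def run_panelist(persona, spawn, max_attempts=2):
    """Spawn one panelist and re-spawn it if the return is malformed."""
    for _ in range(max_attempts):
        parsed = parse_return(spawn(persona))
        if parsed is not None:
            return parsed
    raise RuntimeError(f"{persona}: no well-formed return after {max_attempts} attempts")


def fan_out(personas, spawn):
    """Run each panelist in its own thread; results keep the order of `personas`."""
    with ThreadPoolExecutor(max_workers=max(1, len(personas))) as pool:
        return list(pool.map(lambda p: run_panelist(p, spawn), personas))
```

Only the caller of `fan_out` ever holds the collected returns, which mirrors the single-writer interlock: panelists hand back data and never write to the PR themselves.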

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(apm-review-panel): add deterministic verdict harness

Genesis Step 8 evals gate, deterministic slice. Validates the parts
of the panel that do NOT require an LLM:

- JSON schema validation (S4 gate) for panelist + CEO returns
- Verdict computation per orchestrator rule (APPROVE iff sum(required)==0)

Five cases: clean-pr APPROVE, rejected-pr REJECT, plus three negative
cases (missing-nits, unknown-persona, disposition-leak) that confirm
`additionalProperties: false` and `required` constraints reject
malformed shapes before they reach the verdict gate.
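
The verdict rule above is small enough to sketch directly. This is not the harness itself; it assumes each panelist return carries a `required` list, and the function name `compute_verdict` is hypothetical.

```python
def compute_verdict(panelist_returns):
    """APPROVE iff the total number of required findings across panelists is zero."""
    total_required = sum(len(r["required"]) for r in panelist_returns)
    return "APPROVE" if total_required == 0 else "REJECT"
```

Because the rule depends only on counts, nits never affect the outcome, and a single required finding from any one panelist is enough to reject.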

Run: uv run --with jsonschema python3 \
    .apm/skills/apm-review-panel/evals/run-verdict-harness.py

Result: ALL PASS.

Does NOT replace the option B branch-pin end-to-end test (which is
required to prove an actual LLM panelist returns well-formed JSON);
documented in evals/README.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(apm-review-panel): record trigger eval run 2026-04-28

LLM dispatcher self-evaluation against trigger-evals.json (16 queries,
60/40 train/val split).

Result: 16/16 correct. Validation split:
- should-trigger: 3/3 = 1.00 (gate >= 0.5: PASS)
- should-NOT:     3/3 = 1.00 (gate >= 0.5: PASS)

Caveat recorded in the result file: this is a single-LLM judgment;
canonical evals would average over multiple dispatcher models. Real
LLM judgment, not hand-waving.

Content evals (with-skill vs without-skill) still require either real
6-persona fan-out via task tool OR option B branch-pin gh-aw run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore(TEMP): pin panel to refactor/review-panel-fanout for e2e eval

DO NOT MERGE WITHOUT REVERTING.

Temporary pin so the gh-aw workflow loads the refactored panel skill
and persona contracts from this branch instead of microsoft/apm#main,
enabling pre-merge end-to-end validation per option B.

To revert: change packages back to microsoft/apm#main, recompile, push.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(apm-review-panel): record e2e eval result + revert temp pin

End-to-end gh-aw run of the refactored panel against PR #931 succeeded
on first try with all 10 acceptance criteria passing:

- Verdict header literal: '## APM Review Panel Verdict: REJECT'
- Top-loaded order: verdict -> required (3) -> nits (4) -> CEO -> per-persona
- All 6 personas spawned and reported back
- auth-expert correctly inactive with cited reason
- Verdict deterministic from required[] count (3 -> REJECT)
- panel-rejected label applied
- Comment count: 1 (single-writer interlock held)
- Per-persona detail in collapsed <details> block
- Python Architect class diagram passed through extras
- All 7 gh-aw jobs SUCCESS

Bonus: panel surfaced a real regression in PR #931 that prior
reviewers missed (proposed docstring silently drops gemini routing).
Three independent panelists converged with zero dissent.

Workflow run: github.com/microsoft/apm/actions/runs/25069734881
Full result: .apm/skills/apm-review-panel/evals/results-e2e-pr931-2026-04-28.md

Reverts the temporary microsoft/apm#refactor/review-panel-fanout pin
back to microsoft/apm#main. Recompiled lock file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(apm-review-panel): record label-reset workflow validation

Probe PR #1025 (closed). Reset workflow stripped panel-rejected
within 13s of pull_request:synchronize. The panel-approved path runs
identical code in the same loop, so it is covered by parity rather
than exercised directly.

Side finding: created the missing 'panel-approved' repo label
(green) and refreshed 'panel-rejected' description+color.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>