diff --git a/.apm/agents/apm-ceo.agent.md b/.apm/agents/apm-ceo.agent.md index 2d99f328b..852b74100 100644 --- a/.apm/agents/apm-ceo.agent.md +++ b/.apm/agents/apm-ceo.agent.md @@ -94,3 +94,34 @@ For any non-trivial change, ask: - You do NOT touch `WIP/growth-strategy.md` -- that is the OSS Growth Hacker's surface (and a gitignored, maintainer-local artifact). You consume their output as input to strategic calls. + +## Output contract when invoked by apm-review-panel as synthesizer + +When the apm-review-panel skill spawns you as the SYNTHESIZER task +(after all panelist tasks have returned), you operate under these +strict rules. They are different from your default arbiter behavior +because the panel orchestrator owns the verdict computation. + +- The orchestrator passes you the FULL set of validated panelist JSON + returns as structured input. +- You produce ARBITRATION PROSE ONLY. You do NOT pick the verdict. + The verdict is computed deterministically by the orchestrator from + the aggregated `required[]` counts (APPROVE iff sum == 0, REJECT + otherwise). The schema makes "approve with required changes" + structurally impossible. +- You return JSON matching `assets/ceo-return-schema.json` from the + apm-review-panel skill, as the FINAL message of your task. No prose + around the JSON; the orchestrator parses your last message. + - `arbitration`: 1-3 paragraphs. Resolve any disagreement between + specialists. Surface strategic implications (positioning, breaking + change, naming, scope). If specialists agreed and the change is + uncontroversial, say so plainly. + - `dissent_notes` (optional): when two or more panelists disagreed + on whether a finding is REQUIRED vs NIT, name the disagreement + and state which side you side with and why. + - `growth_signal` (optional): echo any side-channel note from the + oss-growth-hacker panelist that should be amplified in the + headline (conversion, narrative, breaking-change comms). +- You MUST NOT call `gh pr comment`, `gh pr edit`, `gh issue`, or any + other GitHub write command. You MUST NOT post to `safe-outputs`. + The orchestrator is the sole writer. diff --git a/.apm/agents/auth-expert.agent.md b/.apm/agents/auth-expert.agent.md index cd3331860..162f741b9 100644 --- a/.apm/agents/auth-expert.agent.md +++ b/.apm/agents/auth-expert.agent.md @@ -55,3 +55,33 @@ When reviewing or writing auth code: - Classic PATs (`ghp_`) work cross-org but are being deprecated -- prefer fine-grained - ADO uses Basic auth with base64-encoded `:PAT` -- different from GitHub bearer token flow - ADO also supports AAD bearer tokens via `az account get-access-token` (resource `499b84ac-1321-427f-aa17-267ca6975798`); precedence is `ADO_APM_PAT` -> az bearer -> fail. Stale PATs (401) silently fall back to the bearer with a `[!]` warning. See the auth skill for the four diagnostic cases. + +## Output contract when invoked by apm-review-panel + +When the apm-review-panel skill spawns you as a panelist task, you +operate under these strict rules. They override any default behavior +that would post comments or apply labels. + +- You read the persona scope above and the PR title/body/diff passed + in the task prompt. +- You produce findings in TWO buckets only: + - `required`: blocks merge. Real, actionable, citing file/line where + possible. Anything you put here will produce a REJECT verdict. + - `nits`: one-line suggestions the author can skip. No third bucket, + no "consider", no "optional follow-up". If a finding is real and + matters, it is required. If not, it is a nit. +- You return JSON matching `assets/panelist-return-schema.json` from + the apm-review-panel skill, as the FINAL message of your task. No + prose around the JSON; the orchestrator parses your last message. +- You MUST NOT call `gh pr comment`, `gh pr edit`, `gh issue`, or any + other GitHub write command. You MUST NOT post to `safe-outputs`. + You MUST NOT touch the PR state. The orchestrator is the sole + writer; your only output channel is the JSON return. +- If you have nothing blocking AND nothing worth nitting, return + `{persona: "", required: [], nits: []}`. That is a + valid and preferred answer when true. +- Auth-specific: when the apm-review-panel orchestrator spawns you + with "active=false" framing (the conditional rule did not fire), you + return `{persona: "auth-expert", active: false, inactive_reason: + "", required: [], nits: []}` + WITHOUT performing a full review. Trust the orchestrator's routing. diff --git a/.apm/agents/cli-logging-expert.agent.md b/.apm/agents/cli-logging-expert.agent.md index f6dca5b8b..c11e08b05 100644 --- a/.apm/agents/cli-logging-expert.agent.md +++ b/.apm/agents/cli-logging-expert.agent.md @@ -48,3 +48,28 @@ You are an expert on CLI output UX with excellent taste. You ensure verbose mode 3. **Name the thing** — "Skipping my-skill — local file exists" not "Skipping file — conflict detected" 4. **Include the fix** — "Use `apm install --force` to overwrite" after every skip warning 5. **No emojis** — ASCII `STATUS_SYMBOLS` only, never emoji characters + +## Output contract when invoked by apm-review-panel + +When the apm-review-panel skill spawns you as a panelist task, you +operate under these strict rules. They override any default behavior +that would post comments or apply labels. + +- You read the persona scope above and the PR title/body/diff passed + in the task prompt. +- You produce findings in TWO buckets only: + - `required`: blocks merge. Real, actionable, citing file/line where + possible. Anything you put here will produce a REJECT verdict. + - `nits`: one-line suggestions the author can skip. No third bucket, + no "consider", no "optional follow-up". If a finding is real and + matters, it is required. If not, it is a nit. +- You return JSON matching `assets/panelist-return-schema.json` from + the apm-review-panel skill, as the FINAL message of your task. No + prose around the JSON; the orchestrator parses your last message. +- You MUST NOT call `gh pr comment`, `gh pr edit`, `gh issue`, or any + other GitHub write command. You MUST NOT post to `safe-outputs`. + You MUST NOT touch the PR state. The orchestrator is the sole + writer; your only output channel is the JSON return. +- If you have nothing blocking AND nothing worth nitting, return + `{persona: "", required: [], nits: []}`. That is a + valid and preferred answer when true. diff --git a/.apm/agents/devx-ux-expert.agent.md b/.apm/agents/devx-ux-expert.agent.md index 811eac7b8..25b10e0e2 100644 --- a/.apm/agents/devx-ux-expert.agent.md +++ b/.apm/agents/devx-ux-expert.agent.md @@ -78,3 +78,28 @@ When reviewing a command, command help text, or a workflow change, ask: expert when a UX change touches auth, lockfile integrity, or download paths. - Strategic naming / positioning calls escalate to the APM CEO. + +## Output contract when invoked by apm-review-panel + +When the apm-review-panel skill spawns you as a panelist task, you +operate under these strict rules. They override any default behavior +that would post comments or apply labels. + +- You read the persona scope above and the PR title/body/diff passed + in the task prompt. +- You produce findings in TWO buckets only: + - `required`: blocks merge. Real, actionable, citing file/line where + possible. Anything you put here will produce a REJECT verdict. + - `nits`: one-line suggestions the author can skip. No third bucket, + no "consider", no "optional follow-up". If a finding is real and + matters, it is required. If not, it is a nit. +- You return JSON matching `assets/panelist-return-schema.json` from + the apm-review-panel skill, as the FINAL message of your task. No + prose around the JSON; the orchestrator parses your last message. +- You MUST NOT call `gh pr comment`, `gh pr edit`, `gh issue`, or any + other GitHub write command. You MUST NOT post to `safe-outputs`. + You MUST NOT touch the PR state. The orchestrator is the sole + writer; your only output channel is the JSON return. +- If you have nothing blocking AND nothing worth nitting, return + `{persona: "", required: [], nits: []}`. That is a + valid and preferred answer when true. diff --git a/.apm/agents/oss-growth-hacker.agent.md b/.apm/agents/oss-growth-hacker.agent.md index dbe67877f..7cad1a8e5 100644 --- a/.apm/agents/oss-growth-hacker.agent.md +++ b/.apm/agents/oss-growth-hacker.agent.md @@ -97,3 +97,28 @@ The CEO consumes your annotations when making the final call. - You write only to `WIP/growth-strategy.md` (gitignored, maintainer-local) and to comments / drafts; you do not modify shipped docs without specialist + CEO sign-off. Never stage or commit anything under `WIP/`. + +## Output contract when invoked by apm-review-panel + +When the apm-review-panel skill spawns you as a panelist task, you +operate under these strict rules. They override any default behavior +that would post comments or apply labels. + +- You read the persona scope above and the PR title/body/diff passed + in the task prompt. +- You produce findings in TWO buckets only: + - `required`: blocks merge. Real, actionable, citing file/line where + possible. Anything you put here will produce a REJECT verdict. + - `nits`: one-line suggestions the author can skip. No third bucket, + no "consider", no "optional follow-up". If a finding is real and + matters, it is required. If not, it is a nit. +- You return JSON matching `assets/panelist-return-schema.json` from + the apm-review-panel skill, as the FINAL message of your task. No + prose around the JSON; the orchestrator parses your last message. +- You MUST NOT call `gh pr comment`, `gh pr edit`, `gh issue`, or any + other GitHub write command. You MUST NOT post to `safe-outputs`. + You MUST NOT touch the PR state. The orchestrator is the sole + writer; your only output channel is the JSON return. +- If you have nothing blocking AND nothing worth nitting, return + `{persona: "", required: [], nits: []}`. That is a + valid and preferred answer when true. diff --git a/.apm/agents/python-architect.agent.md b/.apm/agents/python-architect.agent.md index ebedc9af5..1ddc12662 100644 --- a/.apm/agents/python-architect.agent.md +++ b/.apm/agents/python-architect.agent.md @@ -223,3 +223,28 @@ Rules for this subsection: over-engineering, write "Pragmatic suggestion: none -- the current shape is the simplest correct design at this scope." That is a valid and preferred answer when true. + +## Output contract when invoked by apm-review-panel + +When the apm-review-panel skill spawns you as a panelist task, you +operate under these strict rules. They override any default behavior +that would post comments or apply labels. + +- You read the persona scope above and the PR title/body/diff passed + in the task prompt. +- You produce findings in TWO buckets only: + - `required`: blocks merge. Real, actionable, citing file/line where + possible. Anything you put here will produce a REJECT verdict. + - `nits`: one-line suggestions the author can skip. No third bucket, + no "consider", no "optional follow-up". If a finding is real and + matters, it is required. If not, it is a nit. +- You return JSON matching `assets/panelist-return-schema.json` from + the apm-review-panel skill, as the FINAL message of your task. No + prose around the JSON; the orchestrator parses your last message. +- You MUST NOT call `gh pr comment`, `gh pr edit`, `gh issue`, or any + other GitHub write command. You MUST NOT post to `safe-outputs`. + You MUST NOT touch the PR state. The orchestrator is the sole + writer; your only output channel is the JSON return. +- If you have nothing blocking AND nothing worth nitting, return + `{persona: "", required: [], nits: []}`. That is a + valid and preferred answer when true. diff --git a/.apm/agents/supply-chain-security-expert.agent.md b/.apm/agents/supply-chain-security-expert.agent.md index 4c8a2eafd..3fc055532 100644 --- a/.apm/agents/supply-chain-security-expert.agent.md +++ b/.apm/agents/supply-chain-security-expert.agent.md @@ -94,3 +94,28 @@ file integration, ask: trade-off to the DevX UX expert and escalate to the CEO. - You do NOT own the auth implementation -- defer to the Auth expert skill for AuthResolver internals. + +## Output contract when invoked by apm-review-panel + +When the apm-review-panel skill spawns you as a panelist task, you +operate under these strict rules. They override any default behavior +that would post comments or apply labels. + +- You read the persona scope above and the PR title/body/diff passed + in the task prompt. +- You produce findings in TWO buckets only: + - `required`: blocks merge. Real, actionable, citing file/line where + possible. Anything you put here will produce a REJECT verdict. + - `nits`: one-line suggestions the author can skip. No third bucket, + no "consider", no "optional follow-up". If a finding is real and + matters, it is required. If not, it is a nit. +- You return JSON matching `assets/panelist-return-schema.json` from + the apm-review-panel skill, as the FINAL message of your task. No + prose around the JSON; the orchestrator parses your last message. +- You MUST NOT call `gh pr comment`, `gh pr edit`, `gh issue`, or any + other GitHub write command. You MUST NOT post to `safe-outputs`. + You MUST NOT touch the PR state. The orchestrator is the sole + writer; your only output channel is the JSON return. +- If you have nothing blocking AND nothing worth nitting, return + `{persona: "", required: [], nits: []}`. That is a + valid and preferred answer when true. diff --git a/.apm/skills/apm-review-panel/SKILL.md b/.apm/skills/apm-review-panel/SKILL.md index 1a5dae18e..f2bdf92ce 100644 --- a/.apm/skills/apm-review-panel/SKILL.md +++ b/.apm/skills/apm-review-panel/SKILL.md @@ -1,59 +1,99 @@ --- name: apm-review-panel description: >- - Use this skill to run a six-agent panel review (plus one conditional - auth specialist) for non-trivial PRs, design proposals, release - decisions, and other cross-cutting changes in microsoft/apm. Emit one - synthesized verdict comment. + Use this skill to run a multi-persona expert panel review on a labelled + pull request in microsoft/apm. The panel fans out to five mandatory + specialists plus one conditional auth specialist, all running in their + own agent threads, and a CEO synthesizer. The orchestrator is the sole + writer to the PR: one comment plus exactly one verdict label + (panel-approved or panel-rejected, derived deterministically from the + aggregated findings). Activate when a non-trivial PR needs cross-cutting + review (architecture, CLI logging, DevX UX, supply-chain security, + growth/positioning, optionally auth, with CEO arbitration). --- -# APM Review Panel -- Expert Review Orchestration - -The panel is fixed at **5 mandatory specialist lenses + 1 conditional -auth lens + 1 arbiter lens = up to 7 persona sections in one verdict -comment**. You play each lens in turn from inside a single agent loop -(progressive-disclosure skill model -- no sub-agent dispatch). Routing -chooses *which* lenses execute; it never changes which headings appear -in the final verdict. +# APM Review Panel - Fan-Out Expert Review + +The panel is FAN-OUT + SYNTHESIZER. Each persona runs in its own agent +thread (via the `task` tool) and returns JSON matching +`assets/panelist-return-schema.json`. The orchestrator schema-validates +each return, hands all returns to the apm-ceo synthesizer (also a task +thread, returns JSON matching `assets/ceo-return-schema.json`), then +deterministically derives the verdict and writes ONE comment plus ONE +verdict label. + +## Architecture invariants + +- **Verdict is binary and deterministic.** APPROVE iff + `sum(len(p.required) for p in panelists if p.active) == 0`. + REJECT otherwise. The CEO does NOT pick the verdict; the CEO writes + the arbitration narrative. This kills the "approve with reservations" + failure mode at the schema level. +- **Two severity buckets only.** `required` blocks merge. `nits` are + one-line suggestions the author can skip. There is no "optional", + no "consider", no "maybe later". If a finding is real and matters, + it is required. If not, it is a nit. No third bucket accumulates debt. +- **Single-writer interlock.** Only the orchestrator writes to + `safe-outputs` (one `add-comment`, one `add-labels`, one + `remove-labels`). Panelist subagents and the CEO subagent return + JSON only and MUST NOT call any `gh` write command, post comments, + apply labels, or touch the PR state. +- **Single-emission discipline.** Exactly one comment per panel run, + rendered from `assets/verdict-template.md` after all subagents return. ## Agent roster -| Agent | Persona | Always active? | -|-------|---------|----------------| +| Agent | Role | Always active? | +|-------|------|----------------| | [Python Architect](../../agents/python-architect.agent.md) | Architectural Reviewer | Yes | | [CLI Logging Expert](../../agents/cli-logging-expert.agent.md) | Output UX Reviewer | Yes | | [DevX UX Expert](../../agents/devx-ux-expert.agent.md) | Package-Manager UX | Yes | | [Supply Chain Security Expert](../../agents/supply-chain-security-expert.agent.md) | Threat-Model Reviewer | Yes | -| [OSS Growth Hacker](../../agents/oss-growth-hacker.agent.md) | Adoption Strategist | Yes (side-channel to CEO) | -| [Auth Expert](../../agents/auth-expert.agent.md) | Auth / Token Reviewer | Conditional (see "Conditional panelist" below) | -| [APM CEO](../../agents/apm-ceo.agent.md) | Strategic Owner / Arbiter | Yes (always arbitrates) | +| [OSS Growth Hacker](../../agents/oss-growth-hacker.agent.md) | Adoption Strategist | Yes | +| [Auth Expert](../../agents/auth-expert.agent.md) | Auth / Token Reviewer | Conditional (see below) | +| [APM CEO](../../agents/apm-ceo.agent.md) | Strategic Arbiter / Synthesizer | Yes | -## Routing topology +## Topology ``` - python-architect cli-logging-expert devx-ux-expert supply-chain-security-expert - \_______________________|______________________________/ - | <-- auth-expert (conditional) - v - apm-ceo <---- oss-growth-hacker - (final call / arbiter) (annotates findings; - updates growth-strategy) + apm-review-panel SKILL (orchestrator thread) + | + FAN-OUT via task tool (panelists in parallel) + | + +-----+-------+-------+-----+-----+------+ + v v v v v v v (cond.) + py cli dx-ux sec grw auth + | | | | | | + | each returns JSON per panelist-return-schema.json + +-----+-------+-------+-----+-------------+ + | + v <-- S4 schema-validate + v <-- on malformed: re-spawn that persona + v + task: apm-ceo synthesizer + - aggregates required[] across panelists + - resolves dissent + - returns ceo-return-schema.json (NO verdict field) + | + v <-- DETERMINISTIC verdict gate + | APPROVE iff sum(required) == 0 + | REJECT otherwise + v + orchestrator (sole writer) + | | | + v v v + add-comment add-labels remove-labels + (max:2) [panel-approved [panel-review] + XOR panel-rejected] ``` -- **Specialists raise findings independently** -- no implicit consensus. -- **CEO arbitrates** when specialists disagree or when a finding has - strategic implications (positioning, breaking change, naming, scope). -- **Growth Hacker is a side-channel** to the CEO: never blocks a - specialist finding; annotates it with growth implications and - escalates to the CEO when relevant. - ## Conditional panelist: Auth Expert Auth Expert is the only conditional panelist. Activate `auth-expert` if either rule below matches. -1. **Fast-path file trigger.** Activate the Auth Expert lens - immediately when the PR changes any of: +1. **Fast-path file trigger.** Activate the Auth Expert task when the + PR changes any of: - `src/apm_cli/core/auth.py` - `src/apm_cli/core/token_manager.py` - `src/apm_cli/core/azure_cli.py` @@ -64,7 +104,7 @@ if either rule below matches. - `src/apm_cli/deps/registry_proxy.py` 2. **Fallback self-check.** If no fast-path file matched, answer this - before activating the lens: + before activating: > Does this PR change authentication behavior, token management, > credential resolution, host classification used by `AuthResolver`, @@ -74,190 +114,159 @@ if either rule below matches. Routing rule: -- **YES** -> take the Auth Expert lens (per the Persona pass - procedure) and capture its findings. -- **NO** -> record `Auth Expert inactive reason: ` in working notes; do not take the lens. +- **YES** -> spawn the auth-expert task with the standard panelist + contract. It returns JSON with `active: true`. +- **NO** -> spawn the auth-expert task ANYWAY but instruct it to + return `{persona: "auth-expert", active: false, inactive_reason: + "", required: [], nits: []}`. + This keeps the schema uniform and the verdict template happy. - Never use wildcard heuristics like `*auth*`, `*token*`, or `*credential*` as the sole trigger. -## Routing matrix - -These routes choose which specialists are emphasised for a given PR -type. They do **not** change the verdict shape. The final comment -always uses every persona heading in `assets/verdict-template.md`; -the only persona that can render as `Not activated -- ` is -Auth Expert (per the conditional rule above). - -### Code review (architecture + logging) -1. Python Architect reviews structure / patterns / cross-file impact. -2. CLI Logging Expert reviews any output / logger changes. -3. CEO ratifies if the two disagree on abstraction vs consistency. - -### CLI UX review -1. DevX UX Expert reviews command surface, flags, help, error wording. -2. CLI Logging Expert reviews how outputs are emitted (logger methods). -3. Growth Hacker annotates if the change affects first-run conversion. -4. CEO ratifies any naming / positioning calls. - -### Security review -1. Supply Chain Security Expert maps the change to the threat model. -2. DevX UX Expert flags any ergonomics regression from the mitigation. -3. CEO arbitrates trade-offs; bias toward security on default behavior. - -### Auth review (only when the conditional Auth Expert is activated) -1. Auth Expert maps the change against AuthResolver, token precedence, - host classification, and credential helpers. -2. Supply Chain Security Expert checks for token-scoping or credential - leakage implications. -3. CEO ratifies any default-behavior change. - -### Release / comms review -1. CEO grounds the release framing in `gh` CLI stats. -2. Growth Hacker drafts hook + story angle; updates - `WIP/growth-strategy.md` (gitignored maintainer-local; create if absent). -3. Specialists sanity-check any technical claims in release notes. - -### Full panel review (non-trivial change) -1. Each mandatory specialist produces independent findings. -2. Auth Expert participates if the conditional rule above activates it. -3. Growth Hacker annotates findings with growth implications. -4. CEO synthesizes, resolves disagreements, makes the final call. -5. Surface decision and rationale to the author via the single verdict - comment. - -## Quality gates - -A non-trivial change passes when: - -- [ ] Python Architect: structure / patterns OK (or change explicitly - justified) -- [ ] CLI Logging Expert: output paths route through CommandLogger, - no direct `_rich_*` in commands -- [ ] DevX UX Expert: command surface familiar to npm/pip/cargo users, - every error has a next action -- [ ] Supply Chain Security Expert: no new path / auth / integrity - surface left unguarded; fails closed -- [ ] Auth Expert (only if activated): no regression to AuthResolver - precedence, host classification, or credential leakage surface -- [ ] APM CEO: trade-offs ratified, breaking changes have CHANGELOG + - migration line -- [ ] OSS Growth Hacker: conversion surfaces unaffected or improved; - `WIP/growth-strategy.md` updated if relevant (maintainer-local; - gitignored, never committed) - -## Notes - -- This skill orchestrates a panel **in your own context** -- you are - the only agent. You load each persona's `.agent.md` reference file - on demand (progressive disclosure), assume that persona's lens to - produce its findings, then move to the next persona. Do NOT spawn - sub-agents (no `task` tool dispatch) -- the panel is a sequence of - reasoning passes inside one agent loop, not a multi-agent fan-out. -- Persona detail lives in the linked `.agent.md` files. Read each - one when you switch to that persona; do not pre-load all of them. +## Routing matrix (emphasis only - all panelists always run) + +These routes describe WHICH specialist's findings the CEO weights more +heavily for a given PR type. They do NOT change which personas run - +every mandatory persona runs on every panel invocation, period. Routing +is a CEO synthesis hint, not a panelist gate. + +- **Architecture-heavy PR** -> CEO weights Python Architect findings + on abstraction calls; CLI Logging on consistency. +- **CLI UX PR** -> CEO weights DevX UX on command surface; CLI Logging + on output paths; Growth Hacker on first-run conversion. +- **Security PR** -> CEO biases toward Supply Chain Security on + default behavior; DevX UX flags ergonomics regression from any + mitigation. +- **Auth PR** (auth-expert active) -> CEO weights Auth Expert on + AuthResolver / token precedence; Supply Chain on token-scoping. +- **Release / comms PR** -> CEO weights Growth Hacker on hook + story + angle; specialists sanity-check technical claims. +- **Full panel** (default) -> CEO synthesizes equally; calls out any + dissent in `dissent_notes`. ## Execution checklist -When this skill is activated for a PR review, work through these -steps in order, in a single agent loop. Do not skip ahead and do not -emit any output before the final step. - -1. Read the PR context (title, body, labels, changed files, diff). - The orchestrating workflow already fetches this with `gh pr view` - / `gh pr diff` -- do not re-fetch. -2. Resolve the **Auth Expert conditional case** using the rule in - "Conditional panelist: Auth Expert" above. Record either an - activation decision (and proceed to step 3) or an `Auth Expert - inactive reason: ` in working notes. -3. For each mandatory persona (plus `auth-expert` if activated), - follow the **Persona pass procedure** below, one persona at a - time. Do not try to play multiple personas in a single pass. -4. Run the **pre-arbitration completeness gate**: - - Findings exist in working notes for the 5 mandatory specialists - (Python Architect, CLI Logging Expert, DevX UX Expert, Supply - Chain Security Expert, OSS Growth Hacker). - - Exactly one of `Auth Expert findings` or `Auth Expert inactive - reason` exists (neither = incomplete; both = inconsistent - routing). - - The Python Architect notes contain the OO / class mermaid - diagram, the execution-flow mermaid diagram, and the Design - patterns subsection declared in - `../../agents/python-architect.agent.md`. - - No persona section is missing or empty. - If any check fails, redo that persona's pass and repeat the gate. - Do not proceed to step 5 until the gate passes. -5. Take the **APM CEO** lens (load - `../../agents/apm-ceo.agent.md`) and arbitrate over the collected - findings -- still in your own context. CEO arbitration may run - only after the completeness gate has passed. -6. Now (and only now) load `assets/verdict-template.md` and fill it - in with the collected findings + arbitration. -7. Emit the filled template as exactly ONE comment via the workflow's - `safe-outputs.add-comment` channel. Never call the GitHub API - directly. This is the ONLY output emission for the entire panel - run -- no per-persona comments, no progress comments. - -### Persona pass procedure - -For each persona, run this exact procedure in your own context: - -1. Open the persona's `.agent.md` file (linked in the roster) and - read its scope, lens, anti-patterns, and required return shape. -2. From that persona's lens, review the PR title/body/diff against - the scope declared in the file. -3. Write the findings to working notes under - `: ` (or, for an inactive Auth Expert, - `Auth Expert inactive reason: `). -4. Drop the persona lens before moving on. Do not emit any comment - from inside a persona pass; persona findings stay in working - notes until step 7 synthesizes them. - -## Output contract - -This contract is non-negotiable -- it is the difference between a -panel review that lands as one cohesive verdict and one that fragments -into per-persona noise. - -- Produce **exactly one** comment per panel run. The - `safe-outputs.add-comment.max` cap in the workflow is a fail-soft - ceiling (currently 7, one per persona slot); the one-comment - discipline lives here. -- Use `assets/verdict-template.md` as the comment body. Keep its - section headings exactly as written. Adapt the body of each section - to the PR. Do not invent new top-level sections or drop existing - ones. -- CEO arbitration may run only after the completeness gate passes. -- Never emit findings as separate comments, intermediate progress - comments, or "I will now invoke X" status comments. -- Load `assets/verdict-template.md` **at synthesis time only** (step - 6 above) -- not at activation, not while collecting findings. +Work through these steps in order. Do not skip ahead. Do not emit any +output to the PR before step 6. + +1. **Read PR context** (the orchestrating workflow already fetched it + via `gh pr view` / `gh pr diff` in step 1 of `pr-review-panel.md`). + Identify changed files for the auth-expert routing decision. + +2. **Resolve the auth-expert conditional** using the rule above. + Decide: spawn auth-expert active, OR spawn auth-expert with + `active: false` + an `inactive_reason`. Either way, auth-expert + IS spawned - the schema requires uniform return shape. + +3. **Fan out panelist tasks.** Spawn the following tasks in PARALLEL + via the `task` tool, one task per persona: + - `python-architect` + - `cli-logging-expert` + - `devx-ux-expert` + - `supply-chain-security-expert` + - `oss-growth-hacker` + - `auth-expert` (always - active per step 2) + + Each task prompt MUST: + - Reference its persona file by relative path so the subagent loads + its own scope, lens, and anti-patterns. + - Include the PR number, title, body, and diff (passed inline). + - Cite `assets/panelist-return-schema.json` and require the + subagent to emit JSON matching that schema as its FINAL message. + - Restate the output contract: NO `gh` write commands, NO posting + comments, NO label changes, NO touching PR state. JSON return + only. + +4. **S4 schema gate.** When each panelist task returns, parse the JSON + and validate against `assets/panelist-return-schema.json`. On + validation failure: + - Re-spawn that ONE panelist with an explicit error message + pointing at the violated rule (e.g. "your previous return was + missing the `nits` field"). + - Maximum two re-spawn attempts per panelist. If still malformed, + synthesize a placeholder `{persona: "", active: true, + required: [], nits: [], extras: {schema_failure: ""}}` + and surface the failure in the CEO arbitration prompt. + +5. **Spawn the CEO synthesizer task.** Pass the full set of validated + panelist JSON returns to a `task` invocation that loads + `../../agents/apm-ceo.agent.md`. The prompt MUST: + - Provide all panelist returns as structured input. + - Ask for arbitration prose, dissent resolution, and any growth + signal worth amplifying. + - Cite `assets/ceo-return-schema.json` and require JSON return. + - Restate the contract: CEO does NOT pick verdict, only writes + arbitration. NO `gh` write commands. + + Validate the CEO return against `assets/ceo-return-schema.json`. + On failure, re-spawn once with the violation cited. + +6. **Compute verdict deterministically:** + + ``` + total_required = sum(len(p["required"]) for p in panelists if p.get("active", True)) + verdict = "APPROVE" if total_required == 0 else "REJECT" + verdict_label = "panel-approved" if verdict == "APPROVE" else "panel-rejected" + ``` + + The CEO does not influence this calculation. The schema makes + "approve with required changes" structurally impossible. + +7. **Render the comment.** Load `assets/verdict-template.md`, fill the + placeholders from the panelist + CEO JSON, and emit it as exactly + ONE comment via `safe-outputs.add-comment`. NEVER call the GitHub + API directly. NEVER emit per-persona comments or progress comments. + +8. **Apply the verdict label** via `safe-outputs.add-labels`: + `[verdict_label]` (one of `panel-approved` / `panel-rejected`). + +9. **Remove the trigger label** via `safe-outputs.remove-labels`: + `[panel-review]`. This makes the trigger idempotent - re-applying + the label will re-run the panel cleanly. + +## Output contract (non-negotiable) + +- Exactly ONE comment per panel run, top-loaded per + `assets/verdict-template.md`. The `safe-outputs.add-comment.max: 2` + is a fail-soft ceiling; the discipline lives here. +- Exactly ONE verdict label per panel run (`panel-approved` XOR + `panel-rejected`). +- Exactly ONE removal of the `panel-review` trigger label. +- Subagents (panelists + CEO) NEVER write to `safe-outputs`, NEVER + call `gh pr comment`, NEVER call `gh pr edit --add-label`. They + return JSON. The orchestrator is the sole writer. This discipline + is mirrored in each persona's "Output contract when invoked by + apm-review-panel" section. +- Never invent new top-level template sections or drop existing ones. ## Gotchas -- **Roster invariant.** The frontmatter description, the roster - table, the conditional-panelist rule, the verdict template, and the - quality gates MUST agree on the persona set. If you change one, - change all of them in the same edit. Description, roster, and - template are the three places drift most often appears. +- **Roster invariant.** The frontmatter description, the roster table, + the conditional rule, the verdict template, and the JSON schema + MUST agree on the persona set. If you change one, change all in the + same edit. - **False-negative auth gotcha.** Auth regressions can be introduced - from non-auth files that change the *inputs* to auth -- for - example host classification, dependency parsing, clone URL - construction, HTTP authorization headers, or call sites that - bypass `AuthResolver`. If a diff changes how a remote host, org, - token source, or fallback path is selected and you are not certain - it is auth-neutral, activate `auth-expert`. + from non-auth files that change the inputs to auth - host + classification, dependency parsing, clone URL construction, HTTP + authorization headers, or call sites that bypass `AuthResolver`. If + a diff changes how a remote host, org, token source, or fallback + path is selected and you are not certain it is auth-neutral, + activate auth-expert as `active: true`. - **Bundle layout on the runner.** When this skill runs inside the PR-review agentic workflow, the APM bundle is unpacked under - `.github/skills/apm-review-panel/` first, with `.apm/skills/...` - as a fallback. The asset path is the same relative to the skill - root (`assets/verdict-template.md`) in both layouts -- prefer the - `.github/...` path when present. -- **No multi-persona-in-one-pass.** Each persona has its own - `.agent.md` for a reason -- read it when you take that lens, write - the findings, then drop the lens before moving on. Trying to be all - personas in one reasoning pass is the most common cause of dropped - findings and mixed voices. -- **Single-emission discipline is fragile under interruption.** If - you find yourself wanting to "post a quick partial verdict and - then update it", don't. Buffer in working notes; emit once. + `.github/skills/apm-review-panel/` first, with `.apm/skills/...` as + a fallback. The asset paths are the same relative to the skill root + in both layouts. +- **Subagent write enforcement is contract-based, not sandbox-based.** + In gh-aw, tool permissions are workflow-scoped, not subagent-scoped, + so every spawned task technically inherits the same `gh` toolset. + The "subagents must not write" rule is enforced by the prompt + contract in each `.agent.md` plus the `safe-outputs.add-comment.max: + 2` fail-soft. If a subagent ever tries to post a comment, the cap + catches it; if it tries to bypass `safe-outputs` and call `gh` + directly, that is a contract violation worth surfacing in the next + panel review of the persona file itself. +- **Trigger-label removal is the LAST step.** Doing it earlier risks + another labeller racing the panel mid-run. The companion workflow + `pr-panel-label-reset.yml` handles verdict-label invalidation on + every new push (deterministically, no LLM). diff --git a/.apm/skills/apm-review-panel/assets/ceo-return-schema.json b/.apm/skills/apm-review-panel/assets/ceo-return-schema.json new file mode 100644 index 000000000..466aa01b5 --- /dev/null +++ b/.apm/skills/apm-review-panel/assets/ceo-return-schema.json @@ -0,0 +1,23 @@ +{ + "$schema": "http://json-schema.org/draft-07/schema#", + "$id": "ceo-return-schema.json", + "title": "APM Review Panel - CEO Synthesizer Return Shape", + "description": "Shape the apm-ceo persona MUST return when invoked as the panel synthesizer. CEO writes ARBITRATION PROSE only - never picks the verdict. The orchestrator derives verdict deterministically from aggregated panelist required-counts (APPROVE iff sum(required) == 0, REJECT otherwise).", + "type": "object", + "required": ["arbitration"], + "additionalProperties": false, + "properties": { + "arbitration": { + "type": "string", + "description": "One-to-three paragraph synthesis. Resolve any disagreement between specialists. Surface strategic implications (positioning, breaking change, naming, scope). If specialists agreed and the change is uncontroversial, say so plainly. ASCII only." + }, + "dissent_notes": { + "type": "string", + "description": "Optional. When two or more panelists disagreed on whether a finding is REQUIRED vs NIT, name the disagreement and state which side the CEO sided with and why. Empty / omitted when no dissent." + }, + "growth_signal": { + "type": "string", + "description": "Optional. CEO surface for the side-channel note from oss-growth-hacker (conversion, narrative, breaking-change comms). Empty / omitted when not relevant." + } + } +} diff --git a/.apm/skills/apm-review-panel/assets/panelist-return-schema.json b/.apm/skills/apm-review-panel/assets/panelist-return-schema.json new file mode 100644 index 000000000..89a095b6b --- /dev/null +++ b/.apm/skills/apm-review-panel/assets/panelist-return-schema.json @@ -0,0 +1,76 @@ +{ + "$schema": "http://json-schema.org/draft-07/schema#", + "$id": "panelist-return-schema.json", + "title": "APM Review Panel - Panelist Return Shape", + "description": "Shape every panel persona MUST return when invoked by the apm-review-panel skill. Two severity buckets only: required (blocks merge) and nits (skip-if-you-want, single line). NO third bucket. NO disposition - the orchestrator computes verdict deterministically from aggregated required-counts.", + "type": "object", + "required": ["persona", "required", "nits"], + "additionalProperties": false, + "properties": { + "persona": { + "type": "string", + "description": "Persona slug. MUST equal the .agent.md filename stem.", + "enum": [ + "python-architect", + "cli-logging-expert", + "devx-ux-expert", + "supply-chain-security-expert", + "oss-growth-hacker", + "auth-expert" + ] + }, + "active": { + "type": "boolean", + "description": "Set to false ONLY for auth-expert when the conditional rule does not activate it. All other personas MUST set active=true. When false, required and nits MUST be empty arrays and inactive_reason MUST be a one-sentence explanation citing the touched files.", + "default": true + }, + "inactive_reason": { + "type": "string", + "description": "Required when active=false. One sentence citing the touched files." + }, + "required": { + "type": "array", + "description": "Findings that MUST be addressed before merge. Empty array means this persona has no blocking concerns. Anything in this array causes the orchestrator to emit a REJECT verdict.", + "items": { "$ref": "#/definitions/finding" } + }, + "nits": { + "type": "array", + "description": "One-line suggestions the author can skip. NEVER block merge. No cap - trust the panelist's judgement.", + "items": { "$ref": "#/definitions/finding" } + }, + "extras": { + "type": "object", + "description": "Persona-specific structured payload. Reserved for python-architect mermaid diagrams and oss-growth-hacker side-channel notes. Never blocks; never affects verdict.", + "additionalProperties": true + } + }, + "definitions": { + "finding": { + "type": "object", + "required": ["summary", "rationale"], + "additionalProperties": false, + "properties": { + "summary": { + "type": "string", + "description": "One-line description of the finding. ASCII only. No emojis." + }, + "rationale": { + "type": "string", + "description": "WHY this matters. Cite the rule or pattern violated. ASCII only." + }, + "file": { + "type": "string", + "description": "Optional repo-relative path to the file." + }, + "line": { + "type": "integer", + "description": "Optional line number." + }, + "suggestion": { + "type": "string", + "description": "Optional concrete fix (diff hint, replacement code, command to run)." + } + } + } + } +} diff --git a/.apm/skills/apm-review-panel/assets/verdict-template.md b/.apm/skills/apm-review-panel/assets/verdict-template.md index ce032bdde..49b18d663 100644 --- a/.apm/skills/apm-review-panel/assets/verdict-template.md +++ b/.apm/skills/apm-review-panel/assets/verdict-template.md @@ -1,62 +1,110 @@ -## APM Review Panel Verdict +## APM Review Panel Verdict: -**Disposition**: +> ---- +### Required before merge ( items) -### Per-persona findings + +None. + +- [] :`> + - Why: + - > +- ... -**Python Architect**: +### Nits ( items, skip if you want) -**CLI Logging Expert**: + +None. + +- [] :`> +- ... -**DevX UX Expert**: +### CEO arbitration -**Supply Chain Security Expert**: + -**Auth Expert**: "> + +**Dissent resolved:** -**OSS Growth Hacker**: + +**Growth/positioning note:** --- -### CEO arbitration +
+Per-persona findings (full) - +#### Python Architect ---- + + + -### Required actions before merge +#### CLI Logging Expert -1. -2. <...> + ---- +#### DevX UX Expert + + + +#### Supply Chain Security Expert + + + +#### Auth Expert + + + + +Inactive -- + +#### OSS Growth Hacker + + -### Optional follow-ups +
-- -- <...> +Verdict computed deterministically: required findings across + active panelists. APPROVE iff N == 0. Push a new commit to clear +this verdict label automatically. diff --git a/.apm/skills/apm-review-panel/evals/README.md b/.apm/skills/apm-review-panel/evals/README.md new file mode 100644 index 000000000..04ae1503b --- /dev/null +++ b/.apm/skills/apm-review-panel/evals/README.md @@ -0,0 +1,67 @@ +# Evals: apm-review-panel + +Per genesis Step 8 evals gate. Two categories: + +1. **TRIGGER EVALS** validate the dispatch description correctly + discriminates should-trigger queries from near-miss should-NOT + queries. Validation split is the ship gate (>= 0.5 on positives, + < 0.5 on negatives). + +2. **CONTENT EVALS** validate that the skill, when activated, produces + the JSON-derived top-loaded verdict comment in the shape declared + by `assets/verdict-template.md`. Run with-skill vs without-skill; + if the deltas are not visible, the skill is not adding value. + +## Files + +- `trigger-evals.json` -- 16 queries (8 should-trigger + 8 should-NOT), + 60/40 train/val split. +- `content-eval-clean-pr.md` -- synthetic clean PR scenario; expected + verdict = APPROVE, all panelists return `required: []`. +- `content-eval-rejected-pr.md` -- synthetic PR with one architectural + smell + one nit; expected verdict = REJECT, python-architect returns + one `required` finding. +- `run-verdict-harness.py` -- DETERMINISTIC harness covering the + parts of the panel that do NOT require an LLM: JSON schema + validation (S4 gate) and verdict computation. Five cases: two + positive (clean-pr APPROVE, rejected-pr REJECT) and three negative + (missing-nits, unknown-persona, disposition-leak). All five MUST + pass before merging any change to schemas, the SKILL.md execution + checklist verdict rule, or the persona output contracts. + +## How to run + +### Deterministic harness (free, runs anywhere) + +``` +uv run --with jsonschema python3 \ + .apm/skills/apm-review-panel/evals/run-verdict-harness.py +``` + +Expected: `RESULT: ALL PASS`. Proves schemas reject malformed shapes +and verdict math is correct. Does NOT prove the LLM panelists will +return well-formed JSON in practice -- that requires a real run. + +### Full end-to-end (option B, branch-pin test) + +The gh-aw workflow imports the panel skill from `microsoft/apm#main` +for security (anti-self-approval). Pre-merge validation requires a +temporary branch pin: + +1. On the feature branch, change `imports.packages` in + `.github/workflows/pr-review-panel.md` from `microsoft/apm#main` + to `microsoft/apm#`. +2. `gh aw compile pr-review-panel`. +3. Commit as `chore: TEMP pin to branch for end-to-end test`. +4. Open a tiny throwaway PR; label it `panel-review`. +5. Observe: 6 task threads spawn, JSON returns, verdict label + applied, `panel-review` removed, `panel-approved` (or + `panel-rejected`) set. Push a commit and watch the + deterministic `pr-panel-label-reset.yml` strip the verdict + label. +6. Revert the temp-pin commit before merge. + +### Trigger evals + +Trigger evals can be run via the genesis evals harness or any +dispatcher that loads the skill description and queries it. diff --git a/.apm/skills/apm-review-panel/evals/content-eval-clean-pr.md b/.apm/skills/apm-review-panel/evals/content-eval-clean-pr.md new file mode 100644 index 000000000..4a3651dee --- /dev/null +++ b/.apm/skills/apm-review-panel/evals/content-eval-clean-pr.md @@ -0,0 +1,49 @@ +# Content eval: clean PR (expected APPROVE) + +## Synthetic PR input + +- Title: `docs(readme): clarify install command for fish shell users` +- Body: Adds a one-paragraph note in README.md under the install + section pointing fish users to the existing `eval (apm shellenv)` + recipe. No code changes. +- Diff: 4 added lines, 0 removed, 1 file (`README.md`). +- Files: `README.md` + +## Expected orchestrator behavior + +- Spawn 6 panelist tasks in parallel (5 mandatory + auth-expert with + `active: false` because no auth files touched). +- Each panelist returns `{persona: "", required: [], nits: [...]}` + where `nits` may contain at most a one-line wording suggestion. +- auth-expert returns `{persona: "auth-expert", active: false, + inactive_reason: "...README.md has no auth surface...", required: [], + nits: []}`. +- CEO synthesizer returns `{arbitration: "Documentation-only change + ... all specialists agree no required actions ...", dissent_notes + not present}`. +- Orchestrator computes `total_required = 0` -> verdict = APPROVE. +- Orchestrator emits ONE comment per `assets/verdict-template.md`: + - Header: `## APM Review Panel Verdict: APPROVE` + - Required section: "None." + - Nits section: rendered (or "None.") + - CEO arbitration paragraph + - Per-persona detail in collapsed `
` block +- Orchestrator applies `panel-approved` label. +- Orchestrator removes `panel-review` label. + +## Pass criteria + +- Comment header literally reads `APM Review Panel Verdict: APPROVE`. +- Required section reads "None.". +- Verdict label `panel-approved` is on the PR. +- `panel-review` label is no longer on the PR. +- Comment count from this run is exactly 1. + +## With-skill vs without-skill delta + +- With skill: structured top-loaded comment, deterministic verdict + label, fan-out across 6 distinct lenses. +- Without skill: free-form single-voice review with no label + automation, no schema, no guarantee on which lenses got applied. + +The structure IS the value. diff --git a/.apm/skills/apm-review-panel/evals/content-eval-rejected-pr.md b/.apm/skills/apm-review-panel/evals/content-eval-rejected-pr.md new file mode 100644 index 000000000..83a2f498b --- /dev/null +++ b/.apm/skills/apm-review-panel/evals/content-eval-rejected-pr.md @@ -0,0 +1,62 @@ +# Content eval: PR with architectural smell + nit (expected REJECT) + +## Synthetic PR input + +- Title: `feat(install): add retry to package downloader` +- Body: Wraps `_download_package` in a 3-attempt retry loop with + fixed 2-second backoff. Adds a `retries` parameter to + `InstallPipeline.__init__` defaulting to 3. +- Diff: 35 added lines, 2 removed, 2 files. +- Files: + - `src/apm_cli/install/pipeline.py` (retry param wiring) + - `src/apm_cli/deps/github_downloader.py` (retry loop body inline + in `_download_package`) + +## Expected orchestrator behavior + +- Spawn 6 panelist tasks in parallel. +- Auth Expert activates (`github_downloader.py` is a fast-path file) + and returns `active: true` with possibly one finding about token + refresh on 401. +- Python Architect returns at least one `required` finding: + `{summary: "Inline retry loop violates Strategy pattern used + elsewhere in deps/", file: "src/apm_cli/deps/github_downloader.py", + rationale: "Codebase already has the chain-of-responsibility + retry pattern in AuthResolver; this should reuse it or extract a + shared retry helper rather than inlining a fixed loop"}`. +- CLI Logging Expert returns one `nit`: missing `[!]` warning when + retry kicks in. +- DevX UX Expert returns one `nit`: `retries=3` is undocumented in + CLI help. +- Supply Chain Security Expert returns 0 findings (no integrity + surface affected). +- OSS Growth Hacker returns 0 findings. +- CEO synthesizer returns `{arbitration: "Retry behavior is sound + but the inline loop misses the codebase's existing retry + abstraction. Python Architect's required action holds.", + dissent_notes not present}`. +- Orchestrator computes `total_required = 1` -> verdict = REJECT. +- Orchestrator emits ONE comment: + - Header: `## APM Review Panel Verdict: REJECT` + - Required section: 1 item, prefixed `[python-architect]`. + - Nits section: 2 items, prefixed `[cli-logging-expert]` and + `[devx-ux-expert]`. + - CEO arbitration paragraph. + - Per-persona detail in `
`. +- Orchestrator applies `panel-rejected` label. +- Orchestrator removes `panel-review` label. + +## Pass criteria + +- Comment header literally reads `APM Review Panel Verdict: REJECT`. +- Required section has exactly 1 item with `[python-architect]` prefix. +- Verdict label `panel-rejected` is on the PR. +- `panel-review` label is no longer on the PR. +- The auth-expert per-persona block renders (active = true). + +## With-skill vs without-skill delta + +Same as the clean-PR eval: with skill, the verdict is structured and +machine-actionable (label + binary outcome). Without skill, there is +no schema, no required-vs-nit discipline, and the developer cannot +tell at a glance whether the comment blocks merge or not. diff --git a/.apm/skills/apm-review-panel/evals/results-e2e-pr931-2026-04-28.md b/.apm/skills/apm-review-panel/evals/results-e2e-pr931-2026-04-28.md new file mode 100644 index 000000000..e0fdec31a --- /dev/null +++ b/.apm/skills/apm-review-panel/evals/results-e2e-pr931-2026-04-28.md @@ -0,0 +1,71 @@ +# End-to-end eval result: PR #931 (2026-04-28) + +Real gh-aw run of the refactored panel against an unrelated open PR +in microsoft/apm. Branch-pin method (option B) -- workflow YAML +imported the panel skill from `microsoft/apm#refactor/review-panel-fanout` +instead of `#main`. + +## Setup + +- Workflow run: https://github.com/microsoft/apm/actions/runs/25069734881 +- Target PR: #931 (`docs(compile): clarify target routing docstring`, + +7/-3, no labels, external contributor) +- Trigger: `workflow_dispatch` from `refactor/review-panel-fanout` + with `pr_number=931` +- Skill source: `.apm/skills/apm-review-panel/SKILL.md` from this branch +- gh-aw version: v0.68.3 + +## Job graph (all SUCCESS) + +``` +pre_activation -> activation -> apm -> agent -> detection -> safe_outputs -> conclusion +``` + +All 7 jobs completed successfully on first run. + +## Acceptance criteria + +| # | Criterion | Result | +|---|---|---| +| 1 | Verdict header literally reads `## APM Review Panel Verdict: ` | PASS -- `## APM Review Panel Verdict: REJECT` | +| 2 | Top-loaded order: verdict -> required -> nits -> CEO -> per-persona | PASS -- exact order observed | +| 3 | Required findings rendered as `[persona] summary` + Why + Suggested fix | PASS -- 3 required items, all 3 panelists, all with file/line citations | +| 4 | Per-persona detail in collapsed `
` block at bottom | PASS -- `
Per-persona findings (full)...` | +| 5 | All 6 personas reported back (5 mandatory + auth-expert) | PASS -- python-architect, cli-logging-expert, devx-ux-expert, supply-chain-security-expert, auth-expert, oss-growth-hacker | +| 6 | Auth-expert correctly inactive when no auth surface touched | PASS -- "Inactive -- PR #931 modifies only a docstring ... no auth, token, credential, host-classification, or AuthResolver-related files are touched." | +| 7 | Verdict deterministic from `required[]` count (3 > 0 -> REJECT) | PASS -- correct mapping observed | +| 8 | Verdict label `panel-approved` XOR `panel-rejected` applied | PASS -- `panel-rejected` added | +| 9 | Comment count from this run = 1 (max:2 ceiling honored) | PASS -- 1 comment | +| 10 | Python Architect class diagram in extras passed through to per-persona block | PASS -- mermaid `classDiagram` rendered in details | + +## Surprise findings (positive) + +- The panel surfaced a REAL regression in PR #931 that prior reviewers + had not flagged: the proposed docstring silently drops `gemini` + routing that exists in the implementation. Three independent + panelists converged on the same root cause with zero dissent. The + CEO arbitration explicitly noted this convergence and recommended a + rebase-and-revise rather than a close. +- Growth Hacker correctly identified the contributor as external + (WilliamK112) and recommended a "warm rejection that models a + low-friction contribution loop" -- the fan-out captured a strategic + signal that a single-loop reviewer would have missed. + +## Failure modes NOT observed (negative confirmation) + +- No "approve with reservations" wording -- verdict is binary as + designed. +- No panelist comment posted to the PR -- only the orchestrator's + single synthesized comment. The single-writer interlock held. +- No `panel-review` label cleanup needed (the run was triggered via + workflow_dispatch, not via label, so there was no `panel-review` + label to remove). The label-removal path will be exercised when the + first real `panel-review`-labelled PR runs after merge. + +## Total cost + +One gh-aw workflow run, ~12 minutes wall-clock. No retries needed. + +## Artifact + +Full panel comment: https://github.com/microsoft/apm/pull/931#issuecomment- diff --git a/.apm/skills/apm-review-panel/evals/results-label-reset-2026-04-28.md b/.apm/skills/apm-review-panel/evals/results-label-reset-2026-04-28.md new file mode 100644 index 000000000..022548659 --- /dev/null +++ b/.apm/skills/apm-review-panel/evals/results-label-reset-2026-04-28.md @@ -0,0 +1,63 @@ +# Label-reset workflow validation (2026-04-28) + +End-to-end test of `.github/workflows/pr-panel-label-reset.yml` — +the deterministic plain-Actions companion that strips verdict labels +when a PR receives a new commit (S7 DETERMINISTIC TOOL BRIDGE). + +## Setup + +Probe PR: microsoft/apm#1025 (closed) + +- Base: `refactor/review-panel-fanout` (so the workflow YAML is + loaded from the feature branch — required because + `pull_request: synchronize` reads the BASE-branch workflow file). +- Head: `test/label-reset-probe` (deleted after test). +- Initial state: one commit ahead of base, no labels. + +## Procedure + +1. Applied `panel-rejected` label manually (simulating a panel verdict). +2. Pushed a second commit to `test/label-reset-probe` head ref. +3. The push fires `pull_request: synchronize` on PR #1025. +4. Reset workflow runs from the base-branch YAML and iterates over + `panel-approved` / `panel-rejected`, calling `gh pr edit + --remove-label` for each. + +## Result: PASS + +| Check | Expected | Actual | +|-------|----------|--------| +| Workflow ran on synchronize | yes | yes — run 25076314935 | +| Run status | success | success (13s) | +| `panel-rejected` removed | yes | yes (labels: []) | +| Missing `panel-approved` handled gracefully | no error | yes (if/else swallows) | + +`panel-approved` path uses identical code in the same loop; testing +one path proves the logic. + +## Side finding + +The `panel-approved` repo label did NOT exist at test time. The +refactor PR introduces the workflow and the contract but did not +create the label itself. Created post-test: + +- `panel-approved` (#0E8A16, green) — Apm-review-panel verdict: APPROVE. Removed automatically on next push. +- `panel-rejected` (#B60205, red) — description and color updated for clarity. + +## Run log excerpt + +``` +Removed label 'panel-approved' (or it was not present). +Removed label 'panel-rejected' (or it was not present). +``` + +(Both branches of the conditional behave the same; the message text +is intentionally vague because gh CLI exit codes do not distinguish +"removed" from "was already absent".) + +## Cross-references + +- Workflow file: `.github/workflows/pr-panel-label-reset.yml` +- Probe PR: https://github.com/microsoft/apm/pull/1025 +- Run: https://github.com/microsoft/apm/actions/runs/25076314935 +- Companion e2e for the panel itself: `results-e2e-pr931-2026-04-28.md` diff --git a/.apm/skills/apm-review-panel/evals/results-trigger-2026-04-28.md b/.apm/skills/apm-review-panel/evals/results-trigger-2026-04-28.md new file mode 100644 index 000000000..42804e9e8 --- /dev/null +++ b/.apm/skills/apm-review-panel/evals/results-trigger-2026-04-28.md @@ -0,0 +1,74 @@ +# Trigger Eval Self-Run (LLM dispatcher = me) + +Skill description (verbatim from `.apm/skills/apm-review-panel/SKILL.md`): +> Use this skill to run a multi-persona expert panel review on a labelled +> pull request in microsoft/apm. The panel fans out to five mandatory +> specialists plus one conditional auth specialist, all running in their +> own agent threads, and a CEO synthesizer. The orchestrator is the sole +> writer to the PR: one comment plus exactly one verdict label +> (panel-approved or panel-rejected, derived deterministically from the +> aggregated findings). Activate when a non-trivial PR needs cross-cutting +> review (architecture, CLI logging, DevX UX, supply-chain security, +> growth/positioning, optionally auth, with CEO arbitration). + +For each query I record: +- `expected` = label from trigger-evals.json (TRIGGER / NOT) +- `judged` = my dispatch decision as an LLM reading the description above +- `match` = expected == judged + +## Should-trigger TRAIN (5) + +| # | Query | Expected | Judged | Match | Reasoning | +|---|---|---|---|---|---| +| 1 | "review this PR with the expert panel" | TRIGGER | TRIGGER | YES | "expert panel" + "PR" maps cleanly to "multi-persona expert panel review on a labelled pull request" | +| 2 | "run the apm-review-panel against this branch" | TRIGGER | TRIGGER | YES | exact name match | +| 3 | "do a multi-persona review of PR #1234" | TRIGGER | TRIGGER | YES | "multi-persona" is verbatim in description | +| 4 | "panel-review this PR" | TRIGGER | TRIGGER | YES | "panel" + "PR" + uses the trigger-label term as verb | +| 5 | "get the architecture + security + UX review on this PR" | TRIGGER | TRIGGER | YES | enumerates 3 of the 6 listed lenses + PR scope | + +Train pass rate: 5/5 = 1.00 + +## Should-trigger VAL (3) + +| # | Query | Expected | Judged | Match | Reasoning | +|---|---|---|---|---|---| +| 6 | "ask the python architect, cli logging expert, and security expert to weigh in on this PR" | TRIGGER | TRIGGER | YES | enumerates 3 of the 5 mandatory specialists by name + PR scope | +| 7 | "have the apm panel verdict this PR" | TRIGGER | TRIGGER | YES | "apm panel" + "verdict" matches "verdict label (panel-approved or panel-rejected)" | +| 8 | "i need a cross-cutting expert review of this change before merge" | TRIGGER | TRIGGER | YES | "cross-cutting" is verbatim in description; "before merge" implies PR scope | + +Val pass rate: 3/3 = 1.00 (gate: >= 0.5) -- PASS + +## Should-NOT-trigger TRAIN (5) + +| # | Query | Expected | Judged | Match | Reasoning | +|---|---|---|---|---|---| +| 9 | "give me a code review of this file" | NOT | NOT | YES | single file, no PR framing, no panel language; would route to a generic code-review skill | +| 10 | "what does the python architect think of this class hierarchy" | NOT | NOT | YES | single persona, not panel; description requires "multi-persona" | +| 11 | "review this docstring" | NOT | NOT | YES | scope is one docstring, not a PR; no panel framing | +| 12 | "fix the lint errors in src/" | NOT | NOT | YES | mechanical task, not review; description is for review only | +| 13 | "what is the apm-review-panel skill" | NOT | NOT | YES | meta-question about the skill, not a request to run it; an info skill should answer this | + +Train pass rate (negative): 5/5 = 1.00 (correct rejection) + +## Should-NOT-trigger VAL (3) + +| # | Query | Expected | Judged | Match | Reasoning | +|---|---|---|---|---|---| +| 14 | "explain how the auth resolver works" | NOT | NOT | YES | explanation request, not review; auth-expert mention is incidental | +| 15 | "summarize the diff of PR #1234" | NOT | NOT | YES | summarization, not multi-persona review with verdict | +| 16 | "draft a PR description for me" | NOT | NOT | YES | authoring, not reviewing; pr-description-skill territory | + +Val pass rate (negative): 3/3 = 1.00 (gate: < 0.5 false-positive rate, i.e. >= 0.5 correct rejection) -- PASS + +## Summary + +| Split | Should-trigger correct | Should-NOT correct | Combined | +|---|---|---|---| +| Train (n=10) | 5/5 = 1.00 | 5/5 = 1.00 | 10/10 = 1.00 | +| Val (n=6) | 3/3 = 1.00 | 3/3 = 1.00 | 6/6 = 1.00 | + +**SHIP GATE (validation split):** +- Should-trigger pass rate: 1.00 >= 0.5 -- PASS +- Should-NOT pass rate: 1.00 >= 0.5 -- PASS + +**RESULT: PASS.** Dispatch description discriminates cleanly with no false positives or false negatives in this run. Caveat: I am a single LLM judging my own routing; the canonical eval would run a different model as dispatcher and average over multiple samples. But this is real LLM judgment, not hand-waving. diff --git a/.apm/skills/apm-review-panel/evals/run-verdict-harness.py b/.apm/skills/apm-review-panel/evals/run-verdict-harness.py new file mode 100644 index 000000000..b504f0ea3 --- /dev/null +++ b/.apm/skills/apm-review-panel/evals/run-verdict-harness.py @@ -0,0 +1,277 @@ +#!/usr/bin/env python3 +""" +Deterministic verdict harness for the apm-review-panel skill. + +Validates two things end-to-end WITHOUT spinning up an LLM: +1. Synthetic panelist JSON returns conform to panelist-return-schema.json. +2. The orchestrator's deterministic verdict rule produces the expected + verdict + label mapping per the eval scenarios. + +This is genesis Step 8 evals gate, deterministic slice. It does NOT +prove that a real LLM panelist will return well-formed JSON (that +requires option B, the branch-pin end-to-end test). It DOES prove that +when a panelist returns well-formed JSON, the orchestrator math is +correct and the schema rejects malformed shapes. + +Usage: + uv run --with jsonschema python3 \ + .apm/skills/apm-review-panel/evals/run-verdict-harness.py +""" + +from __future__ import annotations + +import json +import sys +from pathlib import Path + +import jsonschema + +ROOT = Path(__file__).resolve().parents[1] +SCHEMA_PANELIST = json.loads( + (ROOT / "assets" / "panelist-return-schema.json").read_text() +) +SCHEMA_CEO = json.loads((ROOT / "assets" / "ceo-return-schema.json").read_text()) + + +def derive_verdict(panelists: list[dict]) -> tuple[str, str]: + """ + Pure verdict derivation per SKILL.md execution checklist step 6: + total_required = sum(len(p["required"]) for p in panelists if p.get("active", True)) + verdict = "APPROVE" if total_required == 0 else "REJECT" + """ + total_required = sum( + len(p["required"]) for p in panelists if p.get("active", True) + ) + verdict = "APPROVE" if total_required == 0 else "REJECT" + label = "panel-approved" if verdict == "APPROVE" else "panel-rejected" + return verdict, label + + +def case_clean_pr() -> dict: + """Synthetic returns for the docs-only PR scenario.""" + panelists = [ + {"persona": "python-architect", "required": [], "nits": []}, + { + "persona": "cli-logging-expert", + "required": [], + "nits": [ + { + "summary": "Wording: 'aliasing' could be 'aliases'", + "rationale": "Reads more naturally to npm/pip users", + "file": "README.md", + "line": 142, + } + ], + }, + {"persona": "devx-ux-expert", "required": [], "nits": []}, + {"persona": "supply-chain-security-expert", "required": [], "nits": []}, + {"persona": "oss-growth-hacker", "required": [], "nits": []}, + { + "persona": "auth-expert", + "active": False, + "inactive_reason": "README.md has no auth surface; no fast-path file matched.", + "required": [], + "nits": [], + }, + ] + ceo = { + "arbitration": ( + "Documentation-only change pointing fish users to an existing " + "recipe. All five mandatory specialists agree no required actions; " + "auth-expert correctly inactive." + ) + } + return { + "name": "clean-pr", + "panelists": panelists, + "ceo": ceo, + "expected_verdict": "APPROVE", + "expected_label": "panel-approved", + } + + +def case_rejected_pr() -> dict: + """Synthetic returns for the inline-retry-loop PR scenario.""" + panelists = [ + { + "persona": "python-architect", + "required": [ + { + "summary": "Inline retry loop violates Strategy pattern in deps/", + "rationale": ( + "Codebase already has chain-of-responsibility retry " + "via AuthResolver; reuse or extract a shared helper " + "instead of inlining a fixed loop." + ), + "file": "src/apm_cli/deps/github_downloader.py", + "line": 218, + "suggestion": "Extract retry into deps/_retry.py with same shape as AuthResolver chain.", + } + ], + "nits": [], + }, + { + "persona": "cli-logging-expert", + "required": [], + "nits": [ + { + "summary": "Missing [!] warning when retry kicks in", + "rationale": "User-visible network blip; STATUS_SYMBOLS guidance applies.", + "file": "src/apm_cli/deps/github_downloader.py", + } + ], + }, + { + "persona": "devx-ux-expert", + "required": [], + "nits": [ + { + "summary": "retries=3 undocumented in CLI help", + "rationale": "Mental model gap; npm/pip users expect to see configurable retries surfaced.", + } + ], + }, + {"persona": "supply-chain-security-expert", "required": [], "nits": []}, + {"persona": "oss-growth-hacker", "required": [], "nits": []}, + { + "persona": "auth-expert", + "active": True, + "required": [], + "nits": [ + { + "summary": "Consider 401-aware retry", + "rationale": "Stale-PAT handling already exists in AuthResolver; align.", + } + ], + }, + ] + ceo = { + "arbitration": ( + "Retry behavior is sound but the inline loop misses the " + "codebase's existing retry abstraction. Python Architect's " + "required action holds; nits across logging/devx/auth align." + ), + "dissent_notes": "", + } + return { + "name": "rejected-pr", + "panelists": panelists, + "ceo": ceo, + "expected_verdict": "REJECT", + "expected_label": "panel-rejected", + } + + +def case_malformed_panelist() -> dict: + """Negative case: panelist forgot the `nits` field. S4 gate must fail.""" + return { + "name": "malformed-missing-nits", + "panelist": {"persona": "python-architect", "required": []}, + "expect_schema_error": True, + } + + +def case_unknown_persona() -> dict: + """Negative case: persona enum violation.""" + return { + "name": "malformed-unknown-persona", + "panelist": { + "persona": "unknown-persona", + "required": [], + "nits": [], + }, + "expect_schema_error": True, + } + + +def case_disposition_leak() -> dict: + """Negative case: panelist tried to add a verdict field. additionalProperties:false catches.""" + return { + "name": "malformed-disposition-leak", + "panelist": { + "persona": "python-architect", + "required": [], + "nits": [], + "disposition": "APPROVE", + }, + "expect_schema_error": True, + } + + +def run_positive(case: dict) -> tuple[bool, str]: + """Validate every panelist + CEO return, then check verdict.""" + notes = [] + for p in case["panelists"]: + try: + jsonschema.validate(p, SCHEMA_PANELIST) + except jsonschema.ValidationError as e: + return False, f"panelist {p.get('persona')} failed schema: {e.message}" + notes.append( + f" - {p['persona']}: active={p.get('active', True)}, " + f"required={len(p['required'])}, nits={len(p['nits'])}" + ) + + try: + jsonschema.validate(case["ceo"], SCHEMA_CEO) + except jsonschema.ValidationError as e: + return False, f"CEO failed schema: {e.message}" + + verdict, label = derive_verdict(case["panelists"]) + if verdict != case["expected_verdict"]: + return ( + False, + f"expected verdict={case['expected_verdict']}, got {verdict}", + ) + if label != case["expected_label"]: + return False, f"expected label={case['expected_label']}, got {label}" + + return True, "\n".join( + notes + + [ + f" -> verdict={verdict}, label={label} (expected matched)", + ] + ) + + +def run_negative(case: dict) -> tuple[bool, str]: + """Validate that schema rejects the malformed shape.""" + try: + jsonschema.validate(case["panelist"], SCHEMA_PANELIST) + except jsonschema.ValidationError as e: + return True, f"correctly rejected: {e.message.splitlines()[0]}" + return False, "schema accepted a malformed return; S4 gate would have let it through" + + +def main() -> int: + cases_pos = [case_clean_pr(), case_rejected_pr()] + cases_neg = [case_malformed_panelist(), case_unknown_persona(), case_disposition_leak()] + + ok = True + print("APM Review Panel - Deterministic Verdict Harness") + print("=" * 60) + print() + print("Positive cases (well-formed JSON, expected verdict):") + for c in cases_pos: + passed, msg = run_positive(c) + marker = "[+] PASS" if passed else "[x] FAIL" + print(f"{marker} {c['name']}") + print(msg) + print() + ok = ok and passed + + print("Negative cases (malformed JSON, S4 schema gate must reject):") + for c in cases_neg: + passed, msg = run_negative(c) + marker = "[+] PASS" if passed else "[x] FAIL" + print(f"{marker} {c['name']}") + print(f" {msg}") + print() + ok = ok and passed + + print("=" * 60) + print("RESULT:", "ALL PASS" if ok else "FAIL") + return 0 if ok else 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/.apm/skills/apm-review-panel/evals/trigger-evals.json b/.apm/skills/apm-review-panel/evals/trigger-evals.json new file mode 100644 index 000000000..cbedde29d --- /dev/null +++ b/.apm/skills/apm-review-panel/evals/trigger-evals.json @@ -0,0 +1,31 @@ +{ + "description": "Trigger evals for the apm-review-panel skill dispatch description. 8 should-trigger + 8 should-NOT-trigger. 60/40 train/val split (10 train, 6 val). Validation split is the ship gate.", + "should_trigger": { + "train": [ + "review this PR with the expert panel", + "run the apm-review-panel against this branch", + "do a multi-persona review of PR #1234", + "panel-review this PR", + "get the architecture + security + UX review on this PR" + ], + "val": [ + "ask the python architect, cli logging expert, and security expert to weigh in on this PR", + "have the apm panel verdict this PR", + "i need a cross-cutting expert review of this change before merge" + ] + }, + "should_not_trigger": { + "train": [ + "give me a code review of this file", + "what does the python architect think of this class hierarchy", + "review this docstring", + "fix the lint errors in src/", + "what is the apm-review-panel skill" + ], + "val": [ + "explain how the auth resolver works", + "summarize the diff of PR #1234", + "draft a PR description for me" + ] + } +} diff --git a/.github/agents/apm-ceo.agent.md b/.github/agents/apm-ceo.agent.md index 2d99f328b..852b74100 100644 --- a/.github/agents/apm-ceo.agent.md +++ b/.github/agents/apm-ceo.agent.md @@ -94,3 +94,34 @@ For any non-trivial change, ask: - You do NOT touch `WIP/growth-strategy.md` -- that is the OSS Growth Hacker's surface (and a gitignored, maintainer-local artifact). You consume their output as input to strategic calls. + +## Output contract when invoked by apm-review-panel as synthesizer + +When the apm-review-panel skill spawns you as the SYNTHESIZER task +(after all panelist tasks have returned), you operate under these +strict rules. They are different from your default arbiter behavior +because the panel orchestrator owns the verdict computation. + +- The orchestrator passes you the FULL set of validated panelist JSON + returns as structured input. +- You produce ARBITRATION PROSE ONLY. You do NOT pick the verdict. + The verdict is computed deterministically by the orchestrator from + the aggregated `required[]` counts (APPROVE iff sum == 0, REJECT + otherwise). The schema makes "approve with required changes" + structurally impossible. +- You return JSON matching `assets/ceo-return-schema.json` from the + apm-review-panel skill, as the FINAL message of your task. No prose + around the JSON; the orchestrator parses your last message. + - `arbitration`: 1-3 paragraphs. Resolve any disagreement between + specialists. Surface strategic implications (positioning, breaking + change, naming, scope). If specialists agreed and the change is + uncontroversial, say so plainly. + - `dissent_notes` (optional): when two or more panelists disagreed + on whether a finding is REQUIRED vs NIT, name the disagreement + and state which side you side with and why. + - `growth_signal` (optional): echo any side-channel note from the + oss-growth-hacker panelist that should be amplified in the + headline (conversion, narrative, breaking-change comms). +- You MUST NOT call `gh pr comment`, `gh pr edit`, `gh issue`, or any + other GitHub write command. You MUST NOT post to `safe-outputs`. + The orchestrator is the sole writer. diff --git a/.github/agents/auth-expert.agent.md b/.github/agents/auth-expert.agent.md index cd3331860..162f741b9 100644 --- a/.github/agents/auth-expert.agent.md +++ b/.github/agents/auth-expert.agent.md @@ -55,3 +55,33 @@ When reviewing or writing auth code: - Classic PATs (`ghp_`) work cross-org but are being deprecated -- prefer fine-grained - ADO uses Basic auth with base64-encoded `:PAT` -- different from GitHub bearer token flow - ADO also supports AAD bearer tokens via `az account get-access-token` (resource `499b84ac-1321-427f-aa17-267ca6975798`); precedence is `ADO_APM_PAT` -> az bearer -> fail. Stale PATs (401) silently fall back to the bearer with a `[!]` warning. See the auth skill for the four diagnostic cases. + +## Output contract when invoked by apm-review-panel + +When the apm-review-panel skill spawns you as a panelist task, you +operate under these strict rules. They override any default behavior +that would post comments or apply labels. + +- You read the persona scope above and the PR title/body/diff passed + in the task prompt. +- You produce findings in TWO buckets only: + - `required`: blocks merge. Real, actionable, citing file/line where + possible. Anything you put here will produce a REJECT verdict. + - `nits`: one-line suggestions the author can skip. No third bucket, + no "consider", no "optional follow-up". If a finding is real and + matters, it is required. If not, it is a nit. +- You return JSON matching `assets/panelist-return-schema.json` from + the apm-review-panel skill, as the FINAL message of your task. No + prose around the JSON; the orchestrator parses your last message. +- You MUST NOT call `gh pr comment`, `gh pr edit`, `gh issue`, or any + other GitHub write command. You MUST NOT post to `safe-outputs`. + You MUST NOT touch the PR state. The orchestrator is the sole + writer; your only output channel is the JSON return. +- If you have nothing blocking AND nothing worth nitting, return + `{persona: "", required: [], nits: []}`. That is a + valid and preferred answer when true. +- Auth-specific: when the apm-review-panel orchestrator spawns you + with "active=false" framing (the conditional rule did not fire), you + return `{persona: "auth-expert", active: false, inactive_reason: + "", required: [], nits: []}` + WITHOUT performing a full review. Trust the orchestrator's routing. diff --git a/.github/agents/cli-logging-expert.agent.md b/.github/agents/cli-logging-expert.agent.md index f6dca5b8b..c11e08b05 100644 --- a/.github/agents/cli-logging-expert.agent.md +++ b/.github/agents/cli-logging-expert.agent.md @@ -48,3 +48,28 @@ You are an expert on CLI output UX with excellent taste. You ensure verbose mode 3. **Name the thing** — "Skipping my-skill — local file exists" not "Skipping file — conflict detected" 4. **Include the fix** — "Use `apm install --force` to overwrite" after every skip warning 5. **No emojis** — ASCII `STATUS_SYMBOLS` only, never emoji characters + +## Output contract when invoked by apm-review-panel + +When the apm-review-panel skill spawns you as a panelist task, you +operate under these strict rules. They override any default behavior +that would post comments or apply labels. + +- You read the persona scope above and the PR title/body/diff passed + in the task prompt. +- You produce findings in TWO buckets only: + - `required`: blocks merge. Real, actionable, citing file/line where + possible. Anything you put here will produce a REJECT verdict. + - `nits`: one-line suggestions the author can skip. No third bucket, + no "consider", no "optional follow-up". If a finding is real and + matters, it is required. If not, it is a nit. +- You return JSON matching `assets/panelist-return-schema.json` from + the apm-review-panel skill, as the FINAL message of your task. No + prose around the JSON; the orchestrator parses your last message. +- You MUST NOT call `gh pr comment`, `gh pr edit`, `gh issue`, or any + other GitHub write command. You MUST NOT post to `safe-outputs`. + You MUST NOT touch the PR state. The orchestrator is the sole + writer; your only output channel is the JSON return. +- If you have nothing blocking AND nothing worth nitting, return + `{persona: "", required: [], nits: []}`. That is a + valid and preferred answer when true. diff --git a/.github/agents/devx-ux-expert.agent.md b/.github/agents/devx-ux-expert.agent.md index 811eac7b8..25b10e0e2 100644 --- a/.github/agents/devx-ux-expert.agent.md +++ b/.github/agents/devx-ux-expert.agent.md @@ -78,3 +78,28 @@ When reviewing a command, command help text, or a workflow change, ask: expert when a UX change touches auth, lockfile integrity, or download paths. - Strategic naming / positioning calls escalate to the APM CEO. + +## Output contract when invoked by apm-review-panel + +When the apm-review-panel skill spawns you as a panelist task, you +operate under these strict rules. They override any default behavior +that would post comments or apply labels. + +- You read the persona scope above and the PR title/body/diff passed + in the task prompt. +- You produce findings in TWO buckets only: + - `required`: blocks merge. Real, actionable, citing file/line where + possible. Anything you put here will produce a REJECT verdict. + - `nits`: one-line suggestions the author can skip. No third bucket, + no "consider", no "optional follow-up". If a finding is real and + matters, it is required. If not, it is a nit. +- You return JSON matching `assets/panelist-return-schema.json` from + the apm-review-panel skill, as the FINAL message of your task. No + prose around the JSON; the orchestrator parses your last message. +- You MUST NOT call `gh pr comment`, `gh pr edit`, `gh issue`, or any + other GitHub write command. You MUST NOT post to `safe-outputs`. + You MUST NOT touch the PR state. The orchestrator is the sole + writer; your only output channel is the JSON return. +- If you have nothing blocking AND nothing worth nitting, return + `{persona: "", required: [], nits: []}`. That is a + valid and preferred answer when true. diff --git a/.github/agents/oss-growth-hacker.agent.md b/.github/agents/oss-growth-hacker.agent.md index dbe67877f..7cad1a8e5 100644 --- a/.github/agents/oss-growth-hacker.agent.md +++ b/.github/agents/oss-growth-hacker.agent.md @@ -97,3 +97,28 @@ The CEO consumes your annotations when making the final call. - You write only to `WIP/growth-strategy.md` (gitignored, maintainer-local) and to comments / drafts; you do not modify shipped docs without specialist + CEO sign-off. Never stage or commit anything under `WIP/`. + +## Output contract when invoked by apm-review-panel + +When the apm-review-panel skill spawns you as a panelist task, you +operate under these strict rules. They override any default behavior +that would post comments or apply labels. + +- You read the persona scope above and the PR title/body/diff passed + in the task prompt. +- You produce findings in TWO buckets only: + - `required`: blocks merge. Real, actionable, citing file/line where + possible. Anything you put here will produce a REJECT verdict. + - `nits`: one-line suggestions the author can skip. No third bucket, + no "consider", no "optional follow-up". If a finding is real and + matters, it is required. If not, it is a nit. +- You return JSON matching `assets/panelist-return-schema.json` from + the apm-review-panel skill, as the FINAL message of your task. No + prose around the JSON; the orchestrator parses your last message. +- You MUST NOT call `gh pr comment`, `gh pr edit`, `gh issue`, or any + other GitHub write command. You MUST NOT post to `safe-outputs`. + You MUST NOT touch the PR state. The orchestrator is the sole + writer; your only output channel is the JSON return. +- If you have nothing blocking AND nothing worth nitting, return + `{persona: "", required: [], nits: []}`. That is a + valid and preferred answer when true. diff --git a/.github/agents/python-architect.agent.md b/.github/agents/python-architect.agent.md index ebedc9af5..1ddc12662 100644 --- a/.github/agents/python-architect.agent.md +++ b/.github/agents/python-architect.agent.md @@ -223,3 +223,28 @@ Rules for this subsection: over-engineering, write "Pragmatic suggestion: none -- the current shape is the simplest correct design at this scope." That is a valid and preferred answer when true. + +## Output contract when invoked by apm-review-panel + +When the apm-review-panel skill spawns you as a panelist task, you +operate under these strict rules. They override any default behavior +that would post comments or apply labels. + +- You read the persona scope above and the PR title/body/diff passed + in the task prompt. +- You produce findings in TWO buckets only: + - `required`: blocks merge. Real, actionable, citing file/line where + possible. Anything you put here will produce a REJECT verdict. + - `nits`: one-line suggestions the author can skip. No third bucket, + no "consider", no "optional follow-up". If a finding is real and + matters, it is required. If not, it is a nit. +- You return JSON matching `assets/panelist-return-schema.json` from + the apm-review-panel skill, as the FINAL message of your task. No + prose around the JSON; the orchestrator parses your last message. +- You MUST NOT call `gh pr comment`, `gh pr edit`, `gh issue`, or any + other GitHub write command. You MUST NOT post to `safe-outputs`. + You MUST NOT touch the PR state. The orchestrator is the sole + writer; your only output channel is the JSON return. +- If you have nothing blocking AND nothing worth nitting, return + `{persona: "", required: [], nits: []}`. That is a + valid and preferred answer when true. diff --git a/.github/agents/supply-chain-security-expert.agent.md b/.github/agents/supply-chain-security-expert.agent.md index 4c8a2eafd..3fc055532 100644 --- a/.github/agents/supply-chain-security-expert.agent.md +++ b/.github/agents/supply-chain-security-expert.agent.md @@ -94,3 +94,28 @@ file integration, ask: trade-off to the DevX UX expert and escalate to the CEO. - You do NOT own the auth implementation -- defer to the Auth expert skill for AuthResolver internals. + +## Output contract when invoked by apm-review-panel + +When the apm-review-panel skill spawns you as a panelist task, you +operate under these strict rules. They override any default behavior +that would post comments or apply labels. + +- You read the persona scope above and the PR title/body/diff passed + in the task prompt. +- You produce findings in TWO buckets only: + - `required`: blocks merge. Real, actionable, citing file/line where + possible. Anything you put here will produce a REJECT verdict. + - `nits`: one-line suggestions the author can skip. No third bucket, + no "consider", no "optional follow-up". If a finding is real and + matters, it is required. If not, it is a nit. +- You return JSON matching `assets/panelist-return-schema.json` from + the apm-review-panel skill, as the FINAL message of your task. No + prose around the JSON; the orchestrator parses your last message. +- You MUST NOT call `gh pr comment`, `gh pr edit`, `gh issue`, or any + other GitHub write command. You MUST NOT post to `safe-outputs`. + You MUST NOT touch the PR state. The orchestrator is the sole + writer; your only output channel is the JSON return. +- If you have nothing blocking AND nothing worth nitting, return + `{persona: "", required: [], nits: []}`. That is a + valid and preferred answer when true. diff --git a/.github/skills/apm-review-panel/SKILL.md b/.github/skills/apm-review-panel/SKILL.md index 1a5dae18e..f2bdf92ce 100644 --- a/.github/skills/apm-review-panel/SKILL.md +++ b/.github/skills/apm-review-panel/SKILL.md @@ -1,59 +1,99 @@ --- name: apm-review-panel description: >- - Use this skill to run a six-agent panel review (plus one conditional - auth specialist) for non-trivial PRs, design proposals, release - decisions, and other cross-cutting changes in microsoft/apm. Emit one - synthesized verdict comment. + Use this skill to run a multi-persona expert panel review on a labelled + pull request in microsoft/apm. The panel fans out to five mandatory + specialists plus one conditional auth specialist, all running in their + own agent threads, and a CEO synthesizer. The orchestrator is the sole + writer to the PR: one comment plus exactly one verdict label + (panel-approved or panel-rejected, derived deterministically from the + aggregated findings). Activate when a non-trivial PR needs cross-cutting + review (architecture, CLI logging, DevX UX, supply-chain security, + growth/positioning, optionally auth, with CEO arbitration). --- -# APM Review Panel -- Expert Review Orchestration - -The panel is fixed at **5 mandatory specialist lenses + 1 conditional -auth lens + 1 arbiter lens = up to 7 persona sections in one verdict -comment**. You play each lens in turn from inside a single agent loop -(progressive-disclosure skill model -- no sub-agent dispatch). Routing -chooses *which* lenses execute; it never changes which headings appear -in the final verdict. +# APM Review Panel - Fan-Out Expert Review + +The panel is FAN-OUT + SYNTHESIZER. Each persona runs in its own agent +thread (via the `task` tool) and returns JSON matching +`assets/panelist-return-schema.json`. The orchestrator schema-validates +each return, hands all returns to the apm-ceo synthesizer (also a task +thread, returns JSON matching `assets/ceo-return-schema.json`), then +deterministically derives the verdict and writes ONE comment plus ONE +verdict label. + +## Architecture invariants + +- **Verdict is binary and deterministic.** APPROVE iff + `sum(len(p.required) for p in panelists if p.active) == 0`. + REJECT otherwise. The CEO does NOT pick the verdict; the CEO writes + the arbitration narrative. This kills the "approve with reservations" + failure mode at the schema level. +- **Two severity buckets only.** `required` blocks merge. `nits` are + one-line suggestions the author can skip. There is no "optional", + no "consider", no "maybe later". If a finding is real and matters, + it is required. If not, it is a nit. No third bucket accumulates debt. +- **Single-writer interlock.** Only the orchestrator writes to + `safe-outputs` (one `add-comment`, one `add-labels`, one + `remove-labels`). Panelist subagents and the CEO subagent return + JSON only and MUST NOT call any `gh` write command, post comments, + apply labels, or touch the PR state. +- **Single-emission discipline.** Exactly one comment per panel run, + rendered from `assets/verdict-template.md` after all subagents return. ## Agent roster -| Agent | Persona | Always active? | -|-------|---------|----------------| +| Agent | Role | Always active? | +|-------|------|----------------| | [Python Architect](../../agents/python-architect.agent.md) | Architectural Reviewer | Yes | | [CLI Logging Expert](../../agents/cli-logging-expert.agent.md) | Output UX Reviewer | Yes | | [DevX UX Expert](../../agents/devx-ux-expert.agent.md) | Package-Manager UX | Yes | | [Supply Chain Security Expert](../../agents/supply-chain-security-expert.agent.md) | Threat-Model Reviewer | Yes | -| [OSS Growth Hacker](../../agents/oss-growth-hacker.agent.md) | Adoption Strategist | Yes (side-channel to CEO) | -| [Auth Expert](../../agents/auth-expert.agent.md) | Auth / Token Reviewer | Conditional (see "Conditional panelist" below) | -| [APM CEO](../../agents/apm-ceo.agent.md) | Strategic Owner / Arbiter | Yes (always arbitrates) | +| [OSS Growth Hacker](../../agents/oss-growth-hacker.agent.md) | Adoption Strategist | Yes | +| [Auth Expert](../../agents/auth-expert.agent.md) | Auth / Token Reviewer | Conditional (see below) | +| [APM CEO](../../agents/apm-ceo.agent.md) | Strategic Arbiter / Synthesizer | Yes | -## Routing topology +## Topology ``` - python-architect cli-logging-expert devx-ux-expert supply-chain-security-expert - \_______________________|______________________________/ - | <-- auth-expert (conditional) - v - apm-ceo <---- oss-growth-hacker - (final call / arbiter) (annotates findings; - updates growth-strategy) + apm-review-panel SKILL (orchestrator thread) + | + FAN-OUT via task tool (panelists in parallel) + | + +-----+-------+-------+-----+-----+------+ + v v v v v v v (cond.) + py cli dx-ux sec grw auth + | | | | | | + | each returns JSON per panelist-return-schema.json + +-----+-------+-------+-----+-------------+ + | + v <-- S4 schema-validate + v <-- on malformed: re-spawn that persona + v + task: apm-ceo synthesizer + - aggregates required[] across panelists + - resolves dissent + - returns ceo-return-schema.json (NO verdict field) + | + v <-- DETERMINISTIC verdict gate + | APPROVE iff sum(required) == 0 + | REJECT otherwise + v + orchestrator (sole writer) + | | | + v v v + add-comment add-labels remove-labels + (max:2) [panel-approved [panel-review] + XOR panel-rejected] ``` -- **Specialists raise findings independently** -- no implicit consensus. -- **CEO arbitrates** when specialists disagree or when a finding has - strategic implications (positioning, breaking change, naming, scope). -- **Growth Hacker is a side-channel** to the CEO: never blocks a - specialist finding; annotates it with growth implications and - escalates to the CEO when relevant. - ## Conditional panelist: Auth Expert Auth Expert is the only conditional panelist. Activate `auth-expert` if either rule below matches. -1. **Fast-path file trigger.** Activate the Auth Expert lens - immediately when the PR changes any of: +1. **Fast-path file trigger.** Activate the Auth Expert task when the + PR changes any of: - `src/apm_cli/core/auth.py` - `src/apm_cli/core/token_manager.py` - `src/apm_cli/core/azure_cli.py` @@ -64,7 +104,7 @@ if either rule below matches. - `src/apm_cli/deps/registry_proxy.py` 2. **Fallback self-check.** If no fast-path file matched, answer this - before activating the lens: + before activating: > Does this PR change authentication behavior, token management, > credential resolution, host classification used by `AuthResolver`, @@ -74,190 +114,159 @@ if either rule below matches. Routing rule: -- **YES** -> take the Auth Expert lens (per the Persona pass - procedure) and capture its findings. -- **NO** -> record `Auth Expert inactive reason: ` in working notes; do not take the lens. +- **YES** -> spawn the auth-expert task with the standard panelist + contract. It returns JSON with `active: true`. +- **NO** -> spawn the auth-expert task ANYWAY but instruct it to + return `{persona: "auth-expert", active: false, inactive_reason: + "", required: [], nits: []}`. + This keeps the schema uniform and the verdict template happy. - Never use wildcard heuristics like `*auth*`, `*token*`, or `*credential*` as the sole trigger. -## Routing matrix - -These routes choose which specialists are emphasised for a given PR -type. They do **not** change the verdict shape. The final comment -always uses every persona heading in `assets/verdict-template.md`; -the only persona that can render as `Not activated -- ` is -Auth Expert (per the conditional rule above). - -### Code review (architecture + logging) -1. Python Architect reviews structure / patterns / cross-file impact. -2. CLI Logging Expert reviews any output / logger changes. -3. CEO ratifies if the two disagree on abstraction vs consistency. - -### CLI UX review -1. DevX UX Expert reviews command surface, flags, help, error wording. -2. CLI Logging Expert reviews how outputs are emitted (logger methods). -3. Growth Hacker annotates if the change affects first-run conversion. -4. CEO ratifies any naming / positioning calls. - -### Security review -1. Supply Chain Security Expert maps the change to the threat model. -2. DevX UX Expert flags any ergonomics regression from the mitigation. -3. CEO arbitrates trade-offs; bias toward security on default behavior. - -### Auth review (only when the conditional Auth Expert is activated) -1. Auth Expert maps the change against AuthResolver, token precedence, - host classification, and credential helpers. -2. Supply Chain Security Expert checks for token-scoping or credential - leakage implications. -3. CEO ratifies any default-behavior change. - -### Release / comms review -1. CEO grounds the release framing in `gh` CLI stats. -2. Growth Hacker drafts hook + story angle; updates - `WIP/growth-strategy.md` (gitignored maintainer-local; create if absent). -3. Specialists sanity-check any technical claims in release notes. - -### Full panel review (non-trivial change) -1. Each mandatory specialist produces independent findings. -2. Auth Expert participates if the conditional rule above activates it. -3. Growth Hacker annotates findings with growth implications. -4. CEO synthesizes, resolves disagreements, makes the final call. -5. Surface decision and rationale to the author via the single verdict - comment. - -## Quality gates - -A non-trivial change passes when: - -- [ ] Python Architect: structure / patterns OK (or change explicitly - justified) -- [ ] CLI Logging Expert: output paths route through CommandLogger, - no direct `_rich_*` in commands -- [ ] DevX UX Expert: command surface familiar to npm/pip/cargo users, - every error has a next action -- [ ] Supply Chain Security Expert: no new path / auth / integrity - surface left unguarded; fails closed -- [ ] Auth Expert (only if activated): no regression to AuthResolver - precedence, host classification, or credential leakage surface -- [ ] APM CEO: trade-offs ratified, breaking changes have CHANGELOG + - migration line -- [ ] OSS Growth Hacker: conversion surfaces unaffected or improved; - `WIP/growth-strategy.md` updated if relevant (maintainer-local; - gitignored, never committed) - -## Notes - -- This skill orchestrates a panel **in your own context** -- you are - the only agent. You load each persona's `.agent.md` reference file - on demand (progressive disclosure), assume that persona's lens to - produce its findings, then move to the next persona. Do NOT spawn - sub-agents (no `task` tool dispatch) -- the panel is a sequence of - reasoning passes inside one agent loop, not a multi-agent fan-out. -- Persona detail lives in the linked `.agent.md` files. Read each - one when you switch to that persona; do not pre-load all of them. +## Routing matrix (emphasis only - all panelists always run) + +These routes describe WHICH specialist's findings the CEO weights more +heavily for a given PR type. They do NOT change which personas run - +every mandatory persona runs on every panel invocation, period. Routing +is a CEO synthesis hint, not a panelist gate. + +- **Architecture-heavy PR** -> CEO weights Python Architect findings + on abstraction calls; CLI Logging on consistency. +- **CLI UX PR** -> CEO weights DevX UX on command surface; CLI Logging + on output paths; Growth Hacker on first-run conversion. +- **Security PR** -> CEO biases toward Supply Chain Security on + default behavior; DevX UX flags ergonomics regression from any + mitigation. +- **Auth PR** (auth-expert active) -> CEO weights Auth Expert on + AuthResolver / token precedence; Supply Chain on token-scoping. +- **Release / comms PR** -> CEO weights Growth Hacker on hook + story + angle; specialists sanity-check technical claims. +- **Full panel** (default) -> CEO synthesizes equally; calls out any + dissent in `dissent_notes`. ## Execution checklist -When this skill is activated for a PR review, work through these -steps in order, in a single agent loop. Do not skip ahead and do not -emit any output before the final step. - -1. Read the PR context (title, body, labels, changed files, diff). - The orchestrating workflow already fetches this with `gh pr view` - / `gh pr diff` -- do not re-fetch. -2. Resolve the **Auth Expert conditional case** using the rule in - "Conditional panelist: Auth Expert" above. Record either an - activation decision (and proceed to step 3) or an `Auth Expert - inactive reason: ` in working notes. -3. For each mandatory persona (plus `auth-expert` if activated), - follow the **Persona pass procedure** below, one persona at a - time. Do not try to play multiple personas in a single pass. -4. Run the **pre-arbitration completeness gate**: - - Findings exist in working notes for the 5 mandatory specialists - (Python Architect, CLI Logging Expert, DevX UX Expert, Supply - Chain Security Expert, OSS Growth Hacker). - - Exactly one of `Auth Expert findings` or `Auth Expert inactive - reason` exists (neither = incomplete; both = inconsistent - routing). - - The Python Architect notes contain the OO / class mermaid - diagram, the execution-flow mermaid diagram, and the Design - patterns subsection declared in - `../../agents/python-architect.agent.md`. - - No persona section is missing or empty. - If any check fails, redo that persona's pass and repeat the gate. - Do not proceed to step 5 until the gate passes. -5. Take the **APM CEO** lens (load - `../../agents/apm-ceo.agent.md`) and arbitrate over the collected - findings -- still in your own context. CEO arbitration may run - only after the completeness gate has passed. -6. Now (and only now) load `assets/verdict-template.md` and fill it - in with the collected findings + arbitration. -7. Emit the filled template as exactly ONE comment via the workflow's - `safe-outputs.add-comment` channel. Never call the GitHub API - directly. This is the ONLY output emission for the entire panel - run -- no per-persona comments, no progress comments. - -### Persona pass procedure - -For each persona, run this exact procedure in your own context: - -1. Open the persona's `.agent.md` file (linked in the roster) and - read its scope, lens, anti-patterns, and required return shape. -2. From that persona's lens, review the PR title/body/diff against - the scope declared in the file. -3. Write the findings to working notes under - `: ` (or, for an inactive Auth Expert, - `Auth Expert inactive reason: `). -4. Drop the persona lens before moving on. Do not emit any comment - from inside a persona pass; persona findings stay in working - notes until step 7 synthesizes them. - -## Output contract - -This contract is non-negotiable -- it is the difference between a -panel review that lands as one cohesive verdict and one that fragments -into per-persona noise. - -- Produce **exactly one** comment per panel run. The - `safe-outputs.add-comment.max` cap in the workflow is a fail-soft - ceiling (currently 7, one per persona slot); the one-comment - discipline lives here. -- Use `assets/verdict-template.md` as the comment body. Keep its - section headings exactly as written. Adapt the body of each section - to the PR. Do not invent new top-level sections or drop existing - ones. -- CEO arbitration may run only after the completeness gate passes. -- Never emit findings as separate comments, intermediate progress - comments, or "I will now invoke X" status comments. -- Load `assets/verdict-template.md` **at synthesis time only** (step - 6 above) -- not at activation, not while collecting findings. +Work through these steps in order. Do not skip ahead. Do not emit any +output to the PR before step 6. + +1. **Read PR context** (the orchestrating workflow already fetched it + via `gh pr view` / `gh pr diff` in step 1 of `pr-review-panel.md`). + Identify changed files for the auth-expert routing decision. + +2. **Resolve the auth-expert conditional** using the rule above. + Decide: spawn auth-expert active, OR spawn auth-expert with + `active: false` + an `inactive_reason`. Either way, auth-expert + IS spawned - the schema requires uniform return shape. + +3. **Fan out panelist tasks.** Spawn the following tasks in PARALLEL + via the `task` tool, one task per persona: + - `python-architect` + - `cli-logging-expert` + - `devx-ux-expert` + - `supply-chain-security-expert` + - `oss-growth-hacker` + - `auth-expert` (always - active per step 2) + + Each task prompt MUST: + - Reference its persona file by relative path so the subagent loads + its own scope, lens, and anti-patterns. + - Include the PR number, title, body, and diff (passed inline). + - Cite `assets/panelist-return-schema.json` and require the + subagent to emit JSON matching that schema as its FINAL message. + - Restate the output contract: NO `gh` write commands, NO posting + comments, NO label changes, NO touching PR state. JSON return + only. + +4. **S4 schema gate.** When each panelist task returns, parse the JSON + and validate against `assets/panelist-return-schema.json`. On + validation failure: + - Re-spawn that ONE panelist with an explicit error message + pointing at the violated rule (e.g. "your previous return was + missing the `nits` field"). + - Maximum two re-spawn attempts per panelist. If still malformed, + synthesize a placeholder `{persona: "", active: true, + required: [], nits: [], extras: {schema_failure: ""}}` + and surface the failure in the CEO arbitration prompt. + +5. **Spawn the CEO synthesizer task.** Pass the full set of validated + panelist JSON returns to a `task` invocation that loads + `../../agents/apm-ceo.agent.md`. The prompt MUST: + - Provide all panelist returns as structured input. + - Ask for arbitration prose, dissent resolution, and any growth + signal worth amplifying. + - Cite `assets/ceo-return-schema.json` and require JSON return. + - Restate the contract: CEO does NOT pick verdict, only writes + arbitration. NO `gh` write commands. + + Validate the CEO return against `assets/ceo-return-schema.json`. + On failure, re-spawn once with the violation cited. + +6. **Compute verdict deterministically:** + + ``` + total_required = sum(len(p["required"]) for p in panelists if p.get("active", True)) + verdict = "APPROVE" if total_required == 0 else "REJECT" + verdict_label = "panel-approved" if verdict == "APPROVE" else "panel-rejected" + ``` + + The CEO does not influence this calculation. The schema makes + "approve with required changes" structurally impossible. + +7. **Render the comment.** Load `assets/verdict-template.md`, fill the + placeholders from the panelist + CEO JSON, and emit it as exactly + ONE comment via `safe-outputs.add-comment`. NEVER call the GitHub + API directly. NEVER emit per-persona comments or progress comments. + +8. **Apply the verdict label** via `safe-outputs.add-labels`: + `[verdict_label]` (one of `panel-approved` / `panel-rejected`). + +9. **Remove the trigger label** via `safe-outputs.remove-labels`: + `[panel-review]`. This makes the trigger idempotent - re-applying + the label will re-run the panel cleanly. + +## Output contract (non-negotiable) + +- Exactly ONE comment per panel run, top-loaded per + `assets/verdict-template.md`. The `safe-outputs.add-comment.max: 2` + is a fail-soft ceiling; the discipline lives here. +- Exactly ONE verdict label per panel run (`panel-approved` XOR + `panel-rejected`). +- Exactly ONE removal of the `panel-review` trigger label. +- Subagents (panelists + CEO) NEVER write to `safe-outputs`, NEVER + call `gh pr comment`, NEVER call `gh pr edit --add-label`. They + return JSON. The orchestrator is the sole writer. This discipline + is mirrored in each persona's "Output contract when invoked by + apm-review-panel" section. +- Never invent new top-level template sections or drop existing ones. ## Gotchas -- **Roster invariant.** The frontmatter description, the roster - table, the conditional-panelist rule, the verdict template, and the - quality gates MUST agree on the persona set. If you change one, - change all of them in the same edit. Description, roster, and - template are the three places drift most often appears. +- **Roster invariant.** The frontmatter description, the roster table, + the conditional rule, the verdict template, and the JSON schema + MUST agree on the persona set. If you change one, change all in the + same edit. - **False-negative auth gotcha.** Auth regressions can be introduced - from non-auth files that change the *inputs* to auth -- for - example host classification, dependency parsing, clone URL - construction, HTTP authorization headers, or call sites that - bypass `AuthResolver`. If a diff changes how a remote host, org, - token source, or fallback path is selected and you are not certain - it is auth-neutral, activate `auth-expert`. + from non-auth files that change the inputs to auth - host + classification, dependency parsing, clone URL construction, HTTP + authorization headers, or call sites that bypass `AuthResolver`. If + a diff changes how a remote host, org, token source, or fallback + path is selected and you are not certain it is auth-neutral, + activate auth-expert as `active: true`. - **Bundle layout on the runner.** When this skill runs inside the PR-review agentic workflow, the APM bundle is unpacked under - `.github/skills/apm-review-panel/` first, with `.apm/skills/...` - as a fallback. The asset path is the same relative to the skill - root (`assets/verdict-template.md`) in both layouts -- prefer the - `.github/...` path when present. -- **No multi-persona-in-one-pass.** Each persona has its own - `.agent.md` for a reason -- read it when you take that lens, write - the findings, then drop the lens before moving on. Trying to be all - personas in one reasoning pass is the most common cause of dropped - findings and mixed voices. -- **Single-emission discipline is fragile under interruption.** If - you find yourself wanting to "post a quick partial verdict and - then update it", don't. Buffer in working notes; emit once. + `.github/skills/apm-review-panel/` first, with `.apm/skills/...` as + a fallback. The asset paths are the same relative to the skill root + in both layouts. +- **Subagent write enforcement is contract-based, not sandbox-based.** + In gh-aw, tool permissions are workflow-scoped, not subagent-scoped, + so every spawned task technically inherits the same `gh` toolset. + The "subagents must not write" rule is enforced by the prompt + contract in each `.agent.md` plus the `safe-outputs.add-comment.max: + 2` fail-soft. If a subagent ever tries to post a comment, the cap + catches it; if it tries to bypass `safe-outputs` and call `gh` + directly, that is a contract violation worth surfacing in the next + panel review of the persona file itself. +- **Trigger-label removal is the LAST step.** Doing it earlier risks + another labeller racing the panel mid-run. The companion workflow + `pr-panel-label-reset.yml` handles verdict-label invalidation on + every new push (deterministically, no LLM). diff --git a/.github/skills/apm-review-panel/assets/ceo-return-schema.json b/.github/skills/apm-review-panel/assets/ceo-return-schema.json new file mode 100644 index 000000000..466aa01b5 --- /dev/null +++ b/.github/skills/apm-review-panel/assets/ceo-return-schema.json @@ -0,0 +1,23 @@ +{ + "$schema": "http://json-schema.org/draft-07/schema#", + "$id": "ceo-return-schema.json", + "title": "APM Review Panel - CEO Synthesizer Return Shape", + "description": "Shape the apm-ceo persona MUST return when invoked as the panel synthesizer. CEO writes ARBITRATION PROSE only - never picks the verdict. The orchestrator derives verdict deterministically from aggregated panelist required-counts (APPROVE iff sum(required) == 0, REJECT otherwise).", + "type": "object", + "required": ["arbitration"], + "additionalProperties": false, + "properties": { + "arbitration": { + "type": "string", + "description": "One-to-three paragraph synthesis. Resolve any disagreement between specialists. Surface strategic implications (positioning, breaking change, naming, scope). If specialists agreed and the change is uncontroversial, say so plainly. ASCII only." + }, + "dissent_notes": { + "type": "string", + "description": "Optional. When two or more panelists disagreed on whether a finding is REQUIRED vs NIT, name the disagreement and state which side the CEO sided with and why. Empty / omitted when no dissent." + }, + "growth_signal": { + "type": "string", + "description": "Optional. CEO surface for the side-channel note from oss-growth-hacker (conversion, narrative, breaking-change comms). Empty / omitted when not relevant." + } + } +} diff --git a/.github/skills/apm-review-panel/assets/panelist-return-schema.json b/.github/skills/apm-review-panel/assets/panelist-return-schema.json new file mode 100644 index 000000000..89a095b6b --- /dev/null +++ b/.github/skills/apm-review-panel/assets/panelist-return-schema.json @@ -0,0 +1,76 @@ +{ + "$schema": "http://json-schema.org/draft-07/schema#", + "$id": "panelist-return-schema.json", + "title": "APM Review Panel - Panelist Return Shape", + "description": "Shape every panel persona MUST return when invoked by the apm-review-panel skill. Two severity buckets only: required (blocks merge) and nits (skip-if-you-want, single line). NO third bucket. NO disposition - the orchestrator computes verdict deterministically from aggregated required-counts.", + "type": "object", + "required": ["persona", "required", "nits"], + "additionalProperties": false, + "properties": { + "persona": { + "type": "string", + "description": "Persona slug. MUST equal the .agent.md filename stem.", + "enum": [ + "python-architect", + "cli-logging-expert", + "devx-ux-expert", + "supply-chain-security-expert", + "oss-growth-hacker", + "auth-expert" + ] + }, + "active": { + "type": "boolean", + "description": "Set to false ONLY for auth-expert when the conditional rule does not activate it. All other personas MUST set active=true. When false, required and nits MUST be empty arrays and inactive_reason MUST be a one-sentence explanation citing the touched files.", + "default": true + }, + "inactive_reason": { + "type": "string", + "description": "Required when active=false. One sentence citing the touched files." + }, + "required": { + "type": "array", + "description": "Findings that MUST be addressed before merge. Empty array means this persona has no blocking concerns. Anything in this array causes the orchestrator to emit a REJECT verdict.", + "items": { "$ref": "#/definitions/finding" } + }, + "nits": { + "type": "array", + "description": "One-line suggestions the author can skip. NEVER block merge. No cap - trust the panelist's judgement.", + "items": { "$ref": "#/definitions/finding" } + }, + "extras": { + "type": "object", + "description": "Persona-specific structured payload. Reserved for python-architect mermaid diagrams and oss-growth-hacker side-channel notes. Never blocks; never affects verdict.", + "additionalProperties": true + } + }, + "definitions": { + "finding": { + "type": "object", + "required": ["summary", "rationale"], + "additionalProperties": false, + "properties": { + "summary": { + "type": "string", + "description": "One-line description of the finding. ASCII only. No emojis." + }, + "rationale": { + "type": "string", + "description": "WHY this matters. Cite the rule or pattern violated. ASCII only." + }, + "file": { + "type": "string", + "description": "Optional repo-relative path to the file." + }, + "line": { + "type": "integer", + "description": "Optional line number." + }, + "suggestion": { + "type": "string", + "description": "Optional concrete fix (diff hint, replacement code, command to run)." + } + } + } + } +} diff --git a/.github/skills/apm-review-panel/assets/verdict-template.md b/.github/skills/apm-review-panel/assets/verdict-template.md index ce032bdde..49b18d663 100644 --- a/.github/skills/apm-review-panel/assets/verdict-template.md +++ b/.github/skills/apm-review-panel/assets/verdict-template.md @@ -1,62 +1,110 @@ -## APM Review Panel Verdict +## APM Review Panel Verdict: -**Disposition**: +> ---- +### Required before merge ( items) -### Per-persona findings + +None. + +- [] :`> + - Why: + - > +- ... -**Python Architect**: +### Nits ( items, skip if you want) -**CLI Logging Expert**: + +None. + +- [] :`> +- ... -**DevX UX Expert**: +### CEO arbitration -**Supply Chain Security Expert**: + -**Auth Expert**: "> + +**Dissent resolved:** -**OSS Growth Hacker**: + +**Growth/positioning note:** --- -### CEO arbitration +
+Per-persona findings (full) - +#### Python Architect ---- + + + -### Required actions before merge +#### CLI Logging Expert -1. -2. <...> + ---- +#### DevX UX Expert + + + +#### Supply Chain Security Expert + + + +#### Auth Expert + + + + +Inactive -- + +#### OSS Growth Hacker + + -### Optional follow-ups +
-- -- <...> +Verdict computed deterministically: required findings across + active panelists. APPROVE iff N == 0. Push a new commit to clear +this verdict label automatically. diff --git a/.github/skills/apm-review-panel/evals/README.md b/.github/skills/apm-review-panel/evals/README.md new file mode 100644 index 000000000..04ae1503b --- /dev/null +++ b/.github/skills/apm-review-panel/evals/README.md @@ -0,0 +1,67 @@ +# Evals: apm-review-panel + +Per genesis Step 8 evals gate. Two categories: + +1. **TRIGGER EVALS** validate the dispatch description correctly + discriminates should-trigger queries from near-miss should-NOT + queries. Validation split is the ship gate (>= 0.5 on positives, + < 0.5 on negatives). + +2. **CONTENT EVALS** validate that the skill, when activated, produces + the JSON-derived top-loaded verdict comment in the shape declared + by `assets/verdict-template.md`. Run with-skill vs without-skill; + if the deltas are not visible, the skill is not adding value. + +## Files + +- `trigger-evals.json` -- 16 queries (8 should-trigger + 8 should-NOT), + 60/40 train/val split. +- `content-eval-clean-pr.md` -- synthetic clean PR scenario; expected + verdict = APPROVE, all panelists return `required: []`. +- `content-eval-rejected-pr.md` -- synthetic PR with one architectural + smell + one nit; expected verdict = REJECT, python-architect returns + one `required` finding. +- `run-verdict-harness.py` -- DETERMINISTIC harness covering the + parts of the panel that do NOT require an LLM: JSON schema + validation (S4 gate) and verdict computation. Five cases: two + positive (clean-pr APPROVE, rejected-pr REJECT) and three negative + (missing-nits, unknown-persona, disposition-leak). All five MUST + pass before merging any change to schemas, the SKILL.md execution + checklist verdict rule, or the persona output contracts. + +## How to run + +### Deterministic harness (free, runs anywhere) + +``` +uv run --with jsonschema python3 \ + .apm/skills/apm-review-panel/evals/run-verdict-harness.py +``` + +Expected: `RESULT: ALL PASS`. Proves schemas reject malformed shapes +and verdict math is correct. Does NOT prove the LLM panelists will +return well-formed JSON in practice -- that requires a real run. + +### Full end-to-end (option B, branch-pin test) + +The gh-aw workflow imports the panel skill from `microsoft/apm#main` +for security (anti-self-approval). Pre-merge validation requires a +temporary branch pin: + +1. On the feature branch, change `imports.packages` in + `.github/workflows/pr-review-panel.md` from `microsoft/apm#main` + to `microsoft/apm#`. +2. `gh aw compile pr-review-panel`. +3. Commit as `chore: TEMP pin to branch for end-to-end test`. +4. Open a tiny throwaway PR; label it `panel-review`. +5. Observe: 6 task threads spawn, JSON returns, verdict label + applied, `panel-review` removed, `panel-approved` (or + `panel-rejected`) set. Push a commit and watch the + deterministic `pr-panel-label-reset.yml` strip the verdict + label. +6. Revert the temp-pin commit before merge. + +### Trigger evals + +Trigger evals can be run via the genesis evals harness or any +dispatcher that loads the skill description and queries it. diff --git a/.github/skills/apm-review-panel/evals/content-eval-clean-pr.md b/.github/skills/apm-review-panel/evals/content-eval-clean-pr.md new file mode 100644 index 000000000..4a3651dee --- /dev/null +++ b/.github/skills/apm-review-panel/evals/content-eval-clean-pr.md @@ -0,0 +1,49 @@ +# Content eval: clean PR (expected APPROVE) + +## Synthetic PR input + +- Title: `docs(readme): clarify install command for fish shell users` +- Body: Adds a one-paragraph note in README.md under the install + section pointing fish users to the existing `eval (apm shellenv)` + recipe. No code changes. +- Diff: 4 added lines, 0 removed, 1 file (`README.md`). +- Files: `README.md` + +## Expected orchestrator behavior + +- Spawn 6 panelist tasks in parallel (5 mandatory + auth-expert with + `active: false` because no auth files touched). +- Each panelist returns `{persona: "", required: [], nits: [...]}` + where `nits` may contain at most a one-line wording suggestion. +- auth-expert returns `{persona: "auth-expert", active: false, + inactive_reason: "...README.md has no auth surface...", required: [], + nits: []}`. +- CEO synthesizer returns `{arbitration: "Documentation-only change + ... all specialists agree no required actions ...", dissent_notes + not present}`. +- Orchestrator computes `total_required = 0` -> verdict = APPROVE. +- Orchestrator emits ONE comment per `assets/verdict-template.md`: + - Header: `## APM Review Panel Verdict: APPROVE` + - Required section: "None." + - Nits section: rendered (or "None.") + - CEO arbitration paragraph + - Per-persona detail in collapsed `
` block +- Orchestrator applies `panel-approved` label. +- Orchestrator removes `panel-review` label. + +## Pass criteria + +- Comment header literally reads `APM Review Panel Verdict: APPROVE`. +- Required section reads "None.". +- Verdict label `panel-approved` is on the PR. +- `panel-review` label is no longer on the PR. +- Comment count from this run is exactly 1. + +## With-skill vs without-skill delta + +- With skill: structured top-loaded comment, deterministic verdict + label, fan-out across 6 distinct lenses. +- Without skill: free-form single-voice review with no label + automation, no schema, no guarantee on which lenses got applied. + +The structure IS the value. diff --git a/.github/skills/apm-review-panel/evals/content-eval-rejected-pr.md b/.github/skills/apm-review-panel/evals/content-eval-rejected-pr.md new file mode 100644 index 000000000..83a2f498b --- /dev/null +++ b/.github/skills/apm-review-panel/evals/content-eval-rejected-pr.md @@ -0,0 +1,62 @@ +# Content eval: PR with architectural smell + nit (expected REJECT) + +## Synthetic PR input + +- Title: `feat(install): add retry to package downloader` +- Body: Wraps `_download_package` in a 3-attempt retry loop with + fixed 2-second backoff. Adds a `retries` parameter to + `InstallPipeline.__init__` defaulting to 3. +- Diff: 35 added lines, 2 removed, 2 files. +- Files: + - `src/apm_cli/install/pipeline.py` (retry param wiring) + - `src/apm_cli/deps/github_downloader.py` (retry loop body inline + in `_download_package`) + +## Expected orchestrator behavior + +- Spawn 6 panelist tasks in parallel. +- Auth Expert activates (`github_downloader.py` is a fast-path file) + and returns `active: true` with possibly one finding about token + refresh on 401. +- Python Architect returns at least one `required` finding: + `{summary: "Inline retry loop violates Strategy pattern used + elsewhere in deps/", file: "src/apm_cli/deps/github_downloader.py", + rationale: "Codebase already has the chain-of-responsibility + retry pattern in AuthResolver; this should reuse it or extract a + shared retry helper rather than inlining a fixed loop"}`. +- CLI Logging Expert returns one `nit`: missing `[!]` warning when + retry kicks in. +- DevX UX Expert returns one `nit`: `retries=3` is undocumented in + CLI help. +- Supply Chain Security Expert returns 0 findings (no integrity + surface affected). +- OSS Growth Hacker returns 0 findings. +- CEO synthesizer returns `{arbitration: "Retry behavior is sound + but the inline loop misses the codebase's existing retry + abstraction. Python Architect's required action holds.", + dissent_notes not present}`. +- Orchestrator computes `total_required = 1` -> verdict = REJECT. +- Orchestrator emits ONE comment: + - Header: `## APM Review Panel Verdict: REJECT` + - Required section: 1 item, prefixed `[python-architect]`. + - Nits section: 2 items, prefixed `[cli-logging-expert]` and + `[devx-ux-expert]`. + - CEO arbitration paragraph. + - Per-persona detail in `
`. +- Orchestrator applies `panel-rejected` label. +- Orchestrator removes `panel-review` label. + +## Pass criteria + +- Comment header literally reads `APM Review Panel Verdict: REJECT`. +- Required section has exactly 1 item with `[python-architect]` prefix. +- Verdict label `panel-rejected` is on the PR. +- `panel-review` label is no longer on the PR. +- The auth-expert per-persona block renders (active = true). + +## With-skill vs without-skill delta + +Same as the clean-PR eval: with skill, the verdict is structured and +machine-actionable (label + binary outcome). Without skill, there is +no schema, no required-vs-nit discipline, and the developer cannot +tell at a glance whether the comment blocks merge or not. diff --git a/.github/skills/apm-review-panel/evals/results-e2e-pr931-2026-04-28.md b/.github/skills/apm-review-panel/evals/results-e2e-pr931-2026-04-28.md new file mode 100644 index 000000000..e0fdec31a --- /dev/null +++ b/.github/skills/apm-review-panel/evals/results-e2e-pr931-2026-04-28.md @@ -0,0 +1,71 @@ +# End-to-end eval result: PR #931 (2026-04-28) + +Real gh-aw run of the refactored panel against an unrelated open PR +in microsoft/apm. Branch-pin method (option B) -- workflow YAML +imported the panel skill from `microsoft/apm#refactor/review-panel-fanout` +instead of `#main`. + +## Setup + +- Workflow run: https://github.com/microsoft/apm/actions/runs/25069734881 +- Target PR: #931 (`docs(compile): clarify target routing docstring`, + +7/-3, no labels, external contributor) +- Trigger: `workflow_dispatch` from `refactor/review-panel-fanout` + with `pr_number=931` +- Skill source: `.apm/skills/apm-review-panel/SKILL.md` from this branch +- gh-aw version: v0.68.3 + +## Job graph (all SUCCESS) + +``` +pre_activation -> activation -> apm -> agent -> detection -> safe_outputs -> conclusion +``` + +All 7 jobs completed successfully on first run. + +## Acceptance criteria + +| # | Criterion | Result | +|---|---|---| +| 1 | Verdict header literally reads `## APM Review Panel Verdict: ` | PASS -- `## APM Review Panel Verdict: REJECT` | +| 2 | Top-loaded order: verdict -> required -> nits -> CEO -> per-persona | PASS -- exact order observed | +| 3 | Required findings rendered as `[persona] summary` + Why + Suggested fix | PASS -- 3 required items, all 3 panelists, all with file/line citations | +| 4 | Per-persona detail in collapsed `
` block at bottom | PASS -- `
Per-persona findings (full)...` | +| 5 | All 6 personas reported back (5 mandatory + auth-expert) | PASS -- python-architect, cli-logging-expert, devx-ux-expert, supply-chain-security-expert, auth-expert, oss-growth-hacker | +| 6 | Auth-expert correctly inactive when no auth surface touched | PASS -- "Inactive -- PR #931 modifies only a docstring ... no auth, token, credential, host-classification, or AuthResolver-related files are touched." | +| 7 | Verdict deterministic from `required[]` count (3 > 0 -> REJECT) | PASS -- correct mapping observed | +| 8 | Verdict label `panel-approved` XOR `panel-rejected` applied | PASS -- `panel-rejected` added | +| 9 | Comment count from this run = 1 (max:2 ceiling honored) | PASS -- 1 comment | +| 10 | Python Architect class diagram in extras passed through to per-persona block | PASS -- mermaid `classDiagram` rendered in details | + +## Surprise findings (positive) + +- The panel surfaced a REAL regression in PR #931 that prior reviewers + had not flagged: the proposed docstring silently drops `gemini` + routing that exists in the implementation. Three independent + panelists converged on the same root cause with zero dissent. The + CEO arbitration explicitly noted this convergence and recommended a + rebase-and-revise rather than a close. +- Growth Hacker correctly identified the contributor as external + (WilliamK112) and recommended a "warm rejection that models a + low-friction contribution loop" -- the fan-out captured a strategic + signal that a single-loop reviewer would have missed. + +## Failure modes NOT observed (negative confirmation) + +- No "approve with reservations" wording -- verdict is binary as + designed. +- No panelist comment posted to the PR -- only the orchestrator's + single synthesized comment. The single-writer interlock held. +- No `panel-review` label cleanup needed (the run was triggered via + workflow_dispatch, not via label, so there was no `panel-review` + label to remove). The label-removal path will be exercised when the + first real `panel-review`-labelled PR runs after merge. + +## Total cost + +One gh-aw workflow run, ~12 minutes wall-clock. No retries needed. + +## Artifact + +Full panel comment: https://github.com/microsoft/apm/pull/931#issuecomment- diff --git a/.github/skills/apm-review-panel/evals/results-label-reset-2026-04-28.md b/.github/skills/apm-review-panel/evals/results-label-reset-2026-04-28.md new file mode 100644 index 000000000..022548659 --- /dev/null +++ b/.github/skills/apm-review-panel/evals/results-label-reset-2026-04-28.md @@ -0,0 +1,63 @@ +# Label-reset workflow validation (2026-04-28) + +End-to-end test of `.github/workflows/pr-panel-label-reset.yml` — +the deterministic plain-Actions companion that strips verdict labels +when a PR receives a new commit (S7 DETERMINISTIC TOOL BRIDGE). + +## Setup + +Probe PR: microsoft/apm#1025 (closed) + +- Base: `refactor/review-panel-fanout` (so the workflow YAML is + loaded from the feature branch — required because + `pull_request: synchronize` reads the BASE-branch workflow file). +- Head: `test/label-reset-probe` (deleted after test). +- Initial state: one commit ahead of base, no labels. + +## Procedure + +1. Applied `panel-rejected` label manually (simulating a panel verdict). +2. Pushed a second commit to `test/label-reset-probe` head ref. +3. The push fires `pull_request: synchronize` on PR #1025. +4. Reset workflow runs from the base-branch YAML and iterates over + `panel-approved` / `panel-rejected`, calling `gh pr edit + --remove-label` for each. + +## Result: PASS + +| Check | Expected | Actual | +|-------|----------|--------| +| Workflow ran on synchronize | yes | yes — run 25076314935 | +| Run status | success | success (13s) | +| `panel-rejected` removed | yes | yes (labels: []) | +| Missing `panel-approved` handled gracefully | no error | yes (if/else swallows) | + +`panel-approved` path uses identical code in the same loop; testing +one path proves the logic. + +## Side finding + +The `panel-approved` repo label did NOT exist at test time. The +refactor PR introduces the workflow and the contract but did not +create the label itself. Created post-test: + +- `panel-approved` (#0E8A16, green) — Apm-review-panel verdict: APPROVE. Removed automatically on next push. +- `panel-rejected` (#B60205, red) — description and color updated for clarity. + +## Run log excerpt + +``` +Removed label 'panel-approved' (or it was not present). +Removed label 'panel-rejected' (or it was not present). +``` + +(Both branches of the conditional behave the same; the message text +is intentionally vague because gh CLI exit codes do not distinguish +"removed" from "was already absent".) + +## Cross-references + +- Workflow file: `.github/workflows/pr-panel-label-reset.yml` +- Probe PR: https://github.com/microsoft/apm/pull/1025 +- Run: https://github.com/microsoft/apm/actions/runs/25076314935 +- Companion e2e for the panel itself: `results-e2e-pr931-2026-04-28.md` diff --git a/.github/skills/apm-review-panel/evals/results-trigger-2026-04-28.md b/.github/skills/apm-review-panel/evals/results-trigger-2026-04-28.md new file mode 100644 index 000000000..42804e9e8 --- /dev/null +++ b/.github/skills/apm-review-panel/evals/results-trigger-2026-04-28.md @@ -0,0 +1,74 @@ +# Trigger Eval Self-Run (LLM dispatcher = me) + +Skill description (verbatim from `.apm/skills/apm-review-panel/SKILL.md`): +> Use this skill to run a multi-persona expert panel review on a labelled +> pull request in microsoft/apm. The panel fans out to five mandatory +> specialists plus one conditional auth specialist, all running in their +> own agent threads, and a CEO synthesizer. The orchestrator is the sole +> writer to the PR: one comment plus exactly one verdict label +> (panel-approved or panel-rejected, derived deterministically from the +> aggregated findings). Activate when a non-trivial PR needs cross-cutting +> review (architecture, CLI logging, DevX UX, supply-chain security, +> growth/positioning, optionally auth, with CEO arbitration). + +For each query I record: +- `expected` = label from trigger-evals.json (TRIGGER / NOT) +- `judged` = my dispatch decision as an LLM reading the description above +- `match` = expected == judged + +## Should-trigger TRAIN (5) + +| # | Query | Expected | Judged | Match | Reasoning | +|---|---|---|---|---|---| +| 1 | "review this PR with the expert panel" | TRIGGER | TRIGGER | YES | "expert panel" + "PR" maps cleanly to "multi-persona expert panel review on a labelled pull request" | +| 2 | "run the apm-review-panel against this branch" | TRIGGER | TRIGGER | YES | exact name match | +| 3 | "do a multi-persona review of PR #1234" | TRIGGER | TRIGGER | YES | "multi-persona" is verbatim in description | +| 4 | "panel-review this PR" | TRIGGER | TRIGGER | YES | "panel" + "PR" + uses the trigger-label term as verb | +| 5 | "get the architecture + security + UX review on this PR" | TRIGGER | TRIGGER | YES | enumerates 3 of the 6 listed lenses + PR scope | + +Train pass rate: 5/5 = 1.00 + +## Should-trigger VAL (3) + +| # | Query | Expected | Judged | Match | Reasoning | +|---|---|---|---|---|---| +| 6 | "ask the python architect, cli logging expert, and security expert to weigh in on this PR" | TRIGGER | TRIGGER | YES | enumerates 3 of the 5 mandatory specialists by name + PR scope | +| 7 | "have the apm panel verdict this PR" | TRIGGER | TRIGGER | YES | "apm panel" + "verdict" matches "verdict label (panel-approved or panel-rejected)" | +| 8 | "i need a cross-cutting expert review of this change before merge" | TRIGGER | TRIGGER | YES | "cross-cutting" is verbatim in description; "before merge" implies PR scope | + +Val pass rate: 3/3 = 1.00 (gate: >= 0.5) -- PASS + +## Should-NOT-trigger TRAIN (5) + +| # | Query | Expected | Judged | Match | Reasoning | +|---|---|---|---|---|---| +| 9 | "give me a code review of this file" | NOT | NOT | YES | single file, no PR framing, no panel language; would route to a generic code-review skill | +| 10 | "what does the python architect think of this class hierarchy" | NOT | NOT | YES | single persona, not panel; description requires "multi-persona" | +| 11 | "review this docstring" | NOT | NOT | YES | scope is one docstring, not a PR; no panel framing | +| 12 | "fix the lint errors in src/" | NOT | NOT | YES | mechanical task, not review; description is for review only | +| 13 | "what is the apm-review-panel skill" | NOT | NOT | YES | meta-question about the skill, not a request to run it; an info skill should answer this | + +Train pass rate (negative): 5/5 = 1.00 (correct rejection) + +## Should-NOT-trigger VAL (3) + +| # | Query | Expected | Judged | Match | Reasoning | +|---|---|---|---|---|---| +| 14 | "explain how the auth resolver works" | NOT | NOT | YES | explanation request, not review; auth-expert mention is incidental | +| 15 | "summarize the diff of PR #1234" | NOT | NOT | YES | summarization, not multi-persona review with verdict | +| 16 | "draft a PR description for me" | NOT | NOT | YES | authoring, not reviewing; pr-description-skill territory | + +Val pass rate (negative): 3/3 = 1.00 (gate: < 0.5 false-positive rate, i.e. >= 0.5 correct rejection) -- PASS + +## Summary + +| Split | Should-trigger correct | Should-NOT correct | Combined | +|---|---|---|---| +| Train (n=10) | 5/5 = 1.00 | 5/5 = 1.00 | 10/10 = 1.00 | +| Val (n=6) | 3/3 = 1.00 | 3/3 = 1.00 | 6/6 = 1.00 | + +**SHIP GATE (validation split):** +- Should-trigger pass rate: 1.00 >= 0.5 -- PASS +- Should-NOT pass rate: 1.00 >= 0.5 -- PASS + +**RESULT: PASS.** Dispatch description discriminates cleanly with no false positives or false negatives in this run. Caveat: I am a single LLM judging my own routing; the canonical eval would run a different model as dispatcher and average over multiple samples. But this is real LLM judgment, not hand-waving. diff --git a/.github/skills/apm-review-panel/evals/run-verdict-harness.py b/.github/skills/apm-review-panel/evals/run-verdict-harness.py new file mode 100644 index 000000000..b504f0ea3 --- /dev/null +++ b/.github/skills/apm-review-panel/evals/run-verdict-harness.py @@ -0,0 +1,277 @@ +#!/usr/bin/env python3 +""" +Deterministic verdict harness for the apm-review-panel skill. + +Validates two things end-to-end WITHOUT spinning up an LLM: +1. Synthetic panelist JSON returns conform to panelist-return-schema.json. +2. The orchestrator's deterministic verdict rule produces the expected + verdict + label mapping per the eval scenarios. + +This is genesis Step 8 evals gate, deterministic slice. It does NOT +prove that a real LLM panelist will return well-formed JSON (that +requires option B, the branch-pin end-to-end test). It DOES prove that +when a panelist returns well-formed JSON, the orchestrator math is +correct and the schema rejects malformed shapes. + +Usage: + uv run --with jsonschema python3 \ + .apm/skills/apm-review-panel/evals/run-verdict-harness.py +""" + +from __future__ import annotations + +import json +import sys +from pathlib import Path + +import jsonschema + +ROOT = Path(__file__).resolve().parents[1] +SCHEMA_PANELIST = json.loads( + (ROOT / "assets" / "panelist-return-schema.json").read_text() +) +SCHEMA_CEO = json.loads((ROOT / "assets" / "ceo-return-schema.json").read_text()) + + +def derive_verdict(panelists: list[dict]) -> tuple[str, str]: + """ + Pure verdict derivation per SKILL.md execution checklist step 6: + total_required = sum(len(p["required"]) for p in panelists if p.get("active", True)) + verdict = "APPROVE" if total_required == 0 else "REJECT" + """ + total_required = sum( + len(p["required"]) for p in panelists if p.get("active", True) + ) + verdict = "APPROVE" if total_required == 0 else "REJECT" + label = "panel-approved" if verdict == "APPROVE" else "panel-rejected" + return verdict, label + + +def case_clean_pr() -> dict: + """Synthetic returns for the docs-only PR scenario.""" + panelists = [ + {"persona": "python-architect", "required": [], "nits": []}, + { + "persona": "cli-logging-expert", + "required": [], + "nits": [ + { + "summary": "Wording: 'aliasing' could be 'aliases'", + "rationale": "Reads more naturally to npm/pip users", + "file": "README.md", + "line": 142, + } + ], + }, + {"persona": "devx-ux-expert", "required": [], "nits": []}, + {"persona": "supply-chain-security-expert", "required": [], "nits": []}, + {"persona": "oss-growth-hacker", "required": [], "nits": []}, + { + "persona": "auth-expert", + "active": False, + "inactive_reason": "README.md has no auth surface; no fast-path file matched.", + "required": [], + "nits": [], + }, + ] + ceo = { + "arbitration": ( + "Documentation-only change pointing fish users to an existing " + "recipe. All five mandatory specialists agree no required actions; " + "auth-expert correctly inactive." + ) + } + return { + "name": "clean-pr", + "panelists": panelists, + "ceo": ceo, + "expected_verdict": "APPROVE", + "expected_label": "panel-approved", + } + + +def case_rejected_pr() -> dict: + """Synthetic returns for the inline-retry-loop PR scenario.""" + panelists = [ + { + "persona": "python-architect", + "required": [ + { + "summary": "Inline retry loop violates Strategy pattern in deps/", + "rationale": ( + "Codebase already has chain-of-responsibility retry " + "via AuthResolver; reuse or extract a shared helper " + "instead of inlining a fixed loop." + ), + "file": "src/apm_cli/deps/github_downloader.py", + "line": 218, + "suggestion": "Extract retry into deps/_retry.py with same shape as AuthResolver chain.", + } + ], + "nits": [], + }, + { + "persona": "cli-logging-expert", + "required": [], + "nits": [ + { + "summary": "Missing [!] warning when retry kicks in", + "rationale": "User-visible network blip; STATUS_SYMBOLS guidance applies.", + "file": "src/apm_cli/deps/github_downloader.py", + } + ], + }, + { + "persona": "devx-ux-expert", + "required": [], + "nits": [ + { + "summary": "retries=3 undocumented in CLI help", + "rationale": "Mental model gap; npm/pip users expect to see configurable retries surfaced.", + } + ], + }, + {"persona": "supply-chain-security-expert", "required": [], "nits": []}, + {"persona": "oss-growth-hacker", "required": [], "nits": []}, + { + "persona": "auth-expert", + "active": True, + "required": [], + "nits": [ + { + "summary": "Consider 401-aware retry", + "rationale": "Stale-PAT handling already exists in AuthResolver; align.", + } + ], + }, + ] + ceo = { + "arbitration": ( + "Retry behavior is sound but the inline loop misses the " + "codebase's existing retry abstraction. Python Architect's " + "required action holds; nits across logging/devx/auth align." + ), + "dissent_notes": "", + } + return { + "name": "rejected-pr", + "panelists": panelists, + "ceo": ceo, + "expected_verdict": "REJECT", + "expected_label": "panel-rejected", + } + + +def case_malformed_panelist() -> dict: + """Negative case: panelist forgot the `nits` field. S4 gate must fail.""" + return { + "name": "malformed-missing-nits", + "panelist": {"persona": "python-architect", "required": []}, + "expect_schema_error": True, + } + + +def case_unknown_persona() -> dict: + """Negative case: persona enum violation.""" + return { + "name": "malformed-unknown-persona", + "panelist": { + "persona": "unknown-persona", + "required": [], + "nits": [], + }, + "expect_schema_error": True, + } + + +def case_disposition_leak() -> dict: + """Negative case: panelist tried to add a verdict field. additionalProperties:false catches.""" + return { + "name": "malformed-disposition-leak", + "panelist": { + "persona": "python-architect", + "required": [], + "nits": [], + "disposition": "APPROVE", + }, + "expect_schema_error": True, + } + + +def run_positive(case: dict) -> tuple[bool, str]: + """Validate every panelist + CEO return, then check verdict.""" + notes = [] + for p in case["panelists"]: + try: + jsonschema.validate(p, SCHEMA_PANELIST) + except jsonschema.ValidationError as e: + return False, f"panelist {p.get('persona')} failed schema: {e.message}" + notes.append( + f" - {p['persona']}: active={p.get('active', True)}, " + f"required={len(p['required'])}, nits={len(p['nits'])}" + ) + + try: + jsonschema.validate(case["ceo"], SCHEMA_CEO) + except jsonschema.ValidationError as e: + return False, f"CEO failed schema: {e.message}" + + verdict, label = derive_verdict(case["panelists"]) + if verdict != case["expected_verdict"]: + return ( + False, + f"expected verdict={case['expected_verdict']}, got {verdict}", + ) + if label != case["expected_label"]: + return False, f"expected label={case['expected_label']}, got {label}" + + return True, "\n".join( + notes + + [ + f" -> verdict={verdict}, label={label} (expected matched)", + ] + ) + + +def run_negative(case: dict) -> tuple[bool, str]: + """Validate that schema rejects the malformed shape.""" + try: + jsonschema.validate(case["panelist"], SCHEMA_PANELIST) + except jsonschema.ValidationError as e: + return True, f"correctly rejected: {e.message.splitlines()[0]}" + return False, "schema accepted a malformed return; S4 gate would have let it through" + + +def main() -> int: + cases_pos = [case_clean_pr(), case_rejected_pr()] + cases_neg = [case_malformed_panelist(), case_unknown_persona(), case_disposition_leak()] + + ok = True + print("APM Review Panel - Deterministic Verdict Harness") + print("=" * 60) + print() + print("Positive cases (well-formed JSON, expected verdict):") + for c in cases_pos: + passed, msg = run_positive(c) + marker = "[+] PASS" if passed else "[x] FAIL" + print(f"{marker} {c['name']}") + print(msg) + print() + ok = ok and passed + + print("Negative cases (malformed JSON, S4 schema gate must reject):") + for c in cases_neg: + passed, msg = run_negative(c) + marker = "[+] PASS" if passed else "[x] FAIL" + print(f"{marker} {c['name']}") + print(f" {msg}") + print() + ok = ok and passed + + print("=" * 60) + print("RESULT:", "ALL PASS" if ok else "FAIL") + return 0 if ok else 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/.github/skills/apm-review-panel/evals/trigger-evals.json b/.github/skills/apm-review-panel/evals/trigger-evals.json new file mode 100644 index 000000000..cbedde29d --- /dev/null +++ b/.github/skills/apm-review-panel/evals/trigger-evals.json @@ -0,0 +1,31 @@ +{ + "description": "Trigger evals for the apm-review-panel skill dispatch description. 8 should-trigger + 8 should-NOT-trigger. 60/40 train/val split (10 train, 6 val). Validation split is the ship gate.", + "should_trigger": { + "train": [ + "review this PR with the expert panel", + "run the apm-review-panel against this branch", + "do a multi-persona review of PR #1234", + "panel-review this PR", + "get the architecture + security + UX review on this PR" + ], + "val": [ + "ask the python architect, cli logging expert, and security expert to weigh in on this PR", + "have the apm panel verdict this PR", + "i need a cross-cutting expert review of this change before merge" + ] + }, + "should_not_trigger": { + "train": [ + "give me a code review of this file", + "what does the python architect think of this class hierarchy", + "review this docstring", + "fix the lint errors in src/", + "what is the apm-review-panel skill" + ], + "val": [ + "explain how the auth resolver works", + "summarize the diff of PR #1234", + "draft a PR description for me" + ] + } +} diff --git a/.github/workflows/pr-panel-label-reset.yml b/.github/workflows/pr-panel-label-reset.yml new file mode 100644 index 000000000..28dbd25a5 --- /dev/null +++ b/.github/workflows/pr-panel-label-reset.yml @@ -0,0 +1,43 @@ +name: PR Panel Label Reset + +# Purpose: when a PR with a panel verdict label gets a new commit pushed, +# strip both verdict labels (panel-approved, panel-rejected) so the verdict +# does not lie. This is a deterministic operation - no judgement, no LLM. +# Per genesis S7 DETERMINISTIC TOOL BRIDGE: facts-that-must-be-true with +# zero judgement do not belong in an LLM container. +# +# The companion gh-aw workflow pr-review-panel.md handles the smart side +# (panel orchestration, verdict derivation, label application). This file +# handles the dumb side (verdict invalidation on new push). +# +# This workflow does NOT remove the panel-review trigger label. That is +# done inside pr-review-panel.md after the panel completes (so the +# trigger label stays idempotent: re-add to re-run). + +on: + pull_request: + types: [synchronize] + +permissions: + pull-requests: write + contents: read + +jobs: + reset: + runs-on: ubuntu-latest + steps: + - name: Remove stale panel verdict labels + env: + GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} + PR_NUMBER: ${{ github.event.pull_request.number }} + REPO: ${{ github.repository }} + run: | + for label in panel-approved panel-rejected; do + # Best-effort removal: silently succeed when the label is + # not currently applied (gh exits non-zero in that case). + if gh pr edit "$PR_NUMBER" --repo "$REPO" --remove-label "$label" 2>&1; then + echo "Removed label '$label' (or it was not present)." + else + echo "Label '$label' not present - nothing to do." + fi + done diff --git a/.github/workflows/pr-review-panel.lock.yml b/.github/workflows/pr-review-panel.lock.yml index a095d6997..2a0df631c 100644 --- a/.github/workflows/pr-review-panel.lock.yml +++ b/.github/workflows/pr-review-panel.lock.yml @@ -1,4 +1,4 @@ -# gh-aw-metadata: {"schema_version":"v3","frontmatter_hash":"a8730b9c31fe50d5a50ee9d99212bc6da387fc069501aa0a425bdd8951ba9541","compiler_version":"v0.68.3","strict":true,"agent_id":"copilot"} +# gh-aw-metadata: {"schema_version":"v3","frontmatter_hash":"b815b0974637d563cb04400baa74bf4b2d661724ea984b770276ef17e827ae8a","compiler_version":"v0.68.3","strict":true,"agent_id":"copilot"} # gh-aw-manifest: {"version":1,"secrets":["COPILOT_GITHUB_TOKEN","GH_AW_GITHUB_MCP_SERVER_TOKEN","GH_AW_GITHUB_TOKEN","GH_AW_PLUGINS_TOKEN","GITHUB_TOKEN"],"actions":[{"repo":"actions/checkout","sha":"de0fac2e4500dabe0009e67214ff5f5447ce83dd","version":"v6.0.2"},{"repo":"actions/download-artifact","sha":"3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c","version":"v8.0.1"},{"repo":"actions/github-script","sha":"373c709c69115d41ff229c7e5df9f8788daa9553","version":"v9"},{"repo":"actions/upload-artifact","sha":"043fb46d1a93c77aae656e7c1c64a875d1fc6a0a","version":"v7"},{"repo":"github/gh-aw-actions/setup","sha":"ba90f2186d7ad780ec640f364005fa24e797b360","version":"v0.68.3"},{"repo":"microsoft/apm-action","sha":"9fe9337ef58b5e620e0113071ceb47a6a8a232f7","version":"v1.4.2"}],"containers":[{"image":"ghcr.io/github/gh-aw-firewall/agent:0.25.20"},{"image":"ghcr.io/github/gh-aw-firewall/api-proxy:0.25.20"},{"image":"ghcr.io/github/gh-aw-firewall/squid:0.25.20"},{"image":"ghcr.io/github/gh-aw-mcpg:v0.2.19"},{"image":"ghcr.io/github/github-mcp-server:v0.32.0"},{"image":"node:lts-alpine"}]} # ___ _ _ # / _ \ | | (_) @@ -213,16 +213,16 @@ jobs: run: | bash "${RUNNER_TEMP}/gh-aw/actions/create_prompt_first.sh" { - cat << 'GH_AW_PROMPT_daad15abf8de7f2c_EOF' + cat << 'GH_AW_PROMPT_6a61e25df8c4956d_EOF' - GH_AW_PROMPT_daad15abf8de7f2c_EOF + GH_AW_PROMPT_6a61e25df8c4956d_EOF cat "${RUNNER_TEMP}/gh-aw/prompts/xpia.md" cat "${RUNNER_TEMP}/gh-aw/prompts/temp_folder_prompt.md" cat "${RUNNER_TEMP}/gh-aw/prompts/markdown.md" cat "${RUNNER_TEMP}/gh-aw/prompts/safe_outputs_prompt.md" - cat << 'GH_AW_PROMPT_daad15abf8de7f2c_EOF' + cat << 'GH_AW_PROMPT_6a61e25df8c4956d_EOF' - Tools: add_comment(max:7), missing_tool, missing_data, noop + Tools: add_comment(max:2), add_labels, remove_labels, missing_tool, missing_data, noop The following GitHub context information is available for this workflow: @@ -252,15 +252,15 @@ jobs: {{/if}} - GH_AW_PROMPT_daad15abf8de7f2c_EOF + GH_AW_PROMPT_6a61e25df8c4956d_EOF cat "${RUNNER_TEMP}/gh-aw/prompts/github_mcp_tools_with_safeoutputs_prompt.md" - cat << 'GH_AW_PROMPT_daad15abf8de7f2c_EOF' + cat << 'GH_AW_PROMPT_6a61e25df8c4956d_EOF' {{#runtime-import .github/workflows/pr-review-panel.md}} - GH_AW_PROMPT_daad15abf8de7f2c_EOF + GH_AW_PROMPT_6a61e25df8c4956d_EOF } > "$GH_AW_PROMPT" - name: Interpolate variables and render templates uses: actions/github-script@373c709c69115d41ff229c7e5df9f8788daa9553 # v9 @@ -450,15 +450,17 @@ jobs: mkdir -p "${RUNNER_TEMP}/gh-aw/safeoutputs" mkdir -p /tmp/gh-aw/safeoutputs mkdir -p /tmp/gh-aw/mcp-logs/safeoutputs - cat > "${RUNNER_TEMP}/gh-aw/safeoutputs/config.json" << 'GH_AW_SAFE_OUTPUTS_CONFIG_f053aa54733793da_EOF' - {"add_comment":{"max":7},"create_report_incomplete_issue":{},"missing_data":{},"missing_tool":{},"noop":{"max":1,"report-as-issue":"true"},"report_incomplete":{}} - GH_AW_SAFE_OUTPUTS_CONFIG_f053aa54733793da_EOF + cat > "${RUNNER_TEMP}/gh-aw/safeoutputs/config.json" << 'GH_AW_SAFE_OUTPUTS_CONFIG_8270e0dce94dfaaa_EOF' + {"add_comment":{"max":2},"add_labels":{"allowed":["panel-approved","panel-rejected"],"max":1},"create_report_incomplete_issue":{},"missing_data":{},"missing_tool":{},"noop":{"max":1,"report-as-issue":"true"},"remove_labels":{"allowed":["panel-review"],"max":1},"report_incomplete":{}} + GH_AW_SAFE_OUTPUTS_CONFIG_8270e0dce94dfaaa_EOF - name: Write Safe Outputs Tools env: GH_AW_TOOLS_META_JSON: | { "description_suffixes": { - "add_comment": " CONSTRAINTS: Maximum 7 comment(s) can be added. Supports reply_to_id for discussion threading." + "add_comment": " CONSTRAINTS: Maximum 2 comment(s) can be added. Supports reply_to_id for discussion threading.", + "add_labels": " CONSTRAINTS: Maximum 1 label(s) can be added. Only these labels are allowed: [\"panel-approved\" \"panel-rejected\"].", + "remove_labels": " CONSTRAINTS: Maximum 1 label(s) can be removed. Only these labels can be removed: [panel-review]." }, "repo_params": {}, "dynamic_tools": [] @@ -487,6 +489,25 @@ jobs: } } }, + "add_labels": { + "defaultMax": 5, + "fields": { + "item_number": { + "issueNumberOrTemporaryId": true + }, + "labels": { + "required": true, + "type": "array", + "itemType": "string", + "itemSanitize": true, + "itemMaxLength": 128 + }, + "repo": { + "type": "string", + "maxLength": 256 + } + } + }, "missing_data": { "defaultMax": 20, "fields": { @@ -544,6 +565,25 @@ jobs: } } }, + "remove_labels": { + "defaultMax": 5, + "fields": { + "item_number": { + "issueNumberOrTemporaryId": true + }, + "labels": { + "required": true, + "type": "array", + "itemType": "string", + "itemSanitize": true, + "itemMaxLength": 128 + }, + "repo": { + "type": "string", + "maxLength": 256 + } + } + }, "report_incomplete": { "defaultMax": 5, "fields": { @@ -636,7 +676,7 @@ jobs: export MCP_GATEWAY_DOCKER_COMMAND='docker run -i --rm --network host -v /var/run/docker.sock:/var/run/docker.sock -e MCP_GATEWAY_PORT -e MCP_GATEWAY_DOMAIN -e MCP_GATEWAY_API_KEY -e MCP_GATEWAY_PAYLOAD_DIR -e MCP_GATEWAY_PAYLOAD_SIZE_THRESHOLD -e DEBUG -e MCP_GATEWAY_LOG_DIR -e GH_AW_MCP_LOG_DIR -e GH_AW_SAFE_OUTPUTS -e GH_AW_SAFE_OUTPUTS_CONFIG_PATH -e GH_AW_SAFE_OUTPUTS_TOOLS_PATH -e GH_AW_ASSETS_BRANCH -e GH_AW_ASSETS_MAX_SIZE_KB -e GH_AW_ASSETS_ALLOWED_EXTS -e DEFAULT_BRANCH -e GITHUB_MCP_SERVER_TOKEN -e GITHUB_MCP_GUARD_MIN_INTEGRITY -e GITHUB_MCP_GUARD_REPOS -e GITHUB_REPOSITORY -e GITHUB_SERVER_URL -e GITHUB_SHA -e GITHUB_WORKSPACE -e GITHUB_TOKEN -e GITHUB_RUN_ID -e GITHUB_RUN_NUMBER -e GITHUB_RUN_ATTEMPT -e GITHUB_JOB -e GITHUB_ACTION -e GITHUB_EVENT_NAME -e GITHUB_EVENT_PATH -e GITHUB_ACTOR -e GITHUB_ACTOR_ID -e GITHUB_TRIGGERING_ACTOR -e GITHUB_WORKFLOW -e GITHUB_WORKFLOW_REF -e GITHUB_WORKFLOW_SHA -e GITHUB_REF -e GITHUB_REF_NAME -e GITHUB_REF_TYPE -e GITHUB_HEAD_REF -e GITHUB_BASE_REF -e GH_AW_SAFE_OUTPUTS_PORT -e GH_AW_SAFE_OUTPUTS_API_KEY -v /tmp/gh-aw/mcp-payloads:/tmp/gh-aw/mcp-payloads:rw -v /opt:/opt:ro -v /tmp:/tmp:rw -v '"${GITHUB_WORKSPACE}"':'"${GITHUB_WORKSPACE}"':rw ghcr.io/github/gh-aw-mcpg:v0.2.19' mkdir -p /home/runner/.copilot - cat << GH_AW_MCP_CONFIG_f34a70f444ade5b1_EOF | bash "${RUNNER_TEMP}/gh-aw/actions/start_mcp_gateway.sh" + cat << GH_AW_MCP_CONFIG_2bd9111d9b098208_EOF | bash "${RUNNER_TEMP}/gh-aw/actions/start_mcp_gateway.sh" { "mcpServers": { "github": { @@ -677,7 +717,7 @@ jobs: "payloadDir": "${MCP_GATEWAY_PAYLOAD_DIR}" } } - GH_AW_MCP_CONFIG_f34a70f444ade5b1_EOF + GH_AW_MCP_CONFIG_2bd9111d9b098208_EOF - name: Download activation artifact uses: actions/download-artifact@3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c # v8.0.1 with: @@ -1319,7 +1359,7 @@ jobs: GH_AW_ALLOWED_DOMAINS: "*.githubusercontent.com,api.business.githubcopilot.com,api.enterprise.githubcopilot.com,api.github.com,api.githubcopilot.com,api.individual.githubcopilot.com,api.snapcraft.io,archive.ubuntu.com,azure.archive.ubuntu.com,codeload.github.com,crl.geotrust.com,crl.globalsign.com,crl.identrust.com,crl.sectigo.com,crl.thawte.com,crl.usertrust.com,crl.verisign.com,crl3.digicert.com,crl4.digicert.com,crls.ssl.com,docs.github.com,github-cloud.githubusercontent.com,github-cloud.s3.amazonaws.com,github.blog,github.com,github.githubassets.com,host.docker.internal,json-schema.org,json.schemastore.org,keyserver.ubuntu.com,lfs.github.com,objects.githubusercontent.com,ocsp.digicert.com,ocsp.geotrust.com,ocsp.globalsign.com,ocsp.identrust.com,ocsp.sectigo.com,ocsp.ssl.com,ocsp.thawte.com,ocsp.usertrust.com,ocsp.verisign.com,packagecloud.io,packages.cloud.google.com,packages.microsoft.com,ppa.launchpad.net,raw.githubusercontent.com,registry.npmjs.org,s.symcb.com,s.symcd.com,security.ubuntu.com,telemetry.enterprise.githubcopilot.com,ts-crl.ws.symantec.com,ts-ocsp.ws.symantec.com,www.googleapis.com" GITHUB_SERVER_URL: ${{ github.server_url }} GITHUB_API_URL: ${{ github.api_url }} - GH_AW_SAFE_OUTPUTS_HANDLER_CONFIG: "{\"add_comment\":{\"max\":7},\"create_report_incomplete_issue\":{},\"missing_data\":{},\"missing_tool\":{},\"noop\":{\"max\":1,\"report-as-issue\":\"true\"},\"report_incomplete\":{}}" + GH_AW_SAFE_OUTPUTS_HANDLER_CONFIG: "{\"add_comment\":{\"max\":2},\"add_labels\":{\"allowed\":[\"panel-approved\",\"panel-rejected\"],\"max\":1},\"create_report_incomplete_issue\":{},\"missing_data\":{},\"missing_tool\":{},\"noop\":{\"max\":1,\"report-as-issue\":\"true\"},\"remove_labels\":{\"allowed\":[\"panel-review\"],\"max\":1},\"report_incomplete\":{}}" with: github-token: ${{ secrets.GH_AW_GITHUB_TOKEN || secrets.GITHUB_TOKEN }} script: | diff --git a/.github/workflows/pr-review-panel.md b/.github/workflows/pr-review-panel.md index b93557a0c..ae3cc0c4d 100644 --- a/.github/workflows/pr-review-panel.md +++ b/.github/workflows/pr-review-panel.md @@ -25,9 +25,16 @@ description: Multi-persona expert panel review of labelled PRs, posting a single # `gh pr diff` which return inert text # - imports are pinned to microsoft/apm#main (panel skill + # persona definitions are trusted, not from the PR) -# - the only write surface is safe-outputs.add-comment (max 7 -# is a safety ceiling; the agent is instructed to emit one -# synthesized verdict comment) +# - write surfaces are tightly scoped: +# add-comment max 2 (one CEO comment + one safety overflow) +# add-labels allowed [panel-approved, panel-rejected] max 1 +# (mutually exclusive verdict; orchestrator emits exactly one) +# remove-labels allowed [panel-review] max 1 +# (clear the trigger label after the run so re-applying it +# re-runs the panel idempotently) +# The verdict labels themselves are stripped on every new push +# by the deterministic companion workflow pr-panel-label-reset.yml +# (plain GitHub Actions, no LLM). # - `roles: [admin, maintainer, write]` ensures only repo # maintainers can trigger -- matches the trust model that # applying the `panel-review` label requires write access. @@ -99,8 +106,21 @@ network: - github safe-outputs: + # Single CEO comment per panel run. max:2 is a fail-soft ceiling; the + # one-comment discipline lives inside the apm-review-panel skill. add-comment: - max: 7 + max: 2 + # Verdict label. Mutually exclusive (orchestrator picks exactly one). + # The companion workflow pr-panel-label-reset.yml strips both on every + # new push so a stale verdict can never linger past a code change. + add-labels: + allowed: [panel-approved, panel-rejected] + max: 1 + # Trigger label cleanup. Removed after the run so re-applying + # `panel-review` re-triggers the panel cleanly. + remove-labels: + allowed: [panel-review] + max: 1 timeout-minutes: 30 --- diff --git a/apm.lock.yaml b/apm.lock.yaml index b8facdf76..eaa002865 100644 --- a/apm.lock.yaml +++ b/apm.lock.yaml @@ -33,16 +33,16 @@ local_deployed_files: - .github/skills/supply-chain-security local_deployed_file_hashes: .github/agents/agentic-workflows.agent.md: sha256:d1ea2d038e2af8be11d6c95b3213b03b9777fae46f0438efa95d5a803e6c3765 - .github/agents/apm-ceo.agent.md: sha256:dfc436e6eeffc7ec1c2f556edb78e4a5166ac36d162ea720d08b4b79af0a9938 + .github/agents/apm-ceo.agent.md: sha256:82b259a9fbef2ba44acdb354921f23ee1caf4a70df2dc8bfddac64b1ede95630 .github/agents/apm-primitives-architect.agent.md: sha256:6c01eab74ba18d70f21d45010d636cc6535d63cee81da12e61898d8036e0b028 - .github/agents/auth-expert.agent.md: sha256:85409aab097cf239e5aa7ad61db6c4586be9884ef64a45fa9c894017b046b56b - .github/agents/cli-logging-expert.agent.md: sha256:24bf6c4b420c42292700ad0eb80b53d275be5c9cb186d471d706211f8419e3b9 - .github/agents/devx-ux-expert.agent.md: sha256:3472680f43b2b4411b9437ec31529216afd4e576e1874c14430935e7f1ded1f2 + .github/agents/auth-expert.agent.md: sha256:18264a933cba432b77d133e6ae11eee294c92ed245629af8c9b7a5bb7a9a300c + .github/agents/cli-logging-expert.agent.md: sha256:3ed7fe1a2e28e03a40311d4999ef54330908920d6515205708dd3f037abfcf0f + .github/agents/devx-ux-expert.agent.md: sha256:8310d130cca5bc548baf4a2a84e3c9680c9dc5d83a2718150636896ab2aa1f30 .github/agents/doc-analyser.agent.md: sha256:47b1d0204904b786c19d4fe84343e86cdab6f92f862f676ba741ffe6e1385679 .github/agents/doc-writer.agent.md: sha256:328a5b9ea079869b8ccd914a6e2135c204225a5eedb42f59a1ec73058f7f0b47 - .github/agents/oss-growth-hacker.agent.md: sha256:8d18f5be46913c40ad3aa66fb984575a88988cfac402d39353cdfb09f7e582c5 - .github/agents/python-architect.agent.md: sha256:a70b1be6cd877a85414a65a1db8117be7dbecc3865d98045f46ed14462fda099 - .github/agents/supply-chain-security-expert.agent.md: sha256:9a4e731b12e7658f71d54c22e90f80ce0c45e3eacbb069b8505ed96ec9e79ba5 + .github/agents/oss-growth-hacker.agent.md: sha256:1cd56bb78ab37d52c50e45ab69d759f775cd49cdf35981b3dc6c4004315c6b83 + .github/agents/python-architect.agent.md: sha256:d268f86b55e137ecb09d137f6c3e12ecb9acceff34701c04f11dbdd1e08f828b + .github/agents/supply-chain-security-expert.agent.md: sha256:8fb8cc426d6af17ba084a28b3f026c2b475b62e3ca63ed2f88b83bd823f877af .github/instructions/changelog.instructions.md: sha256:1e51ec4c74e847967962bd279dc4c6e582c5d3578490b3c28d5f3acd3e05f73e .github/instructions/cicd.instructions.md: sha256:9c0fafc74f743aa97e5adba2168d66c9e3a327b135065e3b804bdbb5f04cda5d .github/instructions/cli.instructions.md: sha256:8e39e8d5047ce88575cb02f87c2bcede584dfef258bd86f7466c7badf136541a