Add evals suite to pr-description-skill (genesis-driven)#985

Merged
danielmeppiel merged 1 commit into main from feat/pr-description-skill-evals
Apr 27, 2026

Conversation

@danielmeppiel
Collaborator

add(skills): evals suite for pr-description-skill (genesis-driven)

TL;DR

Adds a self-contained evals/ bundle and a stdlib-only runner under
.apm/skills/pr-description-skill/ so maintainers can answer two
questions deterministically and without LLM keys: (a) does the
SKILL.md description: reliably trigger on real intent and avoid
near-misses, and (b) does loading the SKILL.md body measurably
change the shape of an agent's PR description vs. not loading it.
All ship gates pass.

Problem (WHY)

The pr-description-skill shipped without an evals harness, so
every change to its description or body has been an unmeasured
edit. The MODULE ENTRYPOINT canonical spec calls for content evals
(with-skill vs without-skill) and ~20 trigger evals (60/40
train/val) gated on the validation split — none of that existed.
The genesis discipline says: "if with_skill and without_skill
produce indistinguishable outputs, the skill is not adding value."
We could not answer that question.

Approach (WHAT)

Walked genesis steps 1-6 to design the bundle, persisted the
handoff packet out-of-tree (session state), then drafted the
files in step 7b. The pattern is PIPELINE / single-loop sequential (1 lens, 1 procedure); no fan-out was warranted.
Determinism is enforced everywhere: the trigger scorer is a
documented keyword/bigram matcher, content evals are pre-recorded
fixtures scored against regex rubrics, and exit codes are
machine-checkable.
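As a sketch of what such a deterministic trigger scorer could look like (the phrase lists and function name here are illustrative, not the shipped manifest contents; the sketch assumes stop phrases take precedence over primary triggers, matching the `stop_list_hit` diagnostics in the results):

```python
import re

# Illustrative phrase lists; the real ones live in evals/evals.json.
PRIMARY_PHRASES = [
    "pull request description", "pull request body", "pr template", "open the pr",
]
STOP_PHRASES = [
    "release note", "design doc", "code review comment", "summarize the diff",
]

def predict(query: str) -> tuple[str, dict]:
    """Deterministic fire/no_fire prediction with a diagnostic; no LLM involved."""
    q = re.sub(r"\s+", " ", query.lower())
    for phrase in STOP_PHRASES:       # stop list wins: adjacent intents never fire
        if phrase in q:
            return "no_fire", {"reason": "stop_list_hit", "match": phrase}
    for phrase in PRIMARY_PHRASES:    # then look for a primary trigger phrase
        if phrase in q:
            return "fire", {"reason": "primary_match", "match": phrase}
    return "no_fire", {"reason": "no_match", "match": None}
```

Because the matcher is pure string lookup, the same query always yields the same verdict and the same diagnostic, which is what makes the gate machine-checkable in CI.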

Implementation (HOW)

| Path | Purpose |
| --- | --- |
| `.apm/skills/pr-description-skill/scripts/run_evals.py` | Stdlib-only runner; `--filter`, `--split`, `--quiet`, `--no-write`, `--help`. JSON on stdout, diagnostics on stderr, exit codes 0/1/2. |
| `evals/evals.json` | Top-level manifest: gates, keyword lists, content scenario list. |
| `evals/triggers.json` | 18 trigger items (9 should-fire + 9 should-not-fire), ~60/40 train/val. Should-fire queries authored verbatim against the skill's `description:` field; should-not-fire queries pick adjacent intents (commit msg, release notes, changelog, issue, review, design doc). |
| `evals/content/{auth-refactor,docs-only,dep-bump}.json` | Three PR-shape scenarios + regex rubrics. Shapes chosen for diversity: cross-cutting refactor, docs-only, mechanical dep bump. |
| `evals/fixtures/*__with_skill.md` and `*__without_skill.md` | Pre-recorded representative outputs for each scenario × condition. Deterministic, no API key needed. README documents the regeneration policy. |
| `evals/results/.gitkeep` + `evals/.gitignore` | Track the directory; ignore generated result JSON. |
| `evals/README.md` | How to run, scoring rules, gates, how to add evals, regeneration policy. |
| `.github/skills/pr-description-skill/...` | Mirror regenerated by `apm install --target copilot`. |
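A hedged sketch of how a regex-rubric scorer along these lines could work (anchor ids mirror the result JSON; the patterns and the penalty-free scoring are simplified illustrations, since the real rubrics also carry penalty anchors, which is why without-skill scores can go negative):

```python
import re

# Illustrative rubric: anchor id -> regex applied to a fixture's markdown.
RUBRIC = {
    "tldr-present": r"(?im)^#{1,3}\s*TL;DR\b",
    "problem-section": r"(?im)^#{1,3}\s*Problem\b",
    "mermaid-block": r"(?s)```mermaid.*?```",
    "trailer": r"(?m)^Co-authored-by:",
}

def score(fixture_md: str) -> dict:
    """Count which anchors a pre-recorded fixture hits."""
    hit = [a for a, pat in RUBRIC.items() if re.search(pat, fixture_md)]
    missed = [a for a in RUBRIC if a not in hit]
    return {"score": len(hit), "anchors_hit": hit, "anchors_missed": missed}

def delta(with_skill_md: str, without_skill_md: str) -> int:
    # The ship gate requires this delta to be positive on every scenario.
    return score(with_skill_md)["score"] - score(without_skill_md)["score"]
```

Scoring pre-recorded fixtures against fixed regexes is what keeps the content evals key-free: the only moving part between runs is the fixture text itself.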

Diagram

Single-thread pipeline: the runner reads inputs, scores them, and emits results.

flowchart LR
  M["evals/evals.json"] --> R["run_evals.py"]
  T["evals/triggers.json"] --> R
  C["evals/content/*.json"] --> R
  F["evals/fixtures/*.md"] --> R
  S["SKILL.md (read-only)"] --> R
  R --> O["stdout JSON"]
  R --> E["stderr diagnostics"]
  R --> RES["evals/results/<UTC-ISO>.json"]

Trade-offs

  • Deterministic dispatcher approximation, not a real LLM:
    the trigger scorer is a keyword/bigram matcher, not a live
    dispatcher. It is fast, key-free, and CI-friendly; the README
    is explicit about the limitation and the manifest schema
    accommodates a future --llm flag.
  • Pre-recorded fixtures, not regenerated per run: trades
    freshness for determinism. Maintainers regenerate them when
    SKILL.md materially changes (documented in the README).
  • Eval bundle ships under the skill's source tree: a future
    user-facing distribution-weight regression could push it out
    to a maintainer-only tests/skill-evals/ directory; flagged
    in the genesis handoff packet as LOW severity, not blocking
    today.
  • Rejected: a separate eval-runner skill primitive. The
    runner is a maintainer script, not a dispatched module — no
    trigger description, no agentic reasoning, no need to make it
    a sibling SKILL.

Benefits

  1. The skill now ships with measurable ship gates (val-split
    trigger gate + per-scenario content delta gate).
  2. Future SKILL.md edits get a regression check for free.
  3. Adding a new scenario or trigger query is one JSON edit.
  4. Zero new dependencies (Python stdlib).

Validation

Ran the runner from the worktree root after every author step.
Final results from this commit:

$ python .apm/skills/pr-description-skill/scripts/run_evals.py
| Family | Metric | Value | Gate | Met |
| --- | --- | --- | --- | --- |
| triggers (val split) | should-fire correctness | 1.00 | >= 0.5 | yes |
| triggers (val split) | should-not-fire correctness | 1.00 | >= 0.5 | yes |
| content | auth-refactor delta_anchors | 11 | >= 1 | yes |
| content | docs-only delta_anchors | 6 | >= 1 | yes |
| content | dep-bump delta_anchors | 9 | >= 1 | yes |
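The gates in that run reduce to simple arithmetic over the per-row verdicts; a sketch of the aggregation (field names follow the result JSON; the 0.5 thresholds are the manifest defaults shown above):

```python
def triggers_gate(rows: list[dict], fire_rate_min: float = 0.5,
                  miss_rate_max: float = 0.5) -> bool:
    """Aggregate per-row trigger verdicts into the two val-split gates."""
    fire = [r for r in rows if r["expected"] == "fire"]
    no_fire = [r for r in rows if r["expected"] == "no_fire"]
    fire_rate = sum(r["passed"] for r in fire) / len(fire)
    # Miss rate: fraction of should-not-fire queries that leaked through.
    miss_rate = sum(not r["passed"] for r in no_fire) / len(no_fire)
    return fire_rate >= fire_rate_min and miss_rate <= miss_rate_max
```

With 4/4 should-fire and 4/4 should-not-fire rows correct on the val split, both gates clear with room to spare.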

Targeted unit tests:

$ uv run pytest tests/unit -x -k "skill or pr_description"
320 passed, 5254 deselected in 11.27s

Raw result JSON inline below.

Full evals/results JSON (val split, all scenarios)
{
  "schema_version": 1,
  "skill": "pr-description-skill",
  "timestamp_utc": "2026-04-27T09:02:47.191634+00:00",
  "filter": "all",
  "split": "val",
  "triggers": {
    "split": "val",
    "should_fire_correct_rate": 1.0,
    "should_not_fire_correct_rate": 1.0,
    "should_fire_gate": {
      "min": 0.5,
      "met": true
    },
    "should_not_fire_gate": {
      "max_miss_rate": 0.5,
      "met": true
    },
    "rows": [
      {
        "id": "fire-06",
        "split": "val",
        "query": "let's open the PR -- write the body",
        "expected": "fire",
        "predicted": "fire",
        "passed": true,
        "diagnostic": {
          "reason": "primary_match",
          "match": "open the pr"
        }
      },
      {
        "id": "fire-07",
        "split": "val",
        "query": "draft a pull request description for this branch",
        "expected": "fire",
        "predicted": "fire",
        "passed": true,
        "diagnostic": {
          "reason": "primary_match",
          "match": "pull request description"
        }
      },
      {
        "id": "fire-08",
        "split": "val",
        "query": "pls write the pull request body",
        "expected": "fire",
        "predicted": "fire",
        "passed": true,
        "diagnostic": {
          "reason": "primary_match",
          "match": "pull request body"
        }
      },
      {
        "id": "fire-09",
        "split": "val",
        "query": "fill in the PR template before we ship",
        "expected": "fire",
        "predicted": "fire",
        "passed": true,
        "diagnostic": {
          "reason": "primary_match",
          "match": "pr template"
        }
      },
      {
        "id": "no-fire-06",
        "split": "val",
        "query": "summarize the diff for me",
        "expected": "no_fire",
        "predicted": "no_fire",
        "passed": true,
        "diagnostic": {
          "reason": "stop_list_hit",
          "match": "summarize the diff"
        }
      },
      {
        "id": "no-fire-07",
        "split": "val",
        "query": "draft a code review comment",
        "expected": "no_fire",
        "predicted": "no_fire",
        "passed": true,
        "diagnostic": {
          "reason": "stop_list_hit",
          "match": "code review comment"
        }
      },
      {
        "id": "no-fire-08",
        "split": "val",
        "query": "write the design doc for this branch",
        "expected": "no_fire",
        "predicted": "no_fire",
        "passed": true,
        "diagnostic": {
          "reason": "stop_list_hit",
          "match": "design doc"
        }
      },
      {
        "id": "no-fire-09",
        "split": "val",
        "query": "give me release notes for this milestone",
        "expected": "no_fire",
        "predicted": "no_fire",
        "passed": true,
        "diagnostic": {
          "reason": "stop_list_hit",
          "match": "release note"
        }
      }
    ],
    "passed": true
  },
  "content": {
    "delta_min_anchors": 1,
    "scenarios": [
      {
        "id": "auth-refactor",
        "summary": "Cross-cutting auth refactor: token resolution moves into a single AuthResolver; multiple call sites updated; one breaking change to a public CLI flag. PR is non-trivial and exercises every required section.",
        "with_skill": {
          "score": 13,
          "anchors_hit": [
            "tldr-present",
            "tldr-short",
            "problem-section",
            "anchored-quote",
            "approach-or-implementation",
            "mermaid-block",
            "tradeoffs-section",
            "validation-section",
            "validation-real-output",
            "how-to-test",
            "trailer"
          ],
          "anchors_missed": []
        },
        "without_skill": {
          "score": 0,
          "anchors_hit": [],
          "anchors_missed": [
            "tldr-present",
            "tldr-short",
            "problem-section",
            "anchored-quote",
            "approach-or-implementation",
            "mermaid-block",
            "tradeoffs-section",
            "validation-section",
            "validation-real-output",
            "how-to-test",
            "trailer"
          ]
        },
        "delta_score": 13,
        "delta_anchors": 11,
        "passed": true
      },
      {
        "id": "docs-only",
        "summary": "Markdown-only PR adding a quickstart guide. The skill must keep TL;DR, Problem, Validation, How-to-test sections even though the PR is 'trivial' (per SKILL.md gotcha).",
        "with_skill": {
          "score": 8,
          "anchors_hit": [
            "tldr-present",
            "problem-section",
            "anchored-quote",
            "validation-section",
            "validation-build",
            "how-to-test",
            "trailer"
          ],
          "anchors_missed": []
        },
        "without_skill": {
          "score": -1,
          "anchors_hit": [
            "validation-build",
            "no-trivial-skip"
          ],
          "anchors_missed": [
            "tldr-present",
            "problem-section",
            "anchored-quote",
            "validation-section",
            "how-to-test",
            "trailer"
          ]
        },
        "delta_score": 9,
        "delta_anchors": 6,
        "passed": true
      },
      {
        "id": "dep-bump",
        "summary": "Mechanical dependency upgrade: bump click from 8.1.7 to 8.2.1; lockfile-only changes plus a 1-line type annotation tweak. Tests Trade-offs may be 1-2 bullets and mermaid is optional.",
        "with_skill": {
          "score": 9,
          "anchors_hit": [
            "tldr-present",
            "tldr-short",
            "problem-section",
            "implementation-section",
            "tradeoffs-section",
            "validation-section",
            "validation-real-output",
            "how-to-test",
            "trailer"
          ],
          "anchors_missed": []
        },
        "without_skill": {
          "score": -2,
          "anchors_hit": [
            "no-marketing-tone"
          ],
          "anchors_missed": [
            "tldr-present",
            "tldr-short",
            "problem-section",
            "implementation-section",
            "tradeoffs-section",
            "validation-section",
            "validation-real-output",
            "how-to-test",
            "trailer"
          ]
        },
        "delta_score": 11,
        "delta_anchors": 9,
        "passed": true
      }
    ],
    "passed": true
  },
  "passed": true
}

Genesis findings on the skill itself

The genesis handoff packet recorded one LOW-severity finding:
the eval bundle adds maintainer-only weight inside the skill's
distribution surface (see Step 3.5 of the packet). Not blocking
today; revisit if user-visible install-size regresses.

No findings of higher severity. The content-eval delta is
positive on every scenario (smallest delta = 6 anchors), so
the skill IS adding measurable value over the no-skill baseline
on every shape we tested. Per the genesis discipline, this
clears the "redesign or delete" gate.

How to test

  • python .apm/skills/pr-description-skill/scripts/run_evals.py
    — expect overall passed: true and exit code 0.
  • python .apm/skills/pr-description-skill/scripts/run_evals.py --filter triggers --split all
    — expect 18/18 trigger evals correct.
  • python .apm/skills/pr-description-skill/scripts/run_evals.py --help
    — expect a documented argparse help block.
  • uv run pytest tests/unit -x -k "skill or pr_description"
    — expect 320 passing.
  • Inspect .apm/skills/pr-description-skill/evals/README.md
    and confirm the scoring rules and regeneration policy
    match expectations.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds a self-contained evals/ bundle and a non-interactive runner
under .apm/skills/pr-description-skill/. Mirror regenerated via
apm install --target copilot.

- evals/evals.json: top-level manifest + ship gates
- evals/triggers.json: 18 trigger items (9 fire / 9 no-fire),
  60/40 train/val split
- evals/content/*.json: 3 PR-shape scenarios (cross-cutting
  refactor, docs-only, mechanical dep bump) each with regex
  rubric and with_skill/without_skill fixtures
- evals/fixtures/*.md: pre-recorded sample outputs for the two
  conditions; deterministic, no LLM API required at runtime
- scripts/run_evals.py: stdlib-only runner; --filter, --split,
  --quiet, --no-write, --help; JSON on stdout, diagnostics on
  stderr, exit codes 0/1/2
- evals/results/.gitkeep + evals/.gitignore: track the dir,
  ignore generated result JSON
- evals/README.md: how to run, scoring rules, gates, how to
  add evals, regeneration policy

Genesis handoff packet persisted out-of-tree at the session
state dir; design ends at step 6, this commit is step 7b output.

Validation (val split):
- triggers should-fire correctness = 1.0 (gate >= 0.5)
- triggers should-not-fire correctness = 1.0 (gate >= 0.5)
- content delta_anchors per scenario = 11, 6, 9 (gate >= 1)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 27, 2026 09:03

@danielmeppiel danielmeppiel merged commit ee29b0b into main Apr 27, 2026
17 checks passed
@danielmeppiel danielmeppiel deleted the feat/pr-description-skill-evals branch April 27, 2026 09:05
@danielmeppiel
Collaborator Author

Post-merge evals run on main

Ran the evals harness on main after PR #985 merged. Confirms the suite works on the as-merged tree.

$ uv run python .apm/skills/pr-description-skill/scripts/run_evals.py

Result: PASSED (file: 2026-04-27T09-07-33.079410Z.json)

Trigger evals (val split = ship gate)

| Metric | Value | Gate | Met? |
| --- | --- | --- | --- |
| should_fire_correct_rate | 1.00 | >= 0.50 | yes |
| should_not_fire_correct_rate | 1.00 | miss rate <= 0.50 | yes |

Content evals (with_skill vs without_skill)

| Scenario | with | without | delta_anchors | passed |
| --- | --- | --- | --- | --- |
| auth-refactor | 13 | 0 | +11 | yes |
| docs-only | 8 | -1 | +6 | yes |
| dep-bump | 9 | -2 | +9 | yes |

All scenarios clear the genesis "redesign or delete" gate (delta >= 1).

Raw result JSON
{
  "schema_version": 1,
  "skill": "pr-description-skill",
  "timestamp_utc": "2026-04-27T09:07:33.079410+00:00",
  "filter": "all",
  "split": "val",
  "triggers": {
    "split": "val",
    "should_fire_correct_rate": 1.0,
    "should_not_fire_correct_rate": 1.0,
    "should_fire_gate": {
      "min": 0.5,
      "met": true
    },
    "should_not_fire_gate": {
      "max_miss_rate": 0.5,
      "met": true
    },
    "rows": [
      {
        "id": "fire-06",
        "split": "val",
        "query": "let's open the PR -- write the body",
        "expected": "fire",
        "predicted": "fire",
        "passed": true,
        "diagnostic": {
          "reason": "primary_match",
          "match": "open the pr"
        }
      },
      {
        "id": "fire-07",
        "split": "val",
        "query": "draft a pull request description for this branch",
        "expected": "fire",
        "predicted": "fire",
        "passed": true,
        "diagnostic": {
          "reason": "primary_match",
          "match": "pull request description"
        }
      },
      {
        "id": "fire-08",
        "split": "val",
        "query": "pls write the pull request body",
        "expected": "fire",
        "predicted": "fire",
        "passed": true,
        "diagnostic": {
          "reason": "primary_match",
          "match": "pull request body"
        }
      },
      {
        "id": "fire-09",
        "split": "val",
        "query": "fill in the PR template before we ship",
        "expected": "fire",
        "predicted": "fire",
        "passed": true,
        "diagnostic": {
          "reason": "primary_match",
          "match": "pr template"
        }
      },
      {
        "id": "no-fire-06",
        "split": "val",
        "query": "summarize the diff for me",
        "expected": "no_fire",
        "predicted": "no_fire",
        "passed": true,
        "diagnostic": {
          "reason": "stop_list_hit",
          "match": "summarize the diff"
        }
      },
      {
        "id": "no-fire-07",
        "split": "val",
        "query": "draft a code review comment",
        "expected": "no_fire",
        "predicted": "no_fire",
        "passed": true,
        "diagnostic": {
          "reason": "stop_list_hit",
          "match": "code review comment"
        }
      },
      {
        "id": "no-fire-08",
        "split": "val",
        "query": "write the design doc for this branch",
        "expected": "no_fire",
        "predicted": "no_fire",
        "passed": true,
        "diagnostic": {
          "reason": "stop_list_hit",
          "match": "design doc"
        }
      },
      {
        "id": "no-fire-09",
        "split": "val",
        "query": "give me release notes for this milestone",
        "expected": "no_fire",
        "predicted": "no_fire",
        "passed": true,
        "diagnostic": {
          "reason": "stop_list_hit",
          "match": "release note"
        }
      }
    ],
    "passed": true
  },
  "content": {
    "delta_min_anchors": 1,
    "scenarios": [
      {
        "id": "auth-refactor",
        "summary": "Cross-cutting auth refactor: token resolution moves into a single AuthResolver; multiple call sites updated; one breaking change to a public CLI flag. PR is non-trivial and exercises every required section.",
        "with_skill": {
          "score": 13,
          "anchors_hit": [
            "tldr-present",
            "tldr-short",
            "problem-section",
            "anchored-quote",
            "approach-or-implementation",
            "mermaid-block",
            "tradeoffs-section",
            "validation-section",
            "validation-real-output",
            "how-to-test",
            "trailer"
          ],
          "anchors_missed": []
        },
        "without_skill": {
          "score": 0,
          "anchors_hit": [],
          "anchors_missed": [
            "tldr-present",
            "tldr-short",
            "problem-section",
            "anchored-quote",
            "approach-or-implementation",
            "mermaid-block",
            "tradeoffs-section",
            "validation-section",
            "validation-real-output",
            "how-to-test",
            "trailer"
          ]
        },
        "delta_score": 13,
        "delta_anchors": 11,
        "passed": true
      },
      {
        "id": "docs-only",
        "summary": "Markdown-only PR adding a quickstart guide. The skill must keep TL;DR, Problem, Validation, How-to-test sections even though the PR is 'trivial' (per SKILL.md gotcha).",
        "with_skill": {
          "score": 8,
          "anchors_hit": [
            "tldr-present",
            "problem-section",
            "anchored-quote",
            "validation-section",
            "validation-build",
            "how-to-test",
            "trailer"
          ],
          "anchors_missed": []
        },
        "without_skill": {
          "score": -1,
          "anchors_hit": [
            "validation-build",
            "no-trivial-skip"
          ],
          "anchors_missed": [
            "tldr-present",
            "problem-section",
            "anchored-quote",
            "validation-section",
            "how-to-test",
            "trailer"
          ]
        },
        "delta_score": 9,
        "delta_anchors": 6,
        "passed": true
      },
      {
        "id": "dep-bump",
        "summary": "Mechanical dependency upgrade: bump click from 8.1.7 to 8.2.1; lockfile-only changes plus a 1-line type annotation tweak. Tests Trade-offs may be 1-2 bullets and mermaid is optional.",
        "with_skill": {
          "score": 9,
          "anchors_hit": [
            "tldr-present",
            "tldr-short",
            "problem-section",
            "implementation-section",
            "tradeoffs-section",
            "validation-section",
            "validation-real-output",
            "how-to-test",
            "trailer"
          ],
          "anchors_missed": []
        },
        "without_skill": {
          "score": -2,
          "anchors_hit": [
            "no-marketing-tone"
          ],
          "anchors_missed": [
            "tldr-present",
            "tldr-short",
            "problem-section",
            "implementation-section",
            "tradeoffs-section",
            "validation-section",
            "validation-real-output",
            "how-to-test",
            "trailer"
          ]
        },
        "delta_score": 11,
        "delta_anchors": 9,
        "passed": true
      }
    ],
    "passed": true
  },
  "passed": true
}

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Contributor

Copilot AI left a comment


Pull request overview

Adds a deterministic, stdlib-only eval harness for pr-description-skill, intended to let maintainers validate trigger matching (should-fire/should-not-fire) and measure content “shape” deltas (with-skill vs without-skill) without requiring any LLM/API keys. The changes are mirrored into both .apm/ (authoritative sources) and .github/ (Copilot target output).

Changes:

  • Introduces scripts/run_evals.py runner (JSON summary on stdout, diagnostics on stderr, exit codes 0/1/2).
  • Adds trigger eval dataset (triggers.json) plus content scenarios, fixtures, and rubrics under evals/.
  • Adds eval bundle docs and ignores generated results while keeping a tracked results/ directory.
Summary per file
| File | Description |
| --- | --- |
| `.github/skills/pr-description-skill/scripts/run_evals.py` | Mirrored eval runner script for Copilot target |
| `.github/skills/pr-description-skill/evals/evals.json` | Mirrored top-level eval manifest (gates + keyword lists) |
| `.github/skills/pr-description-skill/evals/triggers.json` | Mirrored trigger eval items (train/val split) |
| `.github/skills/pr-description-skill/evals/content/auth-refactor.json` | Mirrored content scenario + rubric |
| `.github/skills/pr-description-skill/evals/content/docs-only.json` | Mirrored content scenario + rubric |
| `.github/skills/pr-description-skill/evals/content/dep-bump.json` | Mirrored content scenario + rubric |
| `.github/skills/pr-description-skill/evals/fixtures/auth-refactor__with_skill.md` | Mirrored with-skill fixture output |
| `.github/skills/pr-description-skill/evals/fixtures/auth-refactor__without_skill.md` | Mirrored without-skill fixture output |
| `.github/skills/pr-description-skill/evals/fixtures/docs-only__with_skill.md` | Mirrored with-skill fixture output |
| `.github/skills/pr-description-skill/evals/fixtures/docs-only__without_skill.md` | Mirrored without-skill fixture output |
| `.github/skills/pr-description-skill/evals/fixtures/dep-bump__with_skill.md` | Mirrored with-skill fixture output |
| `.github/skills/pr-description-skill/evals/fixtures/dep-bump__without_skill.md` | Mirrored without-skill fixture output |
| `.github/skills/pr-description-skill/evals/results/.gitkeep` | Mirrored tracked sentinel for results directory |
| `.github/skills/pr-description-skill/evals/.gitignore` | Mirrored ignore for generated result JSON files |
| `.github/skills/pr-description-skill/evals/README.md` | Mirrored documentation for running/scoring evals |
| `.apm/skills/pr-description-skill/scripts/run_evals.py` | Source eval runner script for skill |
| `.apm/skills/pr-description-skill/evals/evals.json` | Source top-level eval manifest (gates + keyword lists) |
| `.apm/skills/pr-description-skill/evals/triggers.json` | Source trigger eval items (train/val split) |
| `.apm/skills/pr-description-skill/evals/content/auth-refactor.json` | Source content scenario + rubric |
| `.apm/skills/pr-description-skill/evals/content/docs-only.json` | Source content scenario + rubric |
| `.apm/skills/pr-description-skill/evals/content/dep-bump.json` | Source content scenario + rubric |
| `.apm/skills/pr-description-skill/evals/fixtures/auth-refactor__with_skill.md` | Source with-skill fixture output |
| `.apm/skills/pr-description-skill/evals/fixtures/auth-refactor__without_skill.md` | Source without-skill fixture output |
| `.apm/skills/pr-description-skill/evals/fixtures/docs-only__with_skill.md` | Source with-skill fixture output |
| `.apm/skills/pr-description-skill/evals/fixtures/docs-only__without_skill.md` | Source without-skill fixture output |
| `.apm/skills/pr-description-skill/evals/fixtures/dep-bump__with_skill.md` | Source with-skill fixture output |
| `.apm/skills/pr-description-skill/evals/fixtures/dep-bump__without_skill.md` | Source without-skill fixture output |
| `.apm/skills/pr-description-skill/evals/results/.gitkeep` | Source tracked sentinel for results directory |
| `.apm/skills/pr-description-skill/evals/.gitignore` | Source ignore for generated result JSON files |
| `.apm/skills/pr-description-skill/evals/README.md` | Source documentation for running/scoring evals |

Copilot's findings

  • Files reviewed: 28/30 changed files
  • Comments generated: 16

Comment on lines +141 to +145
gates = manifest["gates"]["triggers"]
fire_gate_met = fire_rate >= gates["should_fire_rate_min"]
no_fire_gate_met = (
    (no_fire_rate) >= (1.0 - gates["should_not_fire_rate_max"])
)

Copilot AI Apr 27, 2026

The gate field name should_not_fire_rate_max is treated as a max miss/leak rate (see 1.0 - ...) and is reported as max_miss_rate in output. This is easy to misconfigure because the name reads like a maximum correctness rate. Rename the manifest key to something like should_not_fire_miss_rate_max (and update the output/README accordingly), or adjust the logic to match the current name.
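A minimal sketch of the renamed-key option, assuming a helper shaped like the runner's gate check (the function name and the `should_not_fire_miss_rate_max` key are illustrative, matching the suggestion rather than the current code):

```python
# Sketch: gate check where the key name states its semantics --
# a cap on the miss/leak rate for should-not-fire items.
def check_trigger_gates(
    fire_rate: float, no_fire_rate: float, gates: dict
) -> dict:
    # fire_rate: fraction of should-fire items that fired (higher is better).
    # no_fire_rate: fraction of should-not-fire items that stayed silent.
    miss_rate = 1.0 - no_fire_rate  # fraction that leaked (fired when they shouldn't)
    return {
        "fire_gate_met": fire_rate >= gates["should_fire_rate_min"],
        "no_fire_gate_met": miss_rate <= gates["should_not_fire_miss_rate_max"],
        "max_miss_rate": gates["should_not_fire_miss_rate_max"],
    }
```

With this shape, `1.0 - …` disappears from the comparison and the manifest key reads the same way the runner reports it.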

Comment on lines +101 to +109
def run_trigger_evals(
    manifest: dict[str, Any], split: str
) -> dict[str, Any]:
    triggers_path = EVALS_DIR / manifest["triggers_manifest"]
    triggers = _load_json(triggers_path)["items"]

    if split != "all":
        triggers = [t for t in triggers if t["split"] == split]


Copilot AI Apr 27, 2026

--split all is documented as "score both splits", but run_trigger_evals() currently aggregates train+val together into a single rate and applies the gate to that aggregate. This makes the ship gate ambiguous (it should typically be evaluated on the configured gate split, e.g. val). Consider returning per-split summaries when split=="all" and applying the gate only to manifest["gates"]["triggers"]["split"] (or disallow all for gating).
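A minimal sketch of the per-split suggestion (the helper name and item fields `split`, `fired`, `should_fire` are illustrative; the real runner's item schema may differ):

```python
# Sketch: score each split separately and expose the rate for the
# manifest-configured gate split, so "--split all" stays informational
# instead of gating on a train+val aggregate.
def summarize_by_split(items: list[dict], gate_split: str) -> dict:
    splits: dict[str, dict] = {}
    for item in items:
        s = splits.setdefault(item["split"], {"total": 0, "correct": 0})
        s["total"] += 1
        s["correct"] += int(item["fired"] == item["should_fire"])
    for s in splits.values():
        s["rate"] = s["correct"] / s["total"]
    # Only the configured split's rate feeds the ship gate.
    return {"splits": splits, "gate_rate": splits[gate_split]["rate"]}
```

The gate comparison would then use `gate_rate` while the per-split breakdown still appears in the JSON summary.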

Comment on lines +168 to +175
def score_fixture(text: str, rubric: list[dict[str, Any]]) -> dict[str, Any]:
    score = 0
    anchors_hit: list[str] = []
    anchors_missed: list[str] = []
    for check in rubric:
        pattern = re.compile(check["pattern"])
        match = bool(pattern.search(text))
        weight = int(check["weight"])

Copilot AI Apr 27, 2026

score_fixture() compiles regex patterns from the scenario JSON without handling invalid patterns. A malformed pattern will raise re.error and crash with exit code 1 (stack trace) instead of emitting a structured runner error with exit code 2 like other manifest/fixture failures. Catch re.error (and possibly KeyError/TypeError for malformed rubric entries) and convert it into a logged [x] error + sys.exit(2).
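A minimal sketch of the suggested guard, assuming rubric entries carry `pattern` and `weight` keys as in the snippet above (the helper name is illustrative):

```python
import re
import sys

# Sketch: compile rubric patterns up front and turn a malformed entry
# into a structured runner error (exit code 2) instead of letting
# re.error escape as an uncaught traceback (exit code 1).
def compile_rubric(rubric: list[dict]) -> list[tuple[re.Pattern, int]]:
    compiled = []
    for check in rubric:
        try:
            pattern = re.compile(check["pattern"])
            weight = int(check["weight"])
        except (re.error, KeyError, TypeError, ValueError) as exc:
            print(f"[x] malformed rubric entry {check!r}: {exc}", file=sys.stderr)
            sys.exit(2)
        compiled.append((pattern, weight))
    return compiled
```

`score_fixture()` would then iterate the pre-compiled list, keeping its scoring loop free of error handling.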

"triggers": {
  "split": "val",
  "should_fire_rate_min": 0.5,
  "should_not_fire_rate_max": 0.5

Copilot AI Apr 27, 2026

The triggers gate key should_not_fire_rate_max reads like a maximum correctness rate, but the runner treats it as a maximum miss/leak rate (see output field max_miss_rate). Rename this key to reflect its semantics (e.g. should_not_fire_miss_rate_max) or adjust the runner logic so the name and behavior match.

Suggested change
"should_not_fire_rate_max": 0.5
"should_not_fire_miss_rate_max": 0.5

Comment on lines +205 to +211
with_score = score_fixture(with_text, scenario["rubric"])
without_score = score_fixture(without_text, scenario["rubric"])

delta_score = with_score["score"] - without_score["score"]
delta_anchors = len(
    set(with_score["anchors_hit"]) - set(without_score["anchors_hit"])
)

Copilot AI Apr 27, 2026

delta_anchors is computed from anchors_hit, but score_fixture() currently appends penalty checks (negative weights) into anchors_hit when they match. This means a worse with_skill output that triggers a penalty (e.g., marketing tone) could increase delta_anchors and still pass the content gate. Consider tracking penalty hits separately (or excluding checks with weight < 0 from anchors_hit / delta_anchors calculations).
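One way the suggested separation could look, as a sketch rather than the runner's actual implementation (rubric entries are assumed to carry `id`, `pattern`, and `weight`):

```python
import re

# Sketch: keep penalty checks (weight < 0) out of anchors_hit so
# delta_anchors only counts positive-evidence anchors; matched
# penalties still lower the score but are reported separately.
def score_fixture(text: str, rubric: list[dict]) -> dict:
    score = 0
    anchors_hit: list[str] = []
    anchors_missed: list[str] = []
    penalties_hit: list[str] = []
    for check in rubric:
        matched = bool(re.search(check["pattern"], text))
        weight = int(check["weight"])
        if weight < 0:
            if matched:
                score += weight  # negative weight lowers the score
                penalties_hit.append(check["id"])
            continue
        if matched:
            score += weight
            anchors_hit.append(check["id"])
        else:
            anchors_missed.append(check["id"])
    return {
        "score": score,
        "anchors_hit": anchors_hit,
        "anchors_missed": anchors_missed,
        "penalties_hit": penalties_hit,
    }
```

With penalties excluded from `anchors_hit`, a with-skill fixture that trips a marketing-tone penalty can no longer inflate `delta_anchors`.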

ts = summary["timestamp_utc"].replace(":", "-").replace("+00-00", "Z")
out_path = results_dir / f"{ts}.json"
out_path.write_text(json.dumps(summary, indent=2) + "\n", encoding="utf-8")
summary["result_file"] = str(out_path.relative_to(SKILL_DIR.parent.parent.parent))

Copilot AI Apr 27, 2026

summary["result_file"] = out_path.relative_to(...) will raise ValueError if the user passes --results-dir outside the repo/skill tree, causing the runner to crash after successfully writing results. Consider guarding this with try/except ValueError and falling back to an absolute path (or omit result_file) so --results-dir remains usable.

Suggested change
summary["result_file"] = str(out_path.relative_to(SKILL_DIR.parent.parent.parent))
try:
    summary["result_file"] = str(
        out_path.relative_to(SKILL_DIR.parent.parent.parent)
    )
except ValueError:
    summary["result_file"] = str(out_path.resolve())

Comment on lines +266 to +269
"--manifest",
default=str(EVALS_DIR / "evals.json"),
help="Path to evals manifest (default: <skill>/evals/evals.json).",
)

Copilot AI Apr 27, 2026

The --manifest flag suggests you can point the runner at an alternate manifest, but path resolution for triggers_manifest / content_manifests is hard-coded to EVALS_DIR (derived from the script location), not relative to the provided manifest file. If a user passes a manifest path outside the default directory, the runner will still try to load triggers/content from the built-in evals/ dir. Consider resolving relative paths against Path(args.manifest).parent (and passing that base into run_trigger_evals / run_content_evals).
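A minimal sketch of the manifest-relative resolution, assuming a small helper (the function name is illustrative; the runner would call it wherever it currently joins `EVALS_DIR` with a manifest entry):

```python
from pathlib import Path

# Sketch: anchor sub-manifest lookups to the directory of the manifest
# the user passed via --manifest, instead of the script-derived
# EVALS_DIR, so an alternate manifest brings its own triggers/content files.
def resolve_manifest_entry(manifest_path: Path, entry: str) -> Path:
    base = manifest_path.parent
    p = Path(entry)
    return p if p.is_absolute() else base / p
```

`run_trigger_evals` / `run_content_evals` would then receive the manifest path (or its parent as a base directory) rather than relying on the module-level constant.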

danielmeppiel pushed a commit that referenced this pull request Apr 27, 2026
Promotes [Unreleased] to [0.10.0] - 2026-04-27. Each PR since v0.9.4
gets one 'so what' line:

- #926 Microsoft 365 Cowork target ships impl
- #790 marketplace authoring CLI (init, package add/set, build, check,
  outdated, doctor, publish) -- collapsed from 20+ bullets to one
- #722 marketplace plugin -> package rename + --help sectioning -- collapsed
- #980 README 'Coming from npx skills add' conversion block
- #981 docs auto-deploy on tag push (real fix for the #953 attempt)
- #985 pr-description-skill evals suite
- #984 pr-description-skill mermaid hardening
- #989 cowork sys.platform mock for Windows CI

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>