Add evals suite to pr-description-skill (genesis-driven) #985

danielmeppiel merged 1 commit into main
Conversation
Adds a self-contained evals/ bundle and a non-interactive runner under .apm/skills/pr-description-skill/. Mirror regenerated via apm install --target copilot.

- evals/evals.json: top-level manifest + ship gates
- evals/triggers.json: 18 trigger items (9 fire / 9 no-fire), 60/40 train/val split
- evals/content/*.json: 3 PR-shape scenarios (cross-cutting refactor, docs-only, mechanical dep bump), each with a regex rubric and with_skill/without_skill fixtures
- evals/fixtures/*.md: pre-recorded sample outputs for the two conditions; deterministic, no LLM API required at runtime
- scripts/run_evals.py: stdlib-only runner; --filter, --split, --quiet, --no-write, --help; JSON on stdout, diagnostics on stderr, exit codes 0/1/2
- evals/results/.gitkeep + evals/.gitignore: track the dir, ignore generated result JSON
- evals/README.md: how to run, scoring rules, gates, how to add evals, regeneration policy

Genesis handoff packet persisted out-of-tree at the session state dir; design ends at step 6, this commit is step 7b output.

Validation (val split):

- triggers should-fire correctness = 1.0 (gate >= 0.5)
- triggers should-not-fire correctness = 1.0 (gate >= 0.5)
- content delta_anchors per scenario = 11, 6, 9 (gate >= 1)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Raw evals/results JSON (val split, all scenarios):

{
"schema_version": 1,
"skill": "pr-description-skill",
"timestamp_utc": "2026-04-27T09:02:47.191634+00:00",
"filter": "all",
"split": "val",
"triggers": {
"split": "val",
"should_fire_correct_rate": 1.0,
"should_not_fire_correct_rate": 1.0,
"should_fire_gate": {
"min": 0.5,
"met": true
},
"should_not_fire_gate": {
"max_miss_rate": 0.5,
"met": true
},
"rows": [
{
"id": "fire-06",
"split": "val",
"query": "let's open the PR -- write the body",
"expected": "fire",
"predicted": "fire",
"passed": true,
"diagnostic": {
"reason": "primary_match",
"match": "open the pr"
}
},
{
"id": "fire-07",
"split": "val",
"query": "draft a pull request description for this branch",
"expected": "fire",
"predicted": "fire",
"passed": true,
"diagnostic": {
"reason": "primary_match",
"match": "pull request description"
}
},
{
"id": "fire-08",
"split": "val",
"query": "pls write the pull request body",
"expected": "fire",
"predicted": "fire",
"passed": true,
"diagnostic": {
"reason": "primary_match",
"match": "pull request body"
}
},
{
"id": "fire-09",
"split": "val",
"query": "fill in the PR template before we ship",
"expected": "fire",
"predicted": "fire",
"passed": true,
"diagnostic": {
"reason": "primary_match",
"match": "pr template"
}
},
{
"id": "no-fire-06",
"split": "val",
"query": "summarize the diff for me",
"expected": "no_fire",
"predicted": "no_fire",
"passed": true,
"diagnostic": {
"reason": "stop_list_hit",
"match": "summarize the diff"
}
},
{
"id": "no-fire-07",
"split": "val",
"query": "draft a code review comment",
"expected": "no_fire",
"predicted": "no_fire",
"passed": true,
"diagnostic": {
"reason": "stop_list_hit",
"match": "code review comment"
}
},
{
"id": "no-fire-08",
"split": "val",
"query": "write the design doc for this branch",
"expected": "no_fire",
"predicted": "no_fire",
"passed": true,
"diagnostic": {
"reason": "stop_list_hit",
"match": "design doc"
}
},
{
"id": "no-fire-09",
"split": "val",
"query": "give me release notes for this milestone",
"expected": "no_fire",
"predicted": "no_fire",
"passed": true,
"diagnostic": {
"reason": "stop_list_hit",
"match": "release note"
}
}
],
"passed": true
},
"content": {
"delta_min_anchors": 1,
"scenarios": [
{
"id": "auth-refactor",
"summary": "Cross-cutting auth refactor: token resolution moves into a single AuthResolver; multiple call sites updated; one breaking change to a public CLI flag. PR is non-trivial and exercises every required section.",
"with_skill": {
"score": 13,
"anchors_hit": [
"tldr-present",
"tldr-short",
"problem-section",
"anchored-quote",
"approach-or-implementation",
"mermaid-block",
"tradeoffs-section",
"validation-section",
"validation-real-output",
"how-to-test",
"trailer"
],
"anchors_missed": []
},
"without_skill": {
"score": 0,
"anchors_hit": [],
"anchors_missed": [
"tldr-present",
"tldr-short",
"problem-section",
"anchored-quote",
"approach-or-implementation",
"mermaid-block",
"tradeoffs-section",
"validation-section",
"validation-real-output",
"how-to-test",
"trailer"
]
},
"delta_score": 13,
"delta_anchors": 11,
"passed": true
},
{
"id": "docs-only",
"summary": "Markdown-only PR adding a quickstart guide. The skill must keep TL;DR, Problem, Validation, How-to-test sections even though the PR is 'trivial' (per SKILL.md gotcha).",
"with_skill": {
"score": 8,
"anchors_hit": [
"tldr-present",
"problem-section",
"anchored-quote",
"validation-section",
"validation-build",
"how-to-test",
"trailer"
],
"anchors_missed": []
},
"without_skill": {
"score": -1,
"anchors_hit": [
"validation-build",
"no-trivial-skip"
],
"anchors_missed": [
"tldr-present",
"problem-section",
"anchored-quote",
"validation-section",
"how-to-test",
"trailer"
]
},
"delta_score": 9,
"delta_anchors": 6,
"passed": true
},
{
"id": "dep-bump",
"summary": "Mechanical dependency upgrade: bump click from 8.1.7 to 8.2.1; lockfile-only changes plus a 1-line type annotation tweak. Tests Trade-offs may be 1-2 bullets and mermaid is optional.",
"with_skill": {
"score": 9,
"anchors_hit": [
"tldr-present",
"tldr-short",
"problem-section",
"implementation-section",
"tradeoffs-section",
"validation-section",
"validation-real-output",
"how-to-test",
"trailer"
],
"anchors_missed": []
},
"without_skill": {
"score": -2,
"anchors_hit": [
"no-marketing-tone"
],
"anchors_missed": [
"tldr-present",
"tldr-short",
"problem-section",
"implementation-section",
"tradeoffs-section",
"validation-section",
"validation-real-output",
"how-to-test",
"trailer"
]
},
"delta_score": 11,
"delta_anchors": 9,
"passed": true
}
],
"passed": true
},
"passed": true
}
Validation summary (val split is the ship gate):

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Post-merge evals run on

Trigger evals
| Metric | Value | Gate | Met? |
|---|---|---|---|
| should_fire_correct_rate | 1.00 | >= 0.50 | yes |
| should_not_fire_correct_rate | 1.00 | <= 0.50 miss | yes |
Content evals (with_skill vs without_skill)
| Scenario | with | without | delta_anchors | passed |
|---|---|---|---|---|
| auth-refactor | 13 | 0 | +11 | yes |
| docs-only | 8 | -1 | +6 | yes |
| dep-bump | 9 | -2 | +9 | yes |
All scenarios clear the genesis "redesign or delete" gate (delta >= 1).
Raw result JSON
{
"schema_version": 1,
"skill": "pr-description-skill",
"timestamp_utc": "2026-04-27T09:07:33.079410+00:00",
"filter": "all",
"split": "val",
"triggers": {
"split": "val",
"should_fire_correct_rate": 1.0,
"should_not_fire_correct_rate": 1.0,
"should_fire_gate": {
"min": 0.5,
"met": true
},
"should_not_fire_gate": {
"max_miss_rate": 0.5,
"met": true
},
"rows": [
{
"id": "fire-06",
"split": "val",
"query": "let's open the PR -- write the body",
"expected": "fire",
"predicted": "fire",
"passed": true,
"diagnostic": {
"reason": "primary_match",
"match": "open the pr"
}
},
{
"id": "fire-07",
"split": "val",
"query": "draft a pull request description for this branch",
"expected": "fire",
"predicted": "fire",
"passed": true,
"diagnostic": {
"reason": "primary_match",
"match": "pull request description"
}
},
{
"id": "fire-08",
"split": "val",
"query": "pls write the pull request body",
"expected": "fire",
"predicted": "fire",
"passed": true,
"diagnostic": {
"reason": "primary_match",
"match": "pull request body"
}
},
{
"id": "fire-09",
"split": "val",
"query": "fill in the PR template before we ship",
"expected": "fire",
"predicted": "fire",
"passed": true,
"diagnostic": {
"reason": "primary_match",
"match": "pr template"
}
},
{
"id": "no-fire-06",
"split": "val",
"query": "summarize the diff for me",
"expected": "no_fire",
"predicted": "no_fire",
"passed": true,
"diagnostic": {
"reason": "stop_list_hit",
"match": "summarize the diff"
}
},
{
"id": "no-fire-07",
"split": "val",
"query": "draft a code review comment",
"expected": "no_fire",
"predicted": "no_fire",
"passed": true,
"diagnostic": {
"reason": "stop_list_hit",
"match": "code review comment"
}
},
{
"id": "no-fire-08",
"split": "val",
"query": "write the design doc for this branch",
"expected": "no_fire",
"predicted": "no_fire",
"passed": true,
"diagnostic": {
"reason": "stop_list_hit",
"match": "design doc"
}
},
{
"id": "no-fire-09",
"split": "val",
"query": "give me release notes for this milestone",
"expected": "no_fire",
"predicted": "no_fire",
"passed": true,
"diagnostic": {
"reason": "stop_list_hit",
"match": "release note"
}
}
],
"passed": true
},
"content": {
"delta_min_anchors": 1,
"scenarios": [
{
"id": "auth-refactor",
"summary": "Cross-cutting auth refactor: token resolution moves into a single AuthResolver; multiple call sites updated; one breaking change to a public CLI flag. PR is non-trivial and exercises every required section.",
"with_skill": {
"score": 13,
"anchors_hit": [
"tldr-present",
"tldr-short",
"problem-section",
"anchored-quote",
"approach-or-implementation",
"mermaid-block",
"tradeoffs-section",
"validation-section",
"validation-real-output",
"how-to-test",
"trailer"
],
"anchors_missed": []
},
"without_skill": {
"score": 0,
"anchors_hit": [],
"anchors_missed": [
"tldr-present",
"tldr-short",
"problem-section",
"anchored-quote",
"approach-or-implementation",
"mermaid-block",
"tradeoffs-section",
"validation-section",
"validation-real-output",
"how-to-test",
"trailer"
]
},
"delta_score": 13,
"delta_anchors": 11,
"passed": true
},
{
"id": "docs-only",
"summary": "Markdown-only PR adding a quickstart guide. The skill must keep TL;DR, Problem, Validation, How-to-test sections even though the PR is 'trivial' (per SKILL.md gotcha).",
"with_skill": {
"score": 8,
"anchors_hit": [
"tldr-present",
"problem-section",
"anchored-quote",
"validation-section",
"validation-build",
"how-to-test",
"trailer"
],
"anchors_missed": []
},
"without_skill": {
"score": -1,
"anchors_hit": [
"validation-build",
"no-trivial-skip"
],
"anchors_missed": [
"tldr-present",
"problem-section",
"anchored-quote",
"validation-section",
"how-to-test",
"trailer"
]
},
"delta_score": 9,
"delta_anchors": 6,
"passed": true
},
{
"id": "dep-bump",
"summary": "Mechanical dependency upgrade: bump click from 8.1.7 to 8.2.1; lockfile-only changes plus a 1-line type annotation tweak. Tests Trade-offs may be 1-2 bullets and mermaid is optional.",
"with_skill": {
"score": 9,
"anchors_hit": [
"tldr-present",
"tldr-short",
"problem-section",
"implementation-section",
"tradeoffs-section",
"validation-section",
"validation-real-output",
"how-to-test",
"trailer"
],
"anchors_missed": []
},
"without_skill": {
"score": -2,
"anchors_hit": [
"no-marketing-tone"
],
"anchors_missed": [
"tldr-present",
"tldr-short",
"problem-section",
"implementation-section",
"tradeoffs-section",
"validation-section",
"validation-real-output",
"how-to-test",
"trailer"
]
},
"delta_score": 11,
"delta_anchors": 9,
"passed": true
}
],
"passed": true
},
"passed": true
}
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds a deterministic, stdlib-only eval harness for pr-description-skill, intended to let maintainers validate trigger matching (should-fire/should-not-fire) and measure content “shape” deltas (with-skill vs without-skill) without requiring any LLM/API keys. The changes are mirrored into both .apm/ (authoritative sources) and .github/ (Copilot target output).
Changes:

- Introduces `scripts/run_evals.py` runner (JSON summary on stdout, diagnostics on stderr, exit codes 0/1/2).
- Adds trigger eval dataset (`triggers.json`) plus content scenarios, fixtures, and rubrics under `evals/`.
- Adds eval bundle docs and ignores generated results while keeping a tracked `results/` directory.
Summary per file
| File | Description |
|---|---|
| .github/skills/pr-description-skill/scripts/run_evals.py | Mirrored eval runner script for Copilot target |
| .github/skills/pr-description-skill/evals/evals.json | Mirrored top-level eval manifest (gates + keyword lists) |
| .github/skills/pr-description-skill/evals/triggers.json | Mirrored trigger eval items (train/val split) |
| .github/skills/pr-description-skill/evals/content/auth-refactor.json | Mirrored content scenario + rubric |
| .github/skills/pr-description-skill/evals/content/docs-only.json | Mirrored content scenario + rubric |
| .github/skills/pr-description-skill/evals/content/dep-bump.json | Mirrored content scenario + rubric |
| .github/skills/pr-description-skill/evals/fixtures/auth-refactor__with_skill.md | Mirrored with-skill fixture output |
| .github/skills/pr-description-skill/evals/fixtures/auth-refactor__without_skill.md | Mirrored without-skill fixture output |
| .github/skills/pr-description-skill/evals/fixtures/docs-only__with_skill.md | Mirrored with-skill fixture output |
| .github/skills/pr-description-skill/evals/fixtures/docs-only__without_skill.md | Mirrored without-skill fixture output |
| .github/skills/pr-description-skill/evals/fixtures/dep-bump__with_skill.md | Mirrored with-skill fixture output |
| .github/skills/pr-description-skill/evals/fixtures/dep-bump__without_skill.md | Mirrored without-skill fixture output |
| .github/skills/pr-description-skill/evals/results/.gitkeep | Mirrored tracked sentinel for results directory |
| .github/skills/pr-description-skill/evals/.gitignore | Mirrored ignore for generated result JSON files |
| .github/skills/pr-description-skill/evals/README.md | Mirrored documentation for running/scoring evals |
| .apm/skills/pr-description-skill/scripts/run_evals.py | Source eval runner script for skill |
| .apm/skills/pr-description-skill/evals/evals.json | Source top-level eval manifest (gates + keyword lists) |
| .apm/skills/pr-description-skill/evals/triggers.json | Source trigger eval items (train/val split) |
| .apm/skills/pr-description-skill/evals/content/auth-refactor.json | Source content scenario + rubric |
| .apm/skills/pr-description-skill/evals/content/docs-only.json | Source content scenario + rubric |
| .apm/skills/pr-description-skill/evals/content/dep-bump.json | Source content scenario + rubric |
| .apm/skills/pr-description-skill/evals/fixtures/auth-refactor__with_skill.md | Source with-skill fixture output |
| .apm/skills/pr-description-skill/evals/fixtures/auth-refactor__without_skill.md | Source without-skill fixture output |
| .apm/skills/pr-description-skill/evals/fixtures/docs-only__with_skill.md | Source with-skill fixture output |
| .apm/skills/pr-description-skill/evals/fixtures/docs-only__without_skill.md | Source without-skill fixture output |
| .apm/skills/pr-description-skill/evals/fixtures/dep-bump__with_skill.md | Source with-skill fixture output |
| .apm/skills/pr-description-skill/evals/fixtures/dep-bump__without_skill.md | Source without-skill fixture output |
| .apm/skills/pr-description-skill/evals/results/.gitkeep | Source tracked sentinel for results directory |
| .apm/skills/pr-description-skill/evals/.gitignore | Source ignore for generated result JSON files |
| .apm/skills/pr-description-skill/evals/README.md | Source documentation for running/scoring evals |
Copilot's findings
- Files reviewed: 28/30 changed files
- Comments generated: 16
```python
gates = manifest["gates"]["triggers"]
fire_gate_met = fire_rate >= gates["should_fire_rate_min"]
no_fire_gate_met = (
    (no_fire_rate) >= (1.0 - gates["should_not_fire_rate_max"])
)
```
The gate field name should_not_fire_rate_max is treated as a max miss/leak rate (see 1.0 - ...) and is reported as max_miss_rate in output. This is easy to misconfigure because the name reads like a maximum correctness rate. Rename the manifest key to something like should_not_fire_miss_rate_max (and update the output/README accordingly), or adjust the logic to match the current name.
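A minimal sketch of the renamed gate check, assuming `no_fire_rate` is the should-not-fire correctness rate as in the runner's output; the function and key names here are illustrative, not the runner's actual API:

```python
def check_no_fire_gate(no_fire_rate: float, gates: dict) -> bool:
    """Gate on the miss/leak rate under a key whose name matches its
    semantics: an upper bound on misses, not a correctness rate."""
    # Miss rate is the complement of the should-not-fire correctness rate.
    miss_rate = 1.0 - no_fire_rate
    return miss_rate <= gates["should_not_fire_miss_rate_max"]
```

With this naming, the manifest key, the `1.0 - ...` logic, and the reported `max_miss_rate` field all describe the same quantity.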
```python
def run_trigger_evals(
    manifest: dict[str, Any], split: str
) -> dict[str, Any]:
    triggers_path = EVALS_DIR / manifest["triggers_manifest"]
    triggers = _load_json(triggers_path)["items"]

    if split != "all":
        triggers = [t for t in triggers if t["split"] == split]
```
--split all is documented as "score both splits", but run_trigger_evals() currently aggregates train+val together into a single rate and applies the gate to that aggregate. This makes the ship gate ambiguous (it should typically be evaluated on the configured gate split, e.g. val). Consider returning per-split summaries when split=="all" and applying the gate only to manifest["gates"]["triggers"]["split"] (or disallow all for gating).
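One way the per-split gating could look, sketched with hypothetical names (`summarize_per_split` is not in the runner); the gate is applied only to the configured gate split, never to the train+val aggregate:

```python
from typing import Any

def summarize_per_split(rows: list[dict[str, Any]], gate_split: str,
                        gate_min: float) -> dict[str, Any]:
    """Group trigger rows by split; apply the ship gate only to the
    configured gate split (e.g. "val")."""
    summaries: dict[str, Any] = {}
    for split in sorted({r["split"] for r in rows}):
        subset = [r for r in rows if r["split"] == split]
        rate = sum(r["passed"] for r in subset) / len(subset)
        summaries[split] = {"rate": rate, "n": len(subset)}
    gated = summaries.get(gate_split, {"rate": 0.0})
    summaries["gate"] = {"split": gate_split, "min": gate_min,
                         "met": gated["rate"] >= gate_min}
    return summaries
```

Under `--split all` this still reports both splits, but the `met` flag is unambiguous.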
```python
def score_fixture(text: str, rubric: list[dict[str, Any]]) -> dict[str, Any]:
    score = 0
    anchors_hit: list[str] = []
    anchors_missed: list[str] = []
    for check in rubric:
        pattern = re.compile(check["pattern"])
        match = bool(pattern.search(text))
        weight = int(check["weight"])
```
score_fixture() compiles regex patterns from the scenario JSON without handling invalid patterns. A malformed pattern will raise re.error and crash with exit code 1 (stack trace) instead of emitting a structured runner error with exit code 2 like other manifest/fixture failures. Catch re.error (and possibly KeyError/TypeError for malformed rubric entries) and convert it into a logged [x] error + sys.exit(2).
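A hedged sketch of the suggested guard, assuming the runner's `[x]` stderr convention and exit code 2 for harness/config errors (the helper name is hypothetical):

```python
import re
import sys

def compile_rubric_pattern(check: dict) -> re.Pattern:
    """Compile one rubric pattern, converting malformed entries into a
    structured exit-code-2 failure instead of a raw traceback."""
    try:
        return re.compile(check["pattern"])
    except (re.error, KeyError, TypeError) as exc:
        print(f"[x] bad rubric entry {check.get('id', '?')}: {exc}",
              file=sys.stderr)
        sys.exit(2)
```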
```json
"triggers": {
  "split": "val",
  "should_fire_rate_min": 0.5,
  "should_not_fire_rate_max": 0.5
```
The triggers gate key should_not_fire_rate_max reads like a maximum correctness rate, but the runner treats it as a maximum miss/leak rate (see output field max_miss_rate). Rename this key to reflect its semantics (e.g. should_not_fire_miss_rate_max) or adjust the runner logic so the name and behavior match.
Suggested change:

```diff
-  "should_not_fire_rate_max": 0.5
+  "should_not_fire_miss_rate_max": 0.5
```
```python
with_score = score_fixture(with_text, scenario["rubric"])
without_score = score_fixture(without_text, scenario["rubric"])

delta_score = with_score["score"] - without_score["score"]
delta_anchors = len(
    set(with_score["anchors_hit"]) - set(without_score["anchors_hit"])
)
```
delta_anchors is computed from anchors_hit, but score_fixture() currently appends penalty checks (negative weights) into anchors_hit when they match. This means a worse with_skill output that triggers a penalty (e.g., marketing tone) could increase delta_anchors and still pass the content gate. Consider tracking penalty hits separately (or excluding checks with weight < 0 from anchors_hit / delta_anchors calculations).
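A minimal sketch of the suggested fix, assuming rubric checks carry `id` and `weight` fields as in the scenario JSON (the helper name is hypothetical): penalty checks with negative weight are excluded from the delta.

```python
def positive_delta_anchors(with_hits: list[str], without_hits: list[str],
                           rubric: list[dict]) -> int:
    """Count anchors gained with the skill, ignoring penalty checks
    (weight < 0) so a matched penalty can never inflate the delta."""
    positive = {c["id"] for c in rubric if int(c["weight"]) > 0}
    return len((set(with_hits) - set(without_hits)) & positive)
```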
```python
ts = summary["timestamp_utc"].replace(":", "-").replace("+00-00", "Z")
out_path = results_dir / f"{ts}.json"
out_path.write_text(json.dumps(summary, indent=2) + "\n", encoding="utf-8")
summary["result_file"] = str(out_path.relative_to(SKILL_DIR.parent.parent.parent))
```
summary["result_file"] = out_path.relative_to(...) will raise ValueError if the user passes --results-dir outside the repo/skill tree, causing the runner to crash after successfully writing results. Consider guarding this with try/except ValueError and falling back to an absolute path (or omit result_file) so --results-dir remains usable.
Suggested change:

```diff
-summary["result_file"] = str(out_path.relative_to(SKILL_DIR.parent.parent.parent))
+try:
+    summary["result_file"] = str(
+        out_path.relative_to(SKILL_DIR.parent.parent.parent)
+    )
+except ValueError:
+    summary["result_file"] = str(out_path.resolve())
```
```python
    "--manifest",
    default=str(EVALS_DIR / "evals.json"),
    help="Path to evals manifest (default: <skill>/evals/evals.json).",
)
```
The --manifest flag suggests you can point the runner at an alternate manifest, but path resolution for triggers_manifest / content_manifests is hard-coded to EVALS_DIR (derived from the script location), not relative to the provided manifest file. If a user passes a manifest path outside the default directory, the runner will still try to load triggers/content from the built-in evals/ dir. Consider resolving relative paths against Path(args.manifest).parent (and passing that base into run_trigger_evals / run_content_evals).
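A sketch of resolving sibling files against the manifest's own directory (the helper name is hypothetical, not the runner's actual API):

```python
import json
from pathlib import Path

def resolve_manifest_paths(manifest_path: str) -> tuple[dict, Path]:
    """Load the eval manifest and return the base directory that sibling
    files (triggers_manifest, content manifests) should resolve against."""
    path = Path(manifest_path)
    manifest = json.loads(path.read_text(encoding="utf-8"))
    # Resolve triggers/content paths against the manifest's parent dir,
    # not the script-derived EVALS_DIR, so --manifest works anywhere.
    return manifest, path.parent
```

Callers would then build `base_dir / manifest["triggers_manifest"]` instead of `EVALS_DIR / ...`.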
Promotes [Unreleased] to [0.10.0] - 2026-04-27. Each PR since v0.9.4 gets one 'so what' line:

- #926 Microsoft 365 Cowork target ships impl
- #790 marketplace authoring CLI (init, package add/set, build, check, outdated, doctor, publish) -- collapsed from 20+ bullets to one
- #722 marketplace plugin -> package rename + --help sectioning -- collapsed
- #980 README 'Coming from npx skills add' conversion block
- #981 docs auto-deploy on tag push (real fix for the #953 attempt)
- #985 pr-description-skill evals suite
- #984 pr-description-skill mermaid hardening
- #989 cowork sys.platform mock for Windows CI

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
add(skills): evals suite for pr-description-skill (genesis-driven)
TL;DR
Adds a self-contained `evals/` bundle and a stdlib-only runner under `.apm/skills/pr-description-skill/` so maintainers can answer two questions deterministically and without LLM keys: (a) does the SKILL.md `description:` reliably trigger on real intent and avoid near-misses, and (b) does loading the SKILL.md body measurably change the shape of an agent's PR description vs. not loading it.
All ship gates pass.
Problem (WHY)
The `pr-description-skill` shipped without an evals harness, so every change to its description or body has been an unmeasured edit. The MODULE ENTRYPOINT canonical spec calls for content evals (with-skill vs without-skill) and ~20 trigger evals (60/40 train/val) gated on the validation split — none of that existed.

The genesis discipline says: "if `with_skill` and `without_skill` produce indistinguishable outputs, the skill is not adding value." We could not answer that question.
Approach (WHAT)
Walked genesis steps 1-6 to design the bundle, persisted the
handoff packet out-of-tree (session state), then drafted the
files in step 7b. The pattern is PIPELINE / single-loop
sequential (1 lens, 1 procedure) — no fan-out warranted.
Determinism is enforced everywhere: the trigger scorer is a
documented keyword/bigram matcher, content evals are pre-recorded
fixtures scored against regex rubrics, and exit codes are
machine-checkable.
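Illustrative only: a deterministic stop-list-then-primary phrase matcher of the kind described above, using sample phrases taken from the eval rows (the real keyword lists live in `evals.json`, and the actual matcher's precedence rules may differ):

```python
def predict_trigger(query: str, primary: list[str],
                    stop_list: list[str]) -> tuple[str, dict]:
    """Sketch of a deterministic trigger scorer: stop-list phrases veto
    first, then any primary phrase match fires."""
    q = query.lower()
    for phrase in stop_list:
        if phrase in q:
            return "no_fire", {"reason": "stop_list_hit", "match": phrase}
    for phrase in primary:
        if phrase in q:
            return "fire", {"reason": "primary_match", "match": phrase}
    return "no_fire", {"reason": "no_match", "match": None}
```

Because the scorer is pure string matching, the same inputs always yield the same verdicts, which is what makes the gate CI-friendly.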
Implementation (HOW)
- `.apm/skills/pr-description-skill/scripts/run_evals.py`: stdlib-only runner; `--filter`, `--split`, `--quiet`, `--no-write`, `--help`. JSON on stdout, diagnostics on stderr, exit codes 0/1/2.
- `evals/evals.json`: top-level manifest + ship gates.
- `evals/triggers.json`: should-fire queries drawn from the `description:` field; should-not-fire queries pick adjacent intents (commit msg, release notes, changelog, issue, review, design doc).
- `evals/content/{auth-refactor,docs-only,dep-bump}.json`: 3 PR-shape scenarios, each with a regex rubric.
- `evals/fixtures/*__with_skill.md` and `*__without_skill.md`: pre-recorded sample outputs for the two conditions.
- `evals/results/.gitkeep` + `evals/.gitignore`: track the dir, ignore generated result JSON.
- `evals/README.md`: how to run, scoring rules, gates, regeneration policy.
- `.github/skills/pr-description-skill/...`: mirror regenerated via `apm install --target copilot`.

Diagram
Single-thread pipeline. The runner reads inputs, scores, emits.
Trade-offs
- The trigger scorer is a keyword/bigram matcher, not a live dispatcher. It is fast, key-free, and CI-friendly; the README is explicit about the limitation and the manifest schema accommodates a future `--llm` flag.
- Fixtures are pre-recorded, trading freshness for determinism. Maintainers regenerate them when SKILL.md materially changes (documented in the README).
- The eval bundle adds weight inside the skill's distribution surface; a user-facing distribution-weight regression could push it out to a maintainer-only `tests/skill-evals/` directory. Flagged in the genesis handoff packet as LOW severity, not blocking today.
- The runner is a maintainer script, not a dispatched module — no trigger description, no agentic reasoning, no need to make it a sibling SKILL.
Benefits
- Deterministic, key-free ship gates (trigger gate + per-scenario content delta gate).
Validation
Ran the runner from the worktree root after every author step.
Final results from this commit:
- `auth-refactor`: delta_anchors = 11
- `docs-only`: delta_anchors = 6
- `dep-bump`: delta_anchors = 9

Targeted unit tests: `uv run pytest tests/unit -x -k "skill or pr_description"` passes (320 tests).
Raw result JSON inline below.
Full evals/results JSON (val split, all scenarios): identical to the raw result JSON shown at the top of this thread (timestamp 2026-04-27T09:02:47Z).

Genesis findings on the skill itself
The genesis handoff packet recorded one LOW-severity finding:
the eval bundle adds maintainer-only weight inside the skill's
distribution surface (see Step 3.5 of the packet). Not blocking
today; revisit if user-visible install-size regresses.
No findings of higher severity. The content-eval delta is
positive on every scenario (smallest delta = 6 anchors), so
the skill IS adding measurable value over the no-skill baseline
on every shape we tested. Per the genesis discipline, this
clears the "redesign or delete" gate.
How to test
- `python .apm/skills/pr-description-skill/scripts/run_evals.py` — expect overall `passed: true` and exit code 0.
- `python .apm/skills/pr-description-skill/scripts/run_evals.py --filter triggers --split all` — expect 18/18 trigger evals correct.
- `python .apm/skills/pr-description-skill/scripts/run_evals.py --help` — expect a documented argparse help block.
- `uv run pytest tests/unit -x -k "skill or pr_description"` — expect 320 passing.
- Read `.apm/skills/pr-description-skill/evals/README.md` and confirm the scoring rules and regeneration policy match expectations.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>