Add evals suite to pr-description-skill (genesis-driven)#985

Merged
danielmeppiel merged 1 commit into main from feat/pr-description-skill-evals
Apr 27, 2026

Conversation

@danielmeppiel
Collaborator

add(skills): evals suite for pr-description-skill (genesis-driven)

TL;DR

Adds a self-contained evals/ bundle and a stdlib-only runner under
.apm/skills/pr-description-skill/ so maintainers can answer two
questions deterministically and without LLM keys: (a) does the
SKILL.md description: reliably trigger on real intent and avoid
near-misses, and (b) does loading the SKILL.md body measurably
change the shape of an agent's PR description vs. not loading it.
All ship gates pass.

Problem (WHY)

The pr-description-skill shipped without an evals harness, so
every change to its description or body has been an unmeasured
edit. The MODULE ENTRYPOINT canonical spec calls for content evals
(with-skill vs without-skill) and ~20 trigger evals (60/40
train/val) gated on the validation split — none of that existed.
The genesis discipline says: "if with_skill and without_skill
produce indistinguishable outputs, the skill is not adding value."
We could not answer that question.

Approach (WHAT)

Walked genesis steps 1-6 to design the bundle, persisted the
handoff packet out-of-tree (session state), then drafted the
files in step 7b. The pattern is PIPELINE / single-loop sequential (1 lens, 1 procedure); no fan-out was warranted.
Determinism is enforced everywhere: the trigger scorer is a
documented keyword/bigram matcher, content evals are pre-recorded
fixtures scored against regex rubrics, and exit codes are
machine-checkable.
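As a sketch of what such a deterministic trigger scorer could look like (the phrase lists and function name here are illustrative, not the shipped manifest contents; the sketch assumes stop phrases take precedence over primary triggers, matching the `stop_list_hit` diagnostics in the results):

```python
import re

# Illustrative phrase lists; the real ones live in evals/evals.json.
PRIMARY_PHRASES = [
    "pull request description", "pull request body", "pr template", "open the pr",
]
STOP_PHRASES = [
    "release note", "design doc", "code review comment", "summarize the diff",
]

def predict(query: str) -> tuple[str, dict]:
    """Deterministic fire/no_fire prediction with a diagnostic; no LLM involved."""
    q = re.sub(r"\s+", " ", query.lower())
    for phrase in STOP_PHRASES:       # stop list wins: adjacent intents never fire
        if phrase in q:
            return "no_fire", {"reason": "stop_list_hit", "match": phrase}
    for phrase in PRIMARY_PHRASES:    # then look for a primary trigger phrase
        if phrase in q:
            return "fire", {"reason": "primary_match", "match": phrase}
    return "no_fire", {"reason": "no_match", "match": None}
```

Because the matcher is pure string lookup, the same query always yields the same verdict and the same diagnostic, which is what makes the gate machine-checkable in CI.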

Implementation (HOW)

| Path | Purpose |
| --- | --- |
| `.apm/skills/pr-description-skill/scripts/run_evals.py` | Stdlib-only runner; `--filter`, `--split`, `--quiet`, `--no-write`, `--help`. JSON on stdout, diagnostics on stderr, exit codes 0/1/2. |
| `evals/evals.json` | Top-level manifest: gates, keyword lists, content scenario list. |
| `evals/triggers.json` | 18 trigger items (9 should-fire + 9 should-not-fire), ~60/40 train/val. Should-fire queries authored verbatim against the skill's `description:` field; should-not-fire queries pick adjacent intents (commit msg, release notes, changelog, issue, review, design doc). |
| `evals/content/{auth-refactor,docs-only,dep-bump}.json` | Three PR-shape scenarios + regex rubrics. Shapes chosen for diversity: cross-cutting refactor, docs-only, mechanical dep bump. |
| `evals/fixtures/*__with_skill.md` and `*__without_skill.md` | Pre-recorded representative outputs for each scenario × condition. Deterministic, no API key needed. README documents the regeneration policy. |
| `evals/results/.gitkeep` + `evals/.gitignore` | Track the directory; ignore generated result JSON. |
| `evals/README.md` | How to run, scoring rules, gates, how to add evals, regeneration policy. |
| `.github/skills/pr-description-skill/...` | Mirror regenerated by `apm install --target copilot`. |
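A hedged sketch of how a regex-rubric scorer along these lines could work (anchor ids mirror the result JSON; the patterns and the penalty-free scoring are simplified illustrations, since the real rubrics also carry penalty anchors, which is why without-skill scores can go negative):

```python
import re

# Illustrative rubric: anchor id -> regex applied to a fixture's markdown.
RUBRIC = {
    "tldr-present": r"(?im)^#{1,3}\s*TL;DR\b",
    "problem-section": r"(?im)^#{1,3}\s*Problem\b",
    "mermaid-block": r"(?s)```mermaid.*?```",
    "trailer": r"(?m)^Co-authored-by:",
}

def score(fixture_md: str) -> dict:
    """Count which anchors a pre-recorded fixture hits."""
    hit = [a for a, pat in RUBRIC.items() if re.search(pat, fixture_md)]
    missed = [a for a in RUBRIC if a not in hit]
    return {"score": len(hit), "anchors_hit": hit, "anchors_missed": missed}

def delta(with_skill_md: str, without_skill_md: str) -> int:
    # The ship gate requires this delta to be positive on every scenario.
    return score(with_skill_md)["score"] - score(without_skill_md)["score"]
```

Scoring pre-recorded fixtures against fixed regexes is what keeps the content evals key-free: the only moving part between runs is the fixture text itself.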

Diagram

Single-thread pipeline: the runner reads inputs, scores them, and emits results.

flowchart LR
  M["evals/evals.json"] --> R["run_evals.py"]
  T["evals/triggers.json"] --> R
  C["evals/content/*.json"] --> R
  F["evals/fixtures/*.md"] --> R
  S["SKILL.md (read-only)"] --> R
  R --> O["stdout JSON"]
  R --> E["stderr diagnostics"]
  R --> RES["evals/results/<UTC-ISO>.json"]

Trade-offs

  • Deterministic dispatcher approximation, not a real LLM:
    the trigger scorer is a keyword/bigram matcher, not a live
    dispatcher. It is fast, key-free, and CI-friendly; the README
    is explicit about the limitation and the manifest schema
    accommodates a future --llm flag.
  • Pre-recorded fixtures, not regenerated per run: trades
    freshness for determinism. Maintainers regenerate them when
    SKILL.md materially changes (documented in the README).
  • Eval bundle ships under the skill's source tree: a future
    user-facing distribution-weight regression could push it out
    to a maintainer-only tests/skill-evals/ directory; flagged
    in the genesis handoff packet as LOW severity, not blocking
    today.
  • Rejected: a separate eval-runner skill primitive. The
    runner is a maintainer script, not a dispatched module — no
    trigger description, no agentic reasoning, no need to make it
    a sibling SKILL.

Benefits

  1. The skill now ships with measurable ship gates (val-split
    trigger gate + per-scenario content delta gate).
  2. Future SKILL.md edits get a regression check for free.
  3. Adding a new scenario or trigger query is one JSON edit.
  4. Zero new dependencies (Python stdlib).

Validation

Ran the runner from the worktree root after every author step.
Final results from this commit:

$ python .apm/skills/pr-description-skill/scripts/run_evals.py
| Family | Metric | Value | Gate | Met |
| --- | --- | --- | --- | --- |
| triggers (val split) | should-fire correctness | 1.00 | >= 0.5 | yes |
| triggers (val split) | should-not-fire correctness | 1.00 | >= 0.5 | yes |
| content | auth-refactor delta_anchors | 11 | >= 1 | yes |
| content | docs-only delta_anchors | 6 | >= 1 | yes |
| content | dep-bump delta_anchors | 9 | >= 1 | yes |
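The gates in that run reduce to simple arithmetic over the per-row verdicts; a sketch of the aggregation (field names follow the result JSON; the 0.5 thresholds are the manifest defaults shown above):

```python
def triggers_gate(rows: list[dict], fire_rate_min: float = 0.5,
                  miss_rate_max: float = 0.5) -> bool:
    """Aggregate per-row trigger verdicts into the two val-split gates."""
    fire = [r for r in rows if r["expected"] == "fire"]
    no_fire = [r for r in rows if r["expected"] == "no_fire"]
    fire_rate = sum(r["passed"] for r in fire) / len(fire)
    # Miss rate: fraction of should-not-fire queries that leaked through.
    miss_rate = sum(not r["passed"] for r in no_fire) / len(no_fire)
    return fire_rate >= fire_rate_min and miss_rate <= miss_rate_max
```

With 4/4 should-fire and 4/4 should-not-fire rows correct on the val split, both gates clear with room to spare.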

Targeted unit tests:

$ uv run pytest tests/unit -x -k "skill or pr_description"
320 passed, 5254 deselected in 11.27s

Raw result JSON inline below.

Full evals/results JSON (val split, all scenarios)
{
  "schema_version": 1,
  "skill": "pr-description-skill",
  "timestamp_utc": "2026-04-27T09:02:47.191634+00:00",
  "filter": "all",
  "split": "val",
  "triggers": {
    "split": "val",
    "should_fire_correct_rate": 1.0,
    "should_not_fire_correct_rate": 1.0,
    "should_fire_gate": {
      "min": 0.5,
      "met": true
    },
    "should_not_fire_gate": {
      "max_miss_rate": 0.5,
      "met": true
    },
    "rows": [
      {
        "id": "fire-06",
        "split": "val",
        "query": "let's open the PR -- write the body",
        "expected": "fire",
        "predicted": "fire",
        "passed": true,
        "diagnostic": {
          "reason": "primary_match",
          "match": "open the pr"
        }
      },
      {
        "id": "fire-07",
        "split": "val",
        "query": "draft a pull request description for this branch",
        "expected": "fire",
        "predicted": "fire",
        "passed": true,
        "diagnostic": {
          "reason": "primary_match",
          "match": "pull request description"
        }
      },
      {
        "id": "fire-08",
        "split": "val",
        "query": "pls write the pull request body",
        "expected": "fire",
        "predicted": "fire",
        "passed": true,
        "diagnostic": {
          "reason": "primary_match",
          "match": "pull request body"
        }
      },
      {
        "id": "fire-09",
        "split": "val",
        "query": "fill in the PR template before we ship",
        "expected": "fire",
        "predicted": "fire",
        "passed": true,
        "diagnostic": {
          "reason": "primary_match",
          "match": "pr template"
        }
      },
      {
        "id": "no-fire-06",
        "split": "val",
        "query": "summarize the diff for me",
        "expected": "no_fire",
        "predicted": "no_fire",
        "passed": true,
        "diagnostic": {
          "reason": "stop_list_hit",
          "match": "summarize the diff"
        }
      },
      {
        "id": "no-fire-07",
        "split": "val",
        "query": "draft a code review comment",
        "expected": "no_fire",
        "predicted": "no_fire",
        "passed": true,
        "diagnostic": {
          "reason": "stop_list_hit",
          "match": "code review comment"
        }
      },
      {
        "id": "no-fire-08",
        "split": "val",
        "query": "write the design doc for this branch",
        "expected": "no_fire",
        "predicted": "no_fire",
        "passed": true,
        "diagnostic": {
          "reason": "stop_list_hit",
          "match": "design doc"
        }
      },
      {
        "id": "no-fire-09",
        "split": "val",
        "query": "give me release notes for this milestone",
        "expected": "no_fire",
        "predicted": "no_fire",
        "passed": true,
        "diagnostic": {
          "reason": "stop_list_hit",
          "match": "release note"
        }
      }
    ],
    "passed": true
  },
  "content": {
    "delta_min_anchors": 1,
    "scenarios": [
      {
        "id": "auth-refactor",
        "summary": "Cross-cutting auth refactor: token resolution moves into a single AuthResolver; multiple call sites updated; one breaking change to a public CLI flag. PR is non-trivial and exercises every required section.",
        "with_skill": {
          "score": 13,
          "anchors_hit": [
            "tldr-present",
            "tldr-short",
            "problem-section",
            "anchored-quote",
            "approach-or-implementation",
            "mermaid-block",
            "tradeoffs-section",
            "validation-section",
            "validation-real-output",
            "how-to-test",
            "trailer"
          ],
          "anchors_missed": []
        },
        "without_skill": {
          "score": 0,
          "anchors_hit": [],
          "anchors_missed": [
            "tldr-present",
            "tldr-short",
            "problem-section",
            "anchored-quote",
            "approach-or-implementation",
            "mermaid-block",
            "tradeoffs-section",
            "validation-section",
            "validation-real-output",
            "how-to-test",
            "trailer"
          ]
        },
        "delta_score": 13,
        "delta_anchors": 11,
        "passed": true
      },
      {
        "id": "docs-only",
        "summary": "Markdown-only PR adding a quickstart guide. The skill must keep TL;DR, Problem, Validation, How-to-test sections even though the PR is 'trivial' (per SKILL.md gotcha).",
        "with_skill": {
          "score": 8,
          "anchors_hit": [
            "tldr-present",
            "problem-section",
            "anchored-quote",
            "validation-section",
            "validation-build",
            "how-to-test",
            "trailer"
          ],
          "anchors_missed": []
        },
        "without_skill": {
          "score": -1,
          "anchors_hit": [
            "validation-build",
            "no-trivial-skip"
          ],
          "anchors_missed": [
            "tldr-present",
            "problem-section",
            "anchored-quote",
            "validation-section",
            "how-to-test",
            "trailer"
          ]
        },
        "delta_score": 9,
        "delta_anchors": 6,
        "passed": true
      },
      {
        "id": "dep-bump",
        "summary": "Mechanical dependency upgrade: bump click from 8.1.7 to 8.2.1; lockfile-only changes plus a 1-line type annotation tweak. Tests Trade-offs may be 1-2 bullets and mermaid is optional.",
        "with_skill": {
          "score": 9,
          "anchors_hit": [
            "tldr-present",
            "tldr-short",
            "problem-section",
            "implementation-section",
            "tradeoffs-section",
            "validation-section",
            "validation-real-output",
            "how-to-test",
            "trailer"
          ],
          "anchors_missed": []
        },
        "without_skill": {
          "score": -2,
          "anchors_hit": [
            "no-marketing-tone"
          ],
          "anchors_missed": [
            "tldr-present",
            "tldr-short",
            "problem-section",
            "implementation-section",
            "tradeoffs-section",
            "validation-section",
            "validation-real-output",
            "how-to-test",
            "trailer"
          ]
        },
        "delta_score": 11,
        "delta_anchors": 9,
        "passed": true
      }
    ],
    "passed": true
  },
  "passed": true
}

Genesis findings on the skill itself

The genesis handoff packet recorded one LOW-severity finding:
the eval bundle adds maintainer-only weight inside the skill's
distribution surface (see Step 3.5 of the packet). Not blocking
today; revisit if user-visible install-size regresses.

No findings of higher severity. The content-eval delta is
positive on every scenario (smallest delta = 6 anchors), so
the skill IS adding measurable value over the no-skill baseline
on every shape we tested. Per the genesis discipline, this
clears the "redesign or delete" gate.

How to test

  • python .apm/skills/pr-description-skill/scripts/run_evals.py
    — expect overall passed: true and exit code 0.
  • python .apm/skills/pr-description-skill/scripts/run_evals.py --filter triggers --split all
    — expect 18/18 trigger evals correct.
  • python .apm/skills/pr-description-skill/scripts/run_evals.py --help
    — expect a documented argparse help block.
  • uv run pytest tests/unit -x -k "skill or pr_description"
    — expect 320 passing.
  • Inspect .apm/skills/pr-description-skill/evals/README.md
    and confirm the scoring rules and regeneration policy
    match expectations.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds a self-contained evals/ bundle and a non-interactive runner
under .apm/skills/pr-description-skill/. Mirror regenerated via
apm install --target copilot.

- evals/evals.json: top-level manifest + ship gates
- evals/triggers.json: 18 trigger items (9 fire / 9 no-fire),
  60/40 train/val split
- evals/content/*.json: 3 PR-shape scenarios (cross-cutting
  refactor, docs-only, mechanical dep bump) each with regex
  rubric and with_skill/without_skill fixtures
- evals/fixtures/*.md: pre-recorded sample outputs for the two
  conditions; deterministic, no LLM API required at runtime
- scripts/run_evals.py: stdlib-only runner; --filter, --split,
  --quiet, --no-write, --help; JSON on stdout, diagnostics on
  stderr, exit codes 0/1/2
- evals/results/.gitkeep + evals/.gitignore: track the dir,
  ignore generated result JSON
- evals/README.md: how to run, scoring rules, gates, how to
  add evals, regeneration policy

Genesis handoff packet persisted out-of-tree at the session
state dir; design ends at step 6, this commit is step 7b output.

Validation (val split):
- triggers should-fire correctness = 1.0 (gate >= 0.5)
- triggers should-not-fire correctness = 1.0 (gate >= 0.5)
- content delta_anchors per scenario = 11, 6, 9 (gate >= 1)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 27, 2026 09:03

@danielmeppiel danielmeppiel merged commit ee29b0b into main Apr 27, 2026
17 checks passed
@danielmeppiel danielmeppiel deleted the feat/pr-description-skill-evals branch April 27, 2026 09:05
@danielmeppiel
Collaborator Author

Post-merge evals run on main

Ran the evals harness on main after PR #985 merged. Confirms the suite works on the as-merged tree.

$ uv run python .apm/skills/pr-description-skill/scripts/run_evals.py

Result: PASSED (file: 2026-04-27T09-07-33.079410Z.json)

Trigger evals (val split = ship gate)

| Metric | Value | Gate | Met? |
| --- | --- | --- | --- |
| should_fire_correct_rate | 1.00 | >= 0.50 | yes |
| should_not_fire_correct_rate | 1.00 | miss rate <= 0.50 | yes |

Content evals (with_skill vs without_skill)

| Scenario | with | without | delta_anchors | passed |
| --- | --- | --- | --- | --- |
| auth-refactor | 13 | 0 | +11 | yes |
| docs-only | 8 | -1 | +6 | yes |
| dep-bump | 9 | -2 | +9 | yes |

All scenarios clear the genesis "redesign or delete" gate (delta >= 1).

Raw result JSON
{
  "schema_version": 1,
  "skill": "pr-description-skill",
  "timestamp_utc": "2026-04-27T09:07:33.079410+00:00",
  "filter": "all",
  "split": "val",
  "triggers": {
    "split": "val",
    "should_fire_correct_rate": 1.0,
    "should_not_fire_correct_rate": 1.0,
    "should_fire_gate": {
      "min": 0.5,
      "met": true
    },
    "should_not_fire_gate": {
      "max_miss_rate": 0.5,
      "met": true
    },
    "rows": [
      {
        "id": "fire-06",
        "split": "val",
        "query": "let's open the PR -- write the body",
        "expected": "fire",
        "predicted": "fire",
        "passed": true,
        "diagnostic": {
          "reason": "primary_match",
          "match": "open the pr"
        }
      },
      {
        "id": "fire-07",
        "split": "val",
        "query": "draft a pull request description for this branch",
        "expected": "fire",
        "predicted": "fire",
        "passed": true,
        "diagnostic": {
          "reason": "primary_match",
          "match": "pull request description"
        }
      },
      {
        "id": "fire-08",
        "split": "val",
        "query": "pls write the pull request body",
        "expected": "fire",
        "predicted": "fire",
        "passed": true,
        "diagnostic": {
          "reason": "primary_match",
          "match": "pull request body"
        }
      },
      {
        "id": "fire-09",
        "split": "val",
        "query": "fill in the PR template before we ship",
        "expected": "fire",
        "predicted": "fire",
        "passed": true,
        "diagnostic": {
          "reason": "primary_match",
          "match": "pr template"
        }
      },
      {
        "id": "no-fire-06",
        "split": "val",
        "query": "summarize the diff for me",
        "expected": "no_fire",
        "predicted": "no_fire",
        "passed": true,
        "diagnostic": {
          "reason": "stop_list_hit",
          "match": "summarize the diff"
        }
      },
      {
        "id": "no-fire-07",
        "split": "val",
        "query": "draft a code review comment",
        "expected": "no_fire",
        "predicted": "no_fire",
        "passed": true,
        "diagnostic": {
          "reason": "stop_list_hit",
          "match": "code review comment"
        }
      },
      {
        "id": "no-fire-08",
        "split": "val",
        "query": "write the design doc for this branch",
        "expected": "no_fire",
        "predicted": "no_fire",
        "passed": true,
        "diagnostic": {
          "reason": "stop_list_hit",
          "match": "design doc"
        }
      },
      {
        "id": "no-fire-09",
        "split": "val",
        "query": "give me release notes for this milestone",
        "expected": "no_fire",
        "predicted": "no_fire",
        "passed": true,
        "diagnostic": {
          "reason": "stop_list_hit",
          "match": "release note"
        }
      }
    ],
    "passed": true
  },
  "content": {
    "delta_min_anchors": 1,
    "scenarios": [
      {
        "id": "auth-refactor",
        "summary": "Cross-cutting auth refactor: token resolution moves into a single AuthResolver; multiple call sites updated; one breaking change to a public CLI flag. PR is non-trivial and exercises every required section.",
        "with_skill": {
          "score": 13,
          "anchors_hit": [
            "tldr-present",
            "tldr-short",
            "problem-section",
            "anchored-quote",
            "approach-or-implementation",
            "mermaid-block",
            "tradeoffs-section",
            "validation-section",
            "validation-real-output",
            "how-to-test",
            "trailer"
          ],
          "anchors_missed": []
        },
        "without_skill": {
          "score": 0,
          "anchors_hit": [],
          "anchors_missed": [
            "tldr-present",
            "tldr-short",
            "problem-section",
            "anchored-quote",
            "approach-or-implementation",
            "mermaid-block",
            "tradeoffs-section",
            "validation-section",
            "validation-real-output",
            "how-to-test",
            "trailer"
          ]
        },
        "delta_score": 13,
        "delta_anchors": 11,
        "passed": true
      },
      {
        "id": "docs-only",
        "summary": "Markdown-only PR adding a quickstart guide. The skill must keep TL;DR, Problem, Validation, How-to-test sections even though the PR is 'trivial' (per SKILL.md gotcha).",
        "with_skill": {
          "score": 8,
          "anchors_hit": [
            "tldr-present",
            "problem-section",
            "anchored-quote",
            "validation-section",
            "validation-build",
            "how-to-test",
            "trailer"
          ],
          "anchors_missed": []
        },
        "without_skill": {
          "score": -1,
          "anchors_hit": [
            "validation-build",
            "no-trivial-skip"
          ],
          "anchors_missed": [
            "tldr-present",
            "problem-section",
            "anchored-quote",
            "validation-section",
            "how-to-test",
            "trailer"
          ]
        },
        "delta_score": 9,
        "delta_anchors": 6,
        "passed": true
      },
      {
        "id": "dep-bump",
        "summary": "Mechanical dependency upgrade: bump click from 8.1.7 to 8.2.1; lockfile-only changes plus a 1-line type annotation tweak. Tests Trade-offs may be 1-2 bullets and mermaid is optional.",
        "with_skill": {
          "score": 9,
          "anchors_hit": [
            "tldr-present",
            "tldr-short",
            "problem-section",
            "implementation-section",
            "tradeoffs-section",
            "validation-section",
            "validation-real-output",
            "how-to-test",
            "trailer"
          ],
          "anchors_missed": []
        },
        "without_skill": {
          "score": -2,
          "anchors_hit": [
            "no-marketing-tone"
          ],
          "anchors_missed": [
            "tldr-present",
            "tldr-short",
            "problem-section",
            "implementation-section",
            "tradeoffs-section",
            "validation-section",
            "validation-real-output",
            "how-to-test",
            "trailer"
          ]
        },
        "delta_score": 11,
        "delta_anchors": 9,
        "passed": true
      }
    ],
    "passed": true
  },
  "passed": true
}

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Contributor

Copilot AI left a comment


Pull request overview

Adds a deterministic, stdlib-only eval harness for pr-description-skill, intended to let maintainers validate trigger matching (should-fire/should-not-fire) and measure content “shape” deltas (with-skill vs without-skill) without requiring any LLM/API keys. The changes are mirrored into both .apm/ (authoritative sources) and .github/ (Copilot target output).

Changes:

  • Introduces scripts/run_evals.py runner (JSON summary on stdout, diagnostics on stderr, exit codes 0/1/2).
  • Adds trigger eval dataset (triggers.json) plus content scenarios, fixtures, and rubrics under evals/.
  • Adds eval bundle docs and ignores generated results while keeping a tracked results/ directory.
Summary per file
| File | Description |
| --- | --- |
| `.github/skills/pr-description-skill/scripts/run_evals.py` | Mirrored eval runner script for Copilot target |
| `.github/skills/pr-description-skill/evals/evals.json` | Mirrored top-level eval manifest (gates + keyword lists) |
| `.github/skills/pr-description-skill/evals/triggers.json` | Mirrored trigger eval items (train/val split) |
| `.github/skills/pr-description-skill/evals/content/auth-refactor.json` | Mirrored content scenario + rubric |
| `.github/skills/pr-description-skill/evals/content/docs-only.json` | Mirrored content scenario + rubric |
| `.github/skills/pr-description-skill/evals/content/dep-bump.json` | Mirrored content scenario + rubric |
| `.github/skills/pr-description-skill/evals/fixtures/auth-refactor__with_skill.md` | Mirrored with-skill fixture output |
| `.github/skills/pr-description-skill/evals/fixtures/auth-refactor__without_skill.md` | Mirrored without-skill fixture output |
| `.github/skills/pr-description-skill/evals/fixtures/docs-only__with_skill.md` | Mirrored with-skill fixture output |
| `.github/skills/pr-description-skill/evals/fixtures/docs-only__without_skill.md` | Mirrored without-skill fixture output |
| `.github/skills/pr-description-skill/evals/fixtures/dep-bump__with_skill.md` | Mirrored with-skill fixture output |
| `.github/skills/pr-description-skill/evals/fixtures/dep-bump__without_skill.md` | Mirrored without-skill fixture output |
| `.github/skills/pr-description-skill/evals/results/.gitkeep` | Mirrored tracked sentinel for results directory |
| `.github/skills/pr-description-skill/evals/.gitignore` | Mirrored ignore for generated result JSON files |
| `.github/skills/pr-description-skill/evals/README.md` | Mirrored documentation for running/scoring evals |
| `.apm/skills/pr-description-skill/scripts/run_evals.py` | Source eval runner script for skill |
| `.apm/skills/pr-description-skill/evals/evals.json` | Source top-level eval manifest (gates + keyword lists) |
| `.apm/skills/pr-description-skill/evals/triggers.json` | Source trigger eval items (train/val split) |
| `.apm/skills/pr-description-skill/evals/content/auth-refactor.json` | Source content scenario + rubric |
| `.apm/skills/pr-description-skill/evals/content/docs-only.json` | Source content scenario + rubric |
| `.apm/skills/pr-description-skill/evals/content/dep-bump.json` | Source content scenario + rubric |
| `.apm/skills/pr-description-skill/evals/fixtures/auth-refactor__with_skill.md` | Source with-skill fixture output |
| `.apm/skills/pr-description-skill/evals/fixtures/auth-refactor__without_skill.md` | Source without-skill fixture output |
| `.apm/skills/pr-description-skill/evals/fixtures/docs-only__with_skill.md` | Source with-skill fixture output |
| `.apm/skills/pr-description-skill/evals/fixtures/docs-only__without_skill.md` | Source without-skill fixture output |
| `.apm/skills/pr-description-skill/evals/fixtures/dep-bump__with_skill.md` | Source with-skill fixture output |
| `.apm/skills/pr-description-skill/evals/fixtures/dep-bump__without_skill.md` | Source without-skill fixture output |
| `.apm/skills/pr-description-skill/evals/results/.gitkeep` | Source tracked sentinel for results directory |
| `.apm/skills/pr-description-skill/evals/.gitignore` | Source ignore for generated result JSON files |
| `.apm/skills/pr-description-skill/evals/README.md` | Source documentation for running/scoring evals |

Copilot's findings

  • Files reviewed: 28/30 changed files
  • Comments generated: 16

Comment on lines +141 to +145
gates = manifest["gates"]["triggers"]
fire_gate_met = fire_rate >= gates["should_fire_rate_min"]
no_fire_gate_met = (
    (no_fire_rate) >= (1.0 - gates["should_not_fire_rate_max"])
)

Copilot AI Apr 27, 2026

The gate field name should_not_fire_rate_max is treated as a max miss/leak rate (see 1.0 - ...) and is reported as max_miss_rate in output. This is easy to misconfigure because the name reads like a maximum correctness rate. Rename the manifest key to something like should_not_fire_miss_rate_max (and update the output/README accordingly), or adjust the logic to match the current name.
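A minimal sketch of the renamed-key option, assuming a helper shaped like the runner's gate check (the function name and the `should_not_fire_miss_rate_max` key are illustrative, matching the suggestion rather than the current code):

```python
# Sketch: gate check where the key name states its semantics --
# a cap on the miss/leak rate for should-not-fire items.
def check_trigger_gates(
    fire_rate: float, no_fire_rate: float, gates: dict
) -> dict:
    # fire_rate: fraction of should-fire items that fired (higher is better).
    # no_fire_rate: fraction of should-not-fire items that stayed silent.
    miss_rate = 1.0 - no_fire_rate  # fraction that leaked (fired when they shouldn't)
    return {
        "fire_gate_met": fire_rate >= gates["should_fire_rate_min"],
        "no_fire_gate_met": miss_rate <= gates["should_not_fire_miss_rate_max"],
        "max_miss_rate": gates["should_not_fire_miss_rate_max"],
    }
```

With this shape, `1.0 - …` disappears from the comparison and the manifest key reads the same way the runner reports it.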

Comment on lines +101 to +109
def run_trigger_evals(
    manifest: dict[str, Any], split: str
) -> dict[str, Any]:
    triggers_path = EVALS_DIR / manifest["triggers_manifest"]
    triggers = _load_json(triggers_path)["items"]

    if split != "all":
        triggers = [t for t in triggers if t["split"] == split]


Copilot AI Apr 27, 2026

--split all is documented as "score both splits", but run_trigger_evals() currently aggregates train+val together into a single rate and applies the gate to that aggregate. This makes the ship gate ambiguous (it should typically be evaluated on the configured gate split, e.g. val). Consider returning per-split summaries when split=="all" and applying the gate only to manifest["gates"]["triggers"]["split"] (or disallow all for gating).
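A minimal sketch of the per-split suggestion (the helper name and item fields `split`, `fired`, `should_fire` are illustrative; the real runner's item schema may differ):

```python
# Sketch: score each split separately and expose the rate for the
# manifest-configured gate split, so "--split all" stays informational
# instead of gating on a train+val aggregate.
def summarize_by_split(items: list[dict], gate_split: str) -> dict:
    splits: dict[str, dict] = {}
    for item in items:
        s = splits.setdefault(item["split"], {"total": 0, "correct": 0})
        s["total"] += 1
        s["correct"] += int(item["fired"] == item["should_fire"])
    for s in splits.values():
        s["rate"] = s["correct"] / s["total"]
    # Only the configured split's rate feeds the ship gate.
    return {"splits": splits, "gate_rate": splits[gate_split]["rate"]}
```

The gate comparison would then use `gate_rate` while the per-split breakdown still appears in the JSON summary.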

Comment on lines +168 to +175
def score_fixture(text: str, rubric: list[dict[str, Any]]) -> dict[str, Any]:
    score = 0
    anchors_hit: list[str] = []
    anchors_missed: list[str] = []
    for check in rubric:
        pattern = re.compile(check["pattern"])
        match = bool(pattern.search(text))
        weight = int(check["weight"])

Copilot AI Apr 27, 2026

score_fixture() compiles regex patterns from the scenario JSON without handling invalid patterns. A malformed pattern will raise re.error and crash with exit code 1 (stack trace) instead of emitting a structured runner error with exit code 2 like other manifest/fixture failures. Catch re.error (and possibly KeyError/TypeError for malformed rubric entries) and convert it into a logged [x] error + sys.exit(2).
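A minimal sketch of the suggested guard, assuming rubric entries carry `pattern` and `weight` keys as in the snippet above (the helper name is illustrative):

```python
import re
import sys

# Sketch: compile rubric patterns up front and turn a malformed entry
# into a structured runner error (exit code 2) instead of letting
# re.error escape as an uncaught traceback (exit code 1).
def compile_rubric(rubric: list[dict]) -> list[tuple[re.Pattern, int]]:
    compiled = []
    for check in rubric:
        try:
            pattern = re.compile(check["pattern"])
            weight = int(check["weight"])
        except (re.error, KeyError, TypeError, ValueError) as exc:
            print(f"[x] malformed rubric entry {check!r}: {exc}", file=sys.stderr)
            sys.exit(2)
        compiled.append((pattern, weight))
    return compiled
```

`score_fixture()` would then iterate the pre-compiled list, keeping its scoring loop free of error handling.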

"triggers": {
  "split": "val",
  "should_fire_rate_min": 0.5,
  "should_not_fire_rate_max": 0.5

Copilot AI Apr 27, 2026

The triggers gate key should_not_fire_rate_max reads like a maximum correctness rate, but the runner treats it as a maximum miss/leak rate (see output field max_miss_rate). Rename this key to reflect its semantics (e.g. should_not_fire_miss_rate_max) or adjust the runner logic so the name and behavior match.

Suggested change
"should_not_fire_rate_max": 0.5
"should_not_fire_miss_rate_max": 0.5

Comment on lines +205 to +211
with_score = score_fixture(with_text, scenario["rubric"])
without_score = score_fixture(without_text, scenario["rubric"])

delta_score = with_score["score"] - without_score["score"]
delta_anchors = len(
    set(with_score["anchors_hit"]) - set(without_score["anchors_hit"])
)

Copilot AI Apr 27, 2026

delta_anchors is computed from anchors_hit, but score_fixture() currently appends penalty checks (negative weights) into anchors_hit when they match. This means a worse with_skill output that triggers a penalty (e.g., marketing tone) could increase delta_anchors and still pass the content gate. Consider tracking penalty hits separately (or excluding checks with weight < 0 from anchors_hit / delta_anchors calculations).
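One way the suggested separation could look, as a sketch rather than the runner's actual implementation (rubric entries are assumed to carry `id`, `pattern`, and `weight`):

```python
import re

# Sketch: keep penalty checks (weight < 0) out of anchors_hit so
# delta_anchors only counts positive-evidence anchors; matched
# penalties still lower the score but are reported separately.
def score_fixture(text: str, rubric: list[dict]) -> dict:
    score = 0
    anchors_hit: list[str] = []
    anchors_missed: list[str] = []
    penalties_hit: list[str] = []
    for check in rubric:
        matched = bool(re.search(check["pattern"], text))
        weight = int(check["weight"])
        if weight < 0:
            if matched:
                score += weight  # negative weight lowers the score
                penalties_hit.append(check["id"])
            continue
        if matched:
            score += weight
            anchors_hit.append(check["id"])
        else:
            anchors_missed.append(check["id"])
    return {
        "score": score,
        "anchors_hit": anchors_hit,
        "anchors_missed": anchors_missed,
        "penalties_hit": penalties_hit,
    }
```

With penalties excluded from `anchors_hit`, a with-skill fixture that trips a marketing-tone penalty can no longer inflate `delta_anchors`.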

ts = summary["timestamp_utc"].replace(":", "-").replace("+00-00", "Z")
out_path = results_dir / f"{ts}.json"
out_path.write_text(json.dumps(summary, indent=2) + "\n", encoding="utf-8")
summary["result_file"] = str(out_path.relative_to(SKILL_DIR.parent.parent.parent))

Copilot AI Apr 27, 2026

summary["result_file"] = out_path.relative_to(...) will raise ValueError if the user passes --results-dir outside the repo/skill tree, causing the runner to crash after successfully writing results. Consider guarding this with try/except ValueError and falling back to an absolute path (or omit result_file) so --results-dir remains usable.

Suggested change
summary["result_file"] = str(out_path.relative_to(SKILL_DIR.parent.parent.parent))
try:
    summary["result_file"] = str(
        out_path.relative_to(SKILL_DIR.parent.parent.parent)
    )
except ValueError:
    summary["result_file"] = str(out_path.resolve())

Comment on lines +266 to +269
"--manifest",
default=str(EVALS_DIR / "evals.json"),
help="Path to evals manifest (default: <skill>/evals/evals.json).",
)

Copilot AI Apr 27, 2026

The --manifest flag suggests you can point the runner at an alternate manifest, but path resolution for triggers_manifest / content_manifests is hard-coded to EVALS_DIR (derived from the script location), not relative to the provided manifest file. If a user passes a manifest path outside the default directory, the runner will still try to load triggers/content from the built-in evals/ dir. Consider resolving relative paths against Path(args.manifest).parent (and passing that base into run_trigger_evals / run_content_evals).
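A minimal sketch of the manifest-relative resolution, assuming a small helper (the function name is illustrative; the runner would call it wherever it currently joins `EVALS_DIR` with a manifest entry):

```python
from pathlib import Path

# Sketch: anchor sub-manifest lookups to the directory of the manifest
# the user passed via --manifest, instead of the script-derived
# EVALS_DIR, so an alternate manifest brings its own triggers/content files.
def resolve_manifest_entry(manifest_path: Path, entry: str) -> Path:
    base = manifest_path.parent
    p = Path(entry)
    return p if p.is_absolute() else base / p
```

`run_trigger_evals` / `run_content_evals` would then receive the manifest path (or its parent as a base directory) rather than relying on the module-level constant.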

danielmeppiel pushed a commit that referenced this pull request Apr 27, 2026
Promotes [Unreleased] to [0.10.0] - 2026-04-27. Each PR since v0.9.4
gets one 'so what' line:

- #926 Microsoft 365 Cowork target ships impl
- #790 marketplace authoring CLI (init, package add/set, build, check,
  outdated, doctor, publish) -- collapsed from 20+ bullets to one
- #722 marketplace plugin -> package rename + --help sectioning -- collapsed
- #980 README 'Coming from npx skills add' conversion block
- #981 docs auto-deploy on tag push (real fix for the #953 attempt)
- #985 pr-description-skill evals suite
- #984 pr-description-skill mermaid hardening
- #989 cowork sys.platform mock for Windows CI

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>