microsoft · danielmeppiel · Apr 27, 2026 · Apr 27, 2026 · Copilot · Apr 27, 2026
@@ -0,0 +1,2 @@
+# Ignore generated result files; track only the .gitkeep sentinel.
+results/*.json
@@ -0,0 +1,132 @@
+# pr-description-skill evals
+
+This bundle answers two questions deterministically and without
+requiring an LLM API key:
+
+1. **TRIGGER EVALS**: does the SKILL.md `description:` reliably
+   match real should-fire intents and avoid near-miss queries?
+2. **CONTENT EVALS**: does loading the SKILL.md body change the
+   shape of a PR description an agent produces, vs not loading
+   it?
+
+## Layout
+
+```
+evals/
+  evals.json            # top-level manifest (gates, keyword lists)
+  triggers.json         # 18 trigger items (9 fire / 9 no-fire),
+                        # ~60/40 train/val split
+  content/
+    auth-refactor.json  # cross-cutting refactor scenario + rubric
+    docs-only.json      # docs-only PR scenario + rubric
+    dep-bump.json       # mechanical dep bump scenario + rubric
+  fixtures/
+    <id>__with_skill.md      # representative output produced
+                              # under the SKILL.md guidance
+    <id>__without_skill.md   # representative output produced
+                              # without the skill loaded
+  results/
+    .gitkeep            # tracked sentinel
+    <UTC-iso>.json      # one file per run (gitignored)
+  README.md             # this file
+```
+
+The runner script lives at
+`.apm/skills/pr-description-skill/scripts/run_evals.py`.
+
+## Run
+
+From the repo root (or this skill's directory):
+
+```
+python .apm/skills/pr-description-skill/scripts/run_evals.py
+```
+
+Common options:
+
+| Flag | Effect |
+|---|---|
+| `--filter triggers` | Run only the trigger evals. |
+| `--filter content` | Run only the content evals. |
+| `--split train` | Score the train split for triggers (default is `val`, the ship gate). |
+| `--split all` | Score both splits. |
+| `--no-write` | Do not write to `evals/results/`. |
+| `--quiet` | Suppress stderr diagnostics. |
+
+Exit codes:
+
+* `0` = all gates met
+* `1` = one or more gates failed
+* `2` = runner error (missing manifest, parse error, missing fixture)
+
+## Trigger eval scoring
+
+The runner uses a deterministic dispatcher approximation defined in
+`evals.json`:
+
+1. If any phrase from `stop_list` appears verbatim in the query
+   (lowercased), predict `no_fire`.
+2. Otherwise if any phrase from `trigger_keywords_primary` appears
+   verbatim, predict `fire`.
+3. Otherwise count distinct tokens from
+   `trigger_keywords_secondary`; predict `fire` iff at least 3
+   distinct tokens AND one of `{pr, pull}` is present.
+
+This is NOT a perfect proxy for an actual LLM dispatcher. It is
+fast, deterministic, and CI-friendly. When `gh models` becomes
+available, an `--llm` flag can be added that calls the real
+dispatcher; the manifest schema already accommodates it.
+
+Ship gate (validation split):
+
+* should-fire correctness >= 0.5
+* should-not-fire correctness >= 0.5 (i.e. less than 50% of
+  near-miss queries leak through as `fire`)
+
+## Content eval scoring
+
+Each scenario ships two fixtures: one representing the agent's
+output WITH the skill loaded, one representing the same task
+WITHOUT it. The same regex rubric scores both. The reported
+`delta_anchors` is the count of rubric anchors that fire on
+`with_skill` but not on `without_skill`.
+
+Ship gate: every scenario has `delta_anchors >= 1`. A scenario
+with zero delta is a signal that the skill is not adding
+measurable value on that shape; the genesis discipline says
+redesign or delete. (This runner only flags; deletion is a
+human decision.)
+
+### Fixtures and the LLM-in-the-loop question
+
+The fixtures are pre-recorded representative outputs. They are
+NOT regenerated by an LLM at run time. Two reasons:
+
+1. **Determinism**: re-running the suite must produce identical
+   results so regressions are visible.
+2. **No required keys**: contributors should be able to run the
+   suite with only Python stdlib.
+
+When a maintainer materially changes SKILL.md, the fixtures
+SHOULD be regenerated by hand (or by running the agent once on
+each scenario, with and without the skill loaded, and pasting
+the outputs in). Document the regeneration in the PR that
+changes SKILL.md.
+
+## Adding a new eval
+
+* **New trigger eval**: append an entry to `triggers.json` with
+  a fresh `id`, the `query`, the `expected` outcome, and the
+  `split`. Keep the 60/40 ratio roughly intact.
+* **New content scenario**: create `content/<id>.json` (mirror
+  one of the existing files), add `fixtures/<id>__with_skill.md`
+  and `fixtures/<id>__without_skill.md`, and append the path to
+  `content_manifests` in `evals.json`.
+
+## Encoding
+
+All eval source files (Python, JSON, README) stay within
+printable ASCII per the repo-wide encoding rule. Fixtures are
+markdown intended to model PR-body output and MAY contain
+Unicode, mirroring the SKILL.md "Output charset rule" -- but the
+provided fixtures here remain ASCII for portability.
-All eval source files (Python, JSON, README) stay within
-printable ASCII per the repo-wide encoding rule. Fixtures are
-markdown intended to model PR-body output and MAY contain
-Unicode, mirroring the SKILL.md "Output charset rule" -- but the
-provided fixtures here remain ASCII for portability.
+All eval source files (Python, JSON, README, and fixture
+markdown) stay within printable ASCII per the repo-wide
+encoding rule. Although the fixtures model PR-body output, the
+tracked fixture files in this repo must also remain ASCII for
+portability and policy compliance.
-All eval source files (Python, JSON, README) stay within
-printable ASCII per the repo-wide encoding rule. Fixtures are
-markdown intended to model PR-body output and MAY contain
-Unicode, mirroring the SKILL.md "Output charset rule" -- but the
-provided fixtures here remain ASCII for portability.
+All eval source files (Python, JSON, README, and fixture
+markdown) stay within printable ASCII per the repo-wide
+encoding rule. Although the fixtures model PR-body output, the
+tracked fixture files in this repo must also remain ASCII for
+portability and policy compliance.
@@ -0,0 +1,43 @@
+{
+  "schema_version": 1,
+  "id": "auth-refactor",
+  "summary": "Cross-cutting auth refactor: token resolution moves into a single AuthResolver; multiple call sites updated; one breaking change to a public CLI flag. PR is non-trivial and exercises every required section.",
+  "scenario": {
+    "branch": "refactor/auth-resolver",
+    "base": "main",
+    "files_changed": [
+      "M src/apm_cli/auth/__init__.py",
+      "A src/apm_cli/auth/resolver.py",
+      "M src/apm_cli/cli.py",
+      "M src/apm_cli/integration/git.py",
+      "M tests/unit/auth/test_resolver.py",
+      "M CHANGELOG.md"
+    ],
+    "commits": [
+      "feat(auth): introduce AuthResolver as single source of token resolution",
+      "refactor(cli): replace per-host PAT lookup with AuthResolver.get(host)",
+      "fix(integration/git): drop legacy GITHUB_APM_PAT fallback path",
+      "docs(changelog): note breaking removal of --token-source flag"
+    ],
+    "linked_issue": "#812 -- token resolution scattered across modules; intermittent EMU lookups failing",
+    "validation_evidence": "uv run pytest tests/unit/auth -x  ->  142 passed in 3.4s\napm audit --ci  ->  0 findings"
+  },
+  "rubric": [
+    {"id": "tldr-present", "pattern": "(?im)^\\s*##?\\s*TL;DR\\b", "weight": 1, "description": "TL;DR section header present"},
+    {"id": "tldr-short", "pattern": "(?is)##?\\s*TL;DR.{1,800}?\\n##?\\s", "weight": 1, "description": "TL;DR is short (under 800 chars before next H2)"},
+    {"id": "problem-section", "pattern": "(?im)^\\s*##?\\s*Problem\\b", "weight": 1, "description": "Problem (WHY) section present"},
+    {"id": "anchored-quote", "pattern": "\\[\"[^\"]{8,}\"\\]\\(https?://", "weight": 2, "description": "At least one verbatim quote anchored to a URL"},
+    {"id": "approach-or-implementation", "pattern": "(?im)^\\s*##?\\s*(Approach|Implementation)\\b", "weight": 1, "description": "Approach or Implementation section present"},
+    {"id": "mermaid-block", "pattern": "(?s)```mermaid[\\s\\S]+?```", "weight": 2, "description": "At least one mermaid diagram block"},
+    {"id": "tradeoffs-section", "pattern": "(?im)^\\s*##?\\s*Trade-?offs?\\b", "weight": 1, "description": "Trade-offs section present"},
+    {"id": "validation-section", "pattern": "(?im)^\\s*##?\\s*Validation\\b", "weight": 1, "description": "Validation section present"},
+    {"id": "validation-real-output", "pattern": "(?im)pytest|apm audit|uv run", "weight": 1, "description": "Validation shows real command output"},
+    {"id": "how-to-test", "pattern": "(?im)^\\s*##?\\s*How to test\\b", "weight": 1, "description": "How to test section present"},
+    {"id": "trailer", "pattern": "Co-authored-by: Copilot <223556219", "weight": 1, "description": "Copilot co-author trailer present"},
+    {"id": "no-marketing-tone", "pattern": "(?i)significantly enhances|best-in-class|game-changing|revolutionary", "weight": -2, "description": "PENALTY: marketing tone detected"}
+  ],
+  "fixtures": {
+    "with_skill": "../fixtures/auth-refactor__with_skill.md",
+    "without_skill": "../fixtures/auth-refactor__without_skill.md"
+  }
+}
@@ -0,0 +1,37 @@
+{
+  "schema_version": 1,
+  "id": "dep-bump",
+  "summary": "Mechanical dependency upgrade: bump click from 8.1.7 to 8.2.1; lockfile-only changes plus a 1-line type annotation tweak. Tests Trade-offs may be 1-2 bullets and mermaid is optional.",
+  "scenario": {
+    "branch": "deps/click-8.2.1",
+    "base": "main",
+    "files_changed": [
+      "M pyproject.toml",
+      "M uv.lock",
+      "M src/apm_cli/cli.py"
+    ],
+    "commits": [
+      "chore(deps): bump click 8.1.7 -> 8.2.1",
+      "chore(cli): adjust type hint for new click.Context generic"
+    ],
+    "linked_issue": "(none)",
+    "validation_evidence": "uv sync --extra dev  ->  resolved in 2.1s\nuv run pytest tests/unit -x  ->  2418 passed in 38.7s"
+  },
+  "rubric": [
+    {"id": "tldr-present", "pattern": "(?im)^\\s*##?\\s*TL;DR\\b", "weight": 1, "description": "TL;DR present"},
+    {"id": "tldr-short", "pattern": "(?is)##?\\s*TL;DR.{1,500}?\\n##?\\s", "weight": 1, "description": "TL;DR very short on a mechanical PR"},
+    {"id": "problem-section", "pattern": "(?im)^\\s*##?\\s*Problem\\b", "weight": 1, "description": "Problem section present"},
+    {"id": "implementation-section", "pattern": "(?im)^\\s*##?\\s*Implementation\\b", "weight": 1, "description": "Implementation section present"},
+    {"id": "tradeoffs-section", "pattern": "(?im)^\\s*##?\\s*Trade-?offs?\\b", "weight": 1, "description": "Trade-offs section present (may be 1-2 bullets)"},
+    {"id": "validation-section", "pattern": "(?im)^\\s*##?\\s*Validation\\b", "weight": 1, "description": "Validation section present"},
+    {"id": "validation-real-output", "pattern": "(?im)pytest|uv sync|2418 passed", "weight": 1, "description": "Real validation output (pytest counts) shown"},
+    {"id": "how-to-test", "pattern": "(?im)^\\s*##?\\s*How to test\\b", "weight": 1, "description": "How to test section present"},
+    {"id": "trailer", "pattern": "Co-authored-by: Copilot <223556219", "weight": 1, "description": "Copilot co-author trailer present"},
+    {"id": "no-marketing-tone", "pattern": "(?i)significantly enhances|best-in-class|game-changing", "weight": -2, "description": "PENALTY: marketing tone detected"},
+    {"id": "no-diff-restate", "pattern": "(?im)^[+-]{3} ", "weight": -1, "description": "PENALTY: diff lines restated in body"}
+  ],
+  "fixtures": {
+    "with_skill": "../fixtures/dep-bump__with_skill.md",
+    "without_skill": "../fixtures/dep-bump__without_skill.md"
+  }
+}
@@ -0,0 +1,34 @@
+{
+  "schema_version": 1,
+  "id": "docs-only",
+  "summary": "Markdown-only PR adding a quickstart guide. The skill must keep TL;DR, Problem, Validation, How-to-test sections even though the PR is 'trivial' (per SKILL.md gotcha).",
+  "scenario": {
+    "branch": "docs/quickstart",
+    "base": "main",
+    "files_changed": [
+      "A docs/src/content/docs/guides/quickstart.md",
+      "M docs/src/content/docs/index.md"
+    ],
+    "commits": [
+      "docs(quickstart): add 5-minute getting-started guide",
+      "docs(index): link quickstart from landing page"
+    ],
+    "linked_issue": "#640 -- new contributors bounce after install; need a 5-minute happy path",
+    "validation_evidence": "npm run --prefix docs build  ->  built in 4.2s, no broken links\napm audit --ci  ->  0 findings"
+  },
+  "rubric": [
+    {"id": "tldr-present", "pattern": "(?im)^\\s*##?\\s*TL;DR\\b", "weight": 1, "description": "TL;DR present even on docs-only PR"},
+    {"id": "problem-section", "pattern": "(?im)^\\s*##?\\s*Problem\\b", "weight": 1, "description": "Problem section present even on docs-only PR (per SKILL.md gotcha)"},
+    {"id": "anchored-quote", "pattern": "\\[\"[^\"]{8,}\"\\]\\(https?://", "weight": 2, "description": "At least one verbatim anchored quote"},
+    {"id": "validation-section", "pattern": "(?im)^\\s*##?\\s*Validation\\b", "weight": 1, "description": "Validation section present"},
+    {"id": "validation-build", "pattern": "(?im)npm run|build|broken link", "weight": 1, "description": "Real validation evidence (build/link check) shown"},
+    {"id": "how-to-test", "pattern": "(?im)^\\s*##?\\s*How to test\\b", "weight": 1, "description": "How to test present"},
+    {"id": "trailer", "pattern": "Co-authored-by: Copilot <223556219", "weight": 1, "description": "Copilot co-author trailer present"},
+    {"id": "no-trivial-skip", "pattern": "(?i)the PR is trivial|too small to need", "weight": -2, "description": "PENALTY: skipping sections because PR is small"},
+    {"id": "no-marketing-tone", "pattern": "(?i)significantly enhances|best-in-class|game-changing", "weight": -2, "description": "PENALTY: marketing tone detected"}
+  ],
+  "fixtures": {
+    "with_skill": "../fixtures/docs-only__with_skill.md",
+    "without_skill": "../fixtures/docs-only__without_skill.md"
+  }
+}
@@ -0,0 +1,62 @@
+{
+  "schema_version": 1,
+  "skill": "pr-description-skill",
+  "skill_path": "../SKILL.md",
+  "triggers_manifest": "triggers.json",
+  "content_manifests": [
+    "content/auth-refactor.json",
+    "content/docs-only.json",
+    "content/dep-bump.json"
+  ],
+  "gates": {
+    "triggers": {
+      "split": "val",
+      "should_fire_rate_min": 0.5,
+      "should_not_fire_rate_max": 0.5
-      "should_not_fire_rate_max": 0.5
+      "should_not_fire_miss_rate_max": 0.5
-      "should_not_fire_rate_max": 0.5
+      "should_not_fire_miss_rate_max": 0.5
+    },
+    "content": {
+      "delta_min_anchors": 1,
+      "scope": "every_scenario"
+    }
+  },
+  "stop_list": [
+    "commit message",
+    "release note",
+    "release notes",
+    "changelog",
+    "open an issue",
+    "open issue",
+    "review this pr",
+    "code review comment",
+    "design doc",
+    "summarize the diff"
+  ],
+  "trigger_keywords_primary": [
+    "pr description",
+    "pr body",
+    "pull request description",
+    "pull request body",
+    "pr template",
+    "pr write-up",
+    "pr writeup",
+    "draft the pr",
+    "draft a pr",
+    "open the pr",
+    "open a pr",
+    "summarize this branch as a pr",
+    "fill in the pr template"
+  ],
+  "trigger_keywords_secondary": [
+    "pr",
+    "pull",
+    "request",
+    "description",
+    "body",
+    "draft",
+    "open",
+    "write",
+    "fill",
+    "summarize",
+    "branch"
+  ]
+}
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1,2 @@
		# Ignore generated result files; track only the .gitkeep sentinel.
		results/*.json