2 changes: 2 additions & 0 deletions .apm/skills/pr-description-skill/evals/.gitignore
@@ -0,0 +1,2 @@
# Ignore generated result files; track only the .gitkeep sentinel.
results/*.json
132 changes: 132 additions & 0 deletions .apm/skills/pr-description-skill/evals/README.md
@@ -0,0 +1,132 @@
# pr-description-skill evals

This bundle answers two questions deterministically and without
requiring an LLM API key:

1. **TRIGGER EVALS**: does the SKILL.md `description:` reliably
match real should-fire intents and avoid near-miss queries?
2. **CONTENT EVALS**: does loading the SKILL.md body change the
shape of a PR description an agent produces, vs not loading
it?

## Layout

```
evals/
  evals.json                  # top-level manifest (gates, keyword lists)
  triggers.json               # 18 trigger items (9 fire / 9 no-fire),
                              # ~60/40 train/val split
  content/
    auth-refactor.json        # cross-cutting refactor scenario + rubric
    docs-only.json            # docs-only PR scenario + rubric
    dep-bump.json             # mechanical dep bump scenario + rubric
  fixtures/
    <id>__with_skill.md       # representative output produced
                              # under the SKILL.md guidance
    <id>__without_skill.md    # representative output produced
                              # without the skill loaded
  results/
    .gitkeep                  # tracked sentinel
    <UTC-iso>.json            # one file per run (gitignored)
  README.md                   # this file
```

The runner script lives at
`.apm/skills/pr-description-skill/scripts/run_evals.py`.

## Run

From the repo root (or this skill's directory):

```
python .apm/skills/pr-description-skill/scripts/run_evals.py
```

Common options:

| Flag | Effect |
|---|---|
| `--filter triggers` | Run only the trigger evals. |
| `--filter content` | Run only the content evals. |
| `--split train` | Score the train split for triggers (default is `val`, the ship gate). |
| `--split all` | Score both splits. |
| `--no-write` | Do not write to `evals/results/`. |
| `--quiet` | Suppress stderr diagnostics. |

Exit codes:

* `0` = all gates met
* `1` = one or more gates failed
* `2` = runner error (missing manifest, parse error, missing fixture)
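
A CI wrapper might interpret these exit codes as in the sketch
below. This is illustrative only: `describe_exit` and `run_evals`
are hypothetical helper names, not part of the runner, and the
script path assumes the current directory is the repo root.

```python
import subprocess
import sys

# Exit codes as documented above; the labels reuse the README wording.
EXIT_MEANINGS = {
    0: "all gates met",
    1: "one or more gates failed",
    2: "runner error (missing manifest, parse error, missing fixture)",
}


def describe_exit(code: int) -> str:
    """Map a runner exit code to a short human-readable label."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_evals(extra_args: list) -> int:
    """Invoke the runner (hypothetical wrapper) and return its exit code."""
    cmd = [
        sys.executable,
        ".apm/skills/pr-description-skill/scripts/run_evals.py",
        *extra_args,
    ]
    return subprocess.run(cmd).returncode
```

For example, `describe_exit(run_evals(["--filter", "triggers", "--no-write"]))`
would score only the trigger evals without writing a results file.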

## Trigger eval scoring

The runner uses a deterministic dispatcher approximation defined in
`evals.json`:

1. If any phrase from `stop_list` appears verbatim in the query
(lowercased), predict `no_fire`.
2. Otherwise if any phrase from `trigger_keywords_primary` appears
verbatim, predict `fire`.
3. Otherwise count distinct tokens from
`trigger_keywords_secondary`; predict `fire` iff at least 3
distinct tokens AND one of `{pr, pull}` is present.

This is NOT a perfect proxy for an actual LLM dispatcher. It is
fast, deterministic, and CI-friendly. When `gh models` becomes
available, an `--llm` flag can be added that calls the real
dispatcher; the manifest schema already accommodates it.
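
The three rules above can be sketched roughly as follows. This is
not the code in `run_evals.py`; the function name `predict`, the
trimmed `MANIFEST` subset, and the tokenization (runs of
`[a-z0-9]`) are assumptions made for illustration.

```python
import re

# Small subset of the lists in evals.json, for illustration only.
MANIFEST = {
    "stop_list": ["changelog", "review this pr"],
    "trigger_keywords_primary": ["pr description", "draft the pr"],
    "trigger_keywords_secondary": [
        "pr", "pull", "request", "description", "body", "draft",
        "open", "write", "fill", "summarize", "branch",
    ],
}


def predict(query: str, manifest: dict) -> str:
    """Return "fire" or "no_fire" by applying the three rules in order."""
    q = query.lower()
    # Rule 1: any stop phrase wins outright.
    if any(phrase in q for phrase in manifest["stop_list"]):
        return "no_fire"
    # Rule 2: any primary phrase fires.
    if any(phrase in q for phrase in manifest["trigger_keywords_primary"]):
        return "fire"
    # Rule 3. ASSUMPTION: tokens are runs of [a-z0-9]; the real runner
    # may tokenize differently.
    tokens = set(re.findall(r"[a-z0-9]+", q))
    hits = tokens & set(manifest["trigger_keywords_secondary"])
    if len(hits) >= 3 and hits & {"pr", "pull"}:
        return "fire"
    return "no_fire"
```

Checking the stop list first means a query such as "write the
changelog for this PR description" is rejected even though it
contains a primary phrase.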

Ship gate (validation split):

* should-fire correctness >= 0.5
* should-not-fire correctness >= 0.5 (i.e. at most 50% of
near-miss queries leak through as `fire`)
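
A minimal sketch of how such a gate could be evaluated, assuming
each scored item is an `(expected, predicted)` pair of `fire` /
`no_fire` labels and that both thresholds are inclusive. The name
`trigger_gate` and its parameters are hypothetical; the defaults
mirror the 0.5 values in `evals.json`.

```python
def trigger_gate(results, fire_min=0.5, leak_max=0.5):
    """Check the trigger ship gate.

    results: iterable of (expected, predicted) label pairs.
    Returns (passed, should_fire_rate, leak_rate).
    """
    results = list(results)
    fire_preds = [p for e, p in results if e == "fire"]
    no_fire_preds = [p for e, p in results if e == "no_fire"]
    if not fire_preds or not no_fire_preds:
        raise ValueError("need at least one item of each expected label")
    # Fraction of should-fire items that were predicted as fire.
    fire_rate = sum(p == "fire" for p in fire_preds) / len(fire_preds)
    # Fraction of should-not-fire items that leaked through as fire.
    leak_rate = sum(p == "fire" for p in no_fire_preds) / len(no_fire_preds)
    passed = fire_rate >= fire_min and leak_rate <= leak_max
    return passed, fire_rate, leak_rate
```

Expressing the second threshold as a leak rate keeps the
comparison direction (`<=`) explicit.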

## Content eval scoring

Each scenario ships two fixtures: one representing the agent's
output WITH the skill loaded, one representing the same task
WITHOUT it. The same regex rubric scores both. The reported
`delta_anchors` is the count of rubric anchors that fire on
`with_skill` but not on `without_skill`.

Ship gate: every scenario has `delta_anchors >= 1`. A scenario
with zero delta is a signal that the skill is not adding
measurable value on that shape; the genesis discipline says
redesign or delete. (This runner only flags; deletion is a
human decision.)
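
The delta computation can be sketched as below. The two rubric
patterns for section headers are copied from the content
manifests; the sample texts are made up and are not the shipped
fixtures. Excluding penalty anchors (negative weight) from the
delta is an assumption of this sketch, and `fired` and
`delta_anchors` are hypothetical function names.

```python
import re

RUBRIC = [
    {"id": "tldr-present", "pattern": r"(?im)^\s*##?\s*TL;DR\b", "weight": 1},
    {"id": "validation-section", "pattern": r"(?im)^\s*##?\s*Validation\b", "weight": 1},
    {"id": "no-marketing-tone",
     "pattern": r"(?i)significantly enhances|best-in-class|game-changing",
     "weight": -2},
]

# Illustrative stand-ins for the with_skill / without_skill fixtures.
WITH_SKILL = (
    "## TL;DR\nMoves token lookup into AuthResolver.\n\n"
    "## Validation\nuv run pytest tests/unit/auth -x\n"
)
WITHOUT_SKILL = "This PR significantly enhances auth.\n"


def fired(rubric, text):
    """Return the ids of all rubric anchors whose pattern matches text."""
    return {a["id"] for a in rubric if re.search(a["pattern"], text)}


def delta_anchors(rubric, with_text, without_text):
    """Count anchors that fire on with_text but not on without_text.

    ASSUMPTION: only positive-weight anchors count toward the delta.
    """
    positive = [a for a in rubric if a["weight"] > 0]
    return len(fired(positive, with_text) - fired(positive, without_text))
```

With these samples both section anchors fire only on the
with-skill text, so the scenario would clear a
`delta_anchors >= 1` gate.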

### Fixtures and the LLM-in-the-loop question

The fixtures are pre-recorded representative outputs. They are
NOT regenerated by an LLM at run time. Two reasons:

1. **Determinism**: re-running the suite must produce identical
results so regressions are visible.
2. **No required keys**: contributors should be able to run the
suite with only Python stdlib.

When a maintainer materially changes SKILL.md, the fixtures
SHOULD be regenerated by hand (or by running the agent once on
each scenario, with and without the skill loaded, and pasting
the outputs in). Document the regeneration in the PR that
changes SKILL.md.

## Adding a new eval

* **New trigger eval**: append an entry to `triggers.json` with
a fresh `id`, the `query`, the `expected` outcome, and the
`split`. Keep the 60/40 ratio roughly intact.
* **New content scenario**: create `content/<id>.json` (mirror
one of the existing files), add `fixtures/<id>__with_skill.md`
and `fixtures/<id>__without_skill.md`, and append the path to
`content_manifests` in `evals.json`.
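
As a rough illustration of the trigger case, the sketch below
appends an entry to an in-memory list and checks the train/val
ratio. It assumes `triggers.json` is a flat list of objects with
exactly the four fields named above and that `expected` takes the
values `fire` / `no_fire`; neither is confirmed here, and
`add_trigger` and `train_fraction` are hypothetical helpers.

```python
def add_trigger(items, new_id, query, expected, split):
    """Return a new list with one more trigger entry appended."""
    if any(item["id"] == new_id for item in items):
        raise ValueError(f"duplicate id: {new_id}")
    if expected not in ("fire", "no_fire"):  # assumed label values
        raise ValueError(f"unknown expected outcome: {expected}")
    if split not in ("train", "val"):
        raise ValueError(f"unknown split: {split}")
    entry = {"id": new_id, "query": query, "expected": expected, "split": split}
    return [*items, entry]


def train_fraction(items):
    """Fraction of entries in the train split (target is roughly 0.6)."""
    return sum(item["split"] == "train" for item in items) / len(items)
```

Checking `train_fraction` after each addition is one way to
notice when the 60/40 ratio starts to drift.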

## Encoding

All eval source files (Python, JSON, README) stay within
printable ASCII per the repo-wide encoding rule. Fixtures are
markdown intended to model PR-body output and MAY contain
Unicode, mirroring the SKILL.md "Output charset rule" -- but the
provided fixtures here remain ASCII for portability.
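
A check for the ASCII rule could look like the sketch below. It
is not part of the runner; `non_ascii_positions` is a
hypothetical helper, and treating tab, LF, and CR as allowed
alongside printable ASCII is an assumption.

```python
# Printable ASCII (0x20-0x7E) plus tab, LF, and CR (assumed allowed).
ALLOWED = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def non_ascii_positions(data: bytes):
    """Return (line, column) for every byte outside the allowed set.

    Lines and columns are 1-based; columns count bytes, not characters.
    """
    bad = []
    line, col = 1, 0
    for byte in data:
        col += 1
        if byte not in ALLOWED:
            bad.append((line, col))
        if byte == 0x0A:  # newline: advance to the next line
            line, col = line + 1, 0
    return bad
```

An empty result means the file passes; reading each file in
binary mode (`open(path, "rb")`) avoids decoding errors on
offending bytes.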
Comment on lines +128 to +132

Copilot AI Apr 27, 2026

The README says fixtures "MAY contain Unicode". These fixture files are tracked source files in the repo (and under .github/ in the regenerated mirror), so they still fall under the repo-wide ASCII-only encoding rule (see .github/instructions/encoding.instructions.md). Please update this section to avoid encouraging future non-ASCII fixture content.

Suggested change
- All eval source files (Python, JSON, README) stay within
- printable ASCII per the repo-wide encoding rule. Fixtures are
- markdown intended to model PR-body output and MAY contain
- Unicode, mirroring the SKILL.md "Output charset rule" -- but the
- provided fixtures here remain ASCII for portability.
+ All eval source files (Python, JSON, README, and fixture
+ markdown) stay within printable ASCII per the repo-wide
+ encoding rule. Although the fixtures model PR-body output, the
+ tracked fixture files in this repo must also remain ASCII for
+ portability and policy compliance.

@@ -0,0 +1,43 @@
{
  "schema_version": 1,
  "id": "auth-refactor",
  "summary": "Cross-cutting auth refactor: token resolution moves into a single AuthResolver; multiple call sites updated; one breaking change to a public CLI flag. PR is non-trivial and exercises every required section.",
  "scenario": {
    "branch": "refactor/auth-resolver",
    "base": "main",
    "files_changed": [
      "M src/apm_cli/auth/__init__.py",
      "A src/apm_cli/auth/resolver.py",
      "M src/apm_cli/cli.py",
      "M src/apm_cli/integration/git.py",
      "M tests/unit/auth/test_resolver.py",
      "M CHANGELOG.md"
    ],
    "commits": [
      "feat(auth): introduce AuthResolver as single source of token resolution",
      "refactor(cli): replace per-host PAT lookup with AuthResolver.get(host)",
      "fix(integration/git): drop legacy GITHUB_APM_PAT fallback path",
      "docs(changelog): note breaking removal of --token-source flag"
    ],
    "linked_issue": "#812 -- token resolution scattered across modules; intermittent EMU lookups failing",
    "validation_evidence": "uv run pytest tests/unit/auth -x -> 142 passed in 3.4s\napm audit --ci -> 0 findings"
  },
  "rubric": [
    {"id": "tldr-present", "pattern": "(?im)^\\s*##?\\s*TL;DR\\b", "weight": 1, "description": "TL;DR section header present"},
    {"id": "tldr-short", "pattern": "(?is)##?\\s*TL;DR.{1,800}?\\n##?\\s", "weight": 1, "description": "TL;DR is short (under 800 chars before next H2)"},
    {"id": "problem-section", "pattern": "(?im)^\\s*##?\\s*Problem\\b", "weight": 1, "description": "Problem (WHY) section present"},
    {"id": "anchored-quote", "pattern": "\\[\"[^\"]{8,}\"\\]\\(https?://", "weight": 2, "description": "At least one verbatim quote anchored to a URL"},
    {"id": "approach-or-implementation", "pattern": "(?im)^\\s*##?\\s*(Approach|Implementation)\\b", "weight": 1, "description": "Approach or Implementation section present"},
    {"id": "mermaid-block", "pattern": "(?s)```mermaid[\\s\\S]+?```", "weight": 2, "description": "At least one mermaid diagram block"},
    {"id": "tradeoffs-section", "pattern": "(?im)^\\s*##?\\s*Trade-?offs?\\b", "weight": 1, "description": "Trade-offs section present"},
    {"id": "validation-section", "pattern": "(?im)^\\s*##?\\s*Validation\\b", "weight": 1, "description": "Validation section present"},
    {"id": "validation-real-output", "pattern": "(?im)pytest|apm audit|uv run", "weight": 1, "description": "Validation shows real command output"},
    {"id": "how-to-test", "pattern": "(?im)^\\s*##?\\s*How to test\\b", "weight": 1, "description": "How to test section present"},
    {"id": "trailer", "pattern": "Co-authored-by: Copilot <223556219", "weight": 1, "description": "Copilot co-author trailer present"},
    {"id": "no-marketing-tone", "pattern": "(?i)significantly enhances|best-in-class|game-changing|revolutionary", "weight": -2, "description": "PENALTY: marketing tone detected"}
  ],
  "fixtures": {
    "with_skill": "../fixtures/auth-refactor__with_skill.md",
    "without_skill": "../fixtures/auth-refactor__without_skill.md"
  }
}
37 changes: 37 additions & 0 deletions .apm/skills/pr-description-skill/evals/content/dep-bump.json
@@ -0,0 +1,37 @@
{
  "schema_version": 1,
  "id": "dep-bump",
  "summary": "Mechanical dependency upgrade: bump click from 8.1.7 to 8.2.1; lockfile-only changes plus a 1-line type annotation tweak. Tests that Trade-offs may be 1-2 bullets and that mermaid is optional.",
  "scenario": {
    "branch": "deps/click-8.2.1",
    "base": "main",
    "files_changed": [
      "M pyproject.toml",
      "M uv.lock",
      "M src/apm_cli/cli.py"
    ],
    "commits": [
      "chore(deps): bump click 8.1.7 -> 8.2.1",
      "chore(cli): adjust type hint for new click.Context generic"
    ],
    "linked_issue": "(none)",
    "validation_evidence": "uv sync --extra dev -> resolved in 2.1s\nuv run pytest tests/unit -x -> 2418 passed in 38.7s"
  },
  "rubric": [
    {"id": "tldr-present", "pattern": "(?im)^\\s*##?\\s*TL;DR\\b", "weight": 1, "description": "TL;DR present"},
    {"id": "tldr-short", "pattern": "(?is)##?\\s*TL;DR.{1,500}?\\n##?\\s", "weight": 1, "description": "TL;DR very short on a mechanical PR"},
    {"id": "problem-section", "pattern": "(?im)^\\s*##?\\s*Problem\\b", "weight": 1, "description": "Problem section present"},
    {"id": "implementation-section", "pattern": "(?im)^\\s*##?\\s*Implementation\\b", "weight": 1, "description": "Implementation section present"},
    {"id": "tradeoffs-section", "pattern": "(?im)^\\s*##?\\s*Trade-?offs?\\b", "weight": 1, "description": "Trade-offs section present (may be 1-2 bullets)"},
    {"id": "validation-section", "pattern": "(?im)^\\s*##?\\s*Validation\\b", "weight": 1, "description": "Validation section present"},
    {"id": "validation-real-output", "pattern": "(?im)pytest|uv sync|2418 passed", "weight": 1, "description": "Real validation output (pytest counts) shown"},
    {"id": "how-to-test", "pattern": "(?im)^\\s*##?\\s*How to test\\b", "weight": 1, "description": "How to test section present"},
    {"id": "trailer", "pattern": "Co-authored-by: Copilot <223556219", "weight": 1, "description": "Copilot co-author trailer present"},
    {"id": "no-marketing-tone", "pattern": "(?i)significantly enhances|best-in-class|game-changing", "weight": -2, "description": "PENALTY: marketing tone detected"},
    {"id": "no-diff-restate", "pattern": "(?im)^[+-]{3} ", "weight": -1, "description": "PENALTY: diff lines restated in body"}
  ],
  "fixtures": {
    "with_skill": "../fixtures/dep-bump__with_skill.md",
    "without_skill": "../fixtures/dep-bump__without_skill.md"
  }
}
34 changes: 34 additions & 0 deletions .apm/skills/pr-description-skill/evals/content/docs-only.json
@@ -0,0 +1,34 @@
{
  "schema_version": 1,
  "id": "docs-only",
  "summary": "Markdown-only PR adding a quickstart guide. The skill must keep TL;DR, Problem, Validation, How-to-test sections even though the PR is 'trivial' (per SKILL.md gotcha).",
  "scenario": {
    "branch": "docs/quickstart",
    "base": "main",
    "files_changed": [
      "A docs/src/content/docs/guides/quickstart.md",
      "M docs/src/content/docs/index.md"
    ],
    "commits": [
      "docs(quickstart): add 5-minute getting-started guide",
      "docs(index): link quickstart from landing page"
    ],
    "linked_issue": "#640 -- new contributors bounce after install; need a 5-minute happy path",
    "validation_evidence": "npm run --prefix docs build -> built in 4.2s, no broken links\napm audit --ci -> 0 findings"
  },
  "rubric": [
    {"id": "tldr-present", "pattern": "(?im)^\\s*##?\\s*TL;DR\\b", "weight": 1, "description": "TL;DR present even on docs-only PR"},
    {"id": "problem-section", "pattern": "(?im)^\\s*##?\\s*Problem\\b", "weight": 1, "description": "Problem section present even on docs-only PR (per SKILL.md gotcha)"},
    {"id": "anchored-quote", "pattern": "\\[\"[^\"]{8,}\"\\]\\(https?://", "weight": 2, "description": "At least one verbatim anchored quote"},
    {"id": "validation-section", "pattern": "(?im)^\\s*##?\\s*Validation\\b", "weight": 1, "description": "Validation section present"},
    {"id": "validation-build", "pattern": "(?im)npm run|build|broken link", "weight": 1, "description": "Real validation evidence (build/link check) shown"},
    {"id": "how-to-test", "pattern": "(?im)^\\s*##?\\s*How to test\\b", "weight": 1, "description": "How to test present"},
    {"id": "trailer", "pattern": "Co-authored-by: Copilot <223556219", "weight": 1, "description": "Copilot co-author trailer present"},
    {"id": "no-trivial-skip", "pattern": "(?i)the PR is trivial|too small to need", "weight": -2, "description": "PENALTY: skipping sections because PR is small"},
    {"id": "no-marketing-tone", "pattern": "(?i)significantly enhances|best-in-class|game-changing", "weight": -2, "description": "PENALTY: marketing tone detected"}
  ],
  "fixtures": {
    "with_skill": "../fixtures/docs-only__with_skill.md",
    "without_skill": "../fixtures/docs-only__without_skill.md"
  }
}
62 changes: 62 additions & 0 deletions .apm/skills/pr-description-skill/evals/evals.json
@@ -0,0 +1,62 @@
{
  "schema_version": 1,
  "skill": "pr-description-skill",
  "skill_path": "../SKILL.md",
  "triggers_manifest": "triggers.json",
  "content_manifests": [
    "content/auth-refactor.json",
    "content/docs-only.json",
    "content/dep-bump.json"
  ],
  "gates": {
    "triggers": {
      "split": "val",
      "should_fire_rate_min": 0.5,
      "should_not_fire_rate_max": 0.5

Copilot AI Apr 27, 2026

The triggers gate key should_not_fire_rate_max reads like a maximum correctness rate, but the runner treats it as a maximum miss/leak rate (see output field max_miss_rate). Rename this key to reflect its semantics (e.g. should_not_fire_miss_rate_max) or adjust the runner logic so the name and behavior match.

Suggested change
- "should_not_fire_rate_max": 0.5
+ "should_not_fire_miss_rate_max": 0.5

    },
    "content": {
      "delta_min_anchors": 1,
      "scope": "every_scenario"
    }
  },
  "stop_list": [
    "commit message",
    "release note",
    "release notes",
    "changelog",
    "open an issue",
    "open issue",
    "review this pr",
    "code review comment",
    "design doc",
    "summarize the diff"
  ],
  "trigger_keywords_primary": [
    "pr description",
    "pr body",
    "pull request description",
    "pull request body",
    "pr template",
    "pr write-up",
    "pr writeup",
    "draft the pr",
    "draft a pr",
    "open the pr",
    "open a pr",
    "summarize this branch as a pr",
    "fill in the pr template"
  ],
  "trigger_keywords_secondary": [
    "pr",
    "pull",
    "request",
    "description",
    "body",
    "draft",
    "open",
    "write",
    "fill",
    "summarize",
    "branch"
  ]
}