-
Notifications
You must be signed in to change notification settings - Fork 155
Add evals suite to pr-description-skill (genesis-driven) #985
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| # Ignore generated result files; track only the .gitkeep sentinel. | ||
| results/*.json |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,132 @@ | ||
| # pr-description-skill evals | ||
|
|
||
| This bundle answers two questions deterministically and without | ||
| requiring an LLM API key: | ||
|
|
||
| 1. **TRIGGER EVALS**: does the SKILL.md `description:` reliably | ||
| match real should-fire intents and avoid near-miss queries? | ||
| 2. **CONTENT EVALS**: does loading the SKILL.md body change the | ||
| shape of a PR description an agent produces, vs not loading | ||
| it? | ||
|
|
||
| ## Layout | ||
|
|
||
| ``` | ||
| evals/ | ||
| evals.json # top-level manifest (gates, keyword lists) | ||
| triggers.json # 18 trigger items (9 fire / 9 no-fire), | ||
| # ~60/40 train/val split | ||
| content/ | ||
| auth-refactor.json # cross-cutting refactor scenario + rubric | ||
| docs-only.json # docs-only PR scenario + rubric | ||
| dep-bump.json # mechanical dep bump scenario + rubric | ||
| fixtures/ | ||
| <id>__with_skill.md # representative output produced | ||
| # under the SKILL.md guidance | ||
| <id>__without_skill.md # representative output produced | ||
| # without the skill loaded | ||
| results/ | ||
| .gitkeep # tracked sentinel | ||
| <UTC-iso>.json # one file per run (gitignored) | ||
| README.md # this file | ||
| ``` | ||
|
|
||
| The runner script lives at | ||
| `.apm/skills/pr-description-skill/scripts/run_evals.py`. | ||
|
|
||
| ## Run | ||
|
|
||
| From the repo root (or this skill's directory): | ||
|
|
||
| ``` | ||
| python .apm/skills/pr-description-skill/scripts/run_evals.py | ||
| ``` | ||
|
|
||
| Common options: | ||
|
|
||
| | Flag | Effect | | ||
| |---|---| | ||
| | `--filter triggers` | Run only the trigger evals. | | ||
| | `--filter content` | Run only the content evals. | | ||
| | `--split train` | Score the train split for triggers (default is `val`, the ship gate). | | ||
| | `--split all` | Score both splits. | | ||
| | `--no-write` | Do not write to `evals/results/`. | | ||
| | `--quiet` | Suppress stderr diagnostics. | | ||
|
|
||
| Exit codes: | ||
|
|
||
| * `0` = all gates met | ||
| * `1` = one or more gates failed | ||
| * `2` = runner error (missing manifest, parse error, missing fixture) | ||
|
|
||
| ## Trigger eval scoring | ||
|
|
||
| The runner uses a deterministic dispatcher approximation defined in | ||
| `evals.json`: | ||
|
|
||
| 1. If any phrase from `stop_list` appears verbatim in the query | ||
| (lowercased), predict `no_fire`. | ||
| 2. Otherwise if any phrase from `trigger_keywords_primary` appears | ||
| verbatim, predict `fire`. | ||
| 3. Otherwise count distinct tokens from | ||
| `trigger_keywords_secondary`; predict `fire` iff at least 3 | ||
| distinct tokens AND one of `{pr, pull}` is present. | ||
|
|
||
| This is NOT a perfect proxy for an actual LLM dispatcher. It is | ||
| fast, deterministic, and CI-friendly. When `gh models` becomes | ||
| available, an `--llm` flag can be added that calls the real | ||
| dispatcher; the manifest schema already accommodates it. | ||
|
|
||
| Ship gate (validation split): | ||
|
|
||
| * should-fire correctness >= 0.5 | ||
| * should-not-fire correctness >= 0.5 (i.e. less than 50% of | ||
| near-miss queries leak through as `fire`) | ||
|
|
||
| ## Content eval scoring | ||
|
|
||
| Each scenario ships two fixtures: one representing the agent's | ||
| output WITH the skill loaded, one representing the same task | ||
| WITHOUT it. The same regex rubric scores both. The reported | ||
| `delta_anchors` is the count of rubric anchors that fire on | ||
| `with_skill` but not on `without_skill`. | ||
|
|
||
| Ship gate: every scenario has `delta_anchors >= 1`. A scenario | ||
| with zero delta is a signal that the skill is not adding | ||
| measurable value on that shape; the genesis discipline says | ||
| redesign or delete. (This runner only flags; deletion is a | ||
| human decision.) | ||
|
|
||
| ### Fixtures and the LLM-in-the-loop question | ||
|
|
||
| The fixtures are pre-recorded representative outputs. They are | ||
| NOT regenerated by an LLM at run time. Two reasons: | ||
|
|
||
| 1. **Determinism**: re-running the suite must produce identical | ||
| results so regressions are visible. | ||
| 2. **No required keys**: contributors should be able to run the | ||
| suite with only Python stdlib. | ||
|
|
||
| When a maintainer materially changes SKILL.md, the fixtures | ||
| SHOULD be regenerated by hand (or by running the agent once on | ||
| each scenario, with and without the skill loaded, and pasting | ||
| the outputs in). Document the regeneration in the PR that | ||
| changes SKILL.md. | ||
|
|
||
| ## Adding a new eval | ||
|
|
||
| * **New trigger eval**: append an entry to `triggers.json` with | ||
| a fresh `id`, the `query`, the `expected` outcome, and the | ||
| `split`. Keep the 60/40 ratio roughly intact. | ||
| * **New content scenario**: create `content/<id>.json` (mirror | ||
| one of the existing files), add `fixtures/<id>__with_skill.md` | ||
| and `fixtures/<id>__without_skill.md`, and append the path to | ||
| `content_manifests` in `evals.json`. | ||
|
|
||
| ## Encoding | ||
|
|
||
| All eval source files (Python, JSON, README) stay within | ||
| printable ASCII per the repo-wide encoding rule. Fixtures are | ||
| markdown intended to model PR-body output and MAY contain | ||
| Unicode, mirroring the SKILL.md "Output charset rule" -- but the | ||
| provided fixtures here remain ASCII for portability. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,43 @@ | ||
| { | ||
| "schema_version": 1, | ||
| "id": "auth-refactor", | ||
| "summary": "Cross-cutting auth refactor: token resolution moves into a single AuthResolver; multiple call sites updated; one breaking change to a public CLI flag. PR is non-trivial and exercises every required section.", | ||
| "scenario": { | ||
| "branch": "refactor/auth-resolver", | ||
| "base": "main", | ||
| "files_changed": [ | ||
| "M src/apm_cli/auth/__init__.py", | ||
| "A src/apm_cli/auth/resolver.py", | ||
| "M src/apm_cli/cli.py", | ||
| "M src/apm_cli/integration/git.py", | ||
| "M tests/unit/auth/test_resolver.py", | ||
| "M CHANGELOG.md" | ||
| ], | ||
| "commits": [ | ||
| "feat(auth): introduce AuthResolver as single source of token resolution", | ||
| "refactor(cli): replace per-host PAT lookup with AuthResolver.get(host)", | ||
| "fix(integration/git): drop legacy GITHUB_APM_PAT fallback path", | ||
| "docs(changelog): note breaking removal of --token-source flag" | ||
| ], | ||
| "linked_issue": "#812 -- token resolution scattered across modules; intermittent EMU lookups failing", | ||
| "validation_evidence": "uv run pytest tests/unit/auth -x -> 142 passed in 3.4s\napm audit --ci -> 0 findings" | ||
| }, | ||
| "rubric": [ | ||
| {"id": "tldr-present", "pattern": "(?im)^\\s*##?\\s*TL;DR\\b", "weight": 1, "description": "TL;DR section header present"}, | ||
| {"id": "tldr-short", "pattern": "(?is)##?\\s*TL;DR.{1,800}?\\n##?\\s", "weight": 1, "description": "TL;DR is short (under 800 chars before next H2)"}, | ||
| {"id": "problem-section", "pattern": "(?im)^\\s*##?\\s*Problem\\b", "weight": 1, "description": "Problem (WHY) section present"}, | ||
| {"id": "anchored-quote", "pattern": "\\[\"[^\"]{8,}\"\\]\\(https?://", "weight": 2, "description": "At least one verbatim quote anchored to a URL"}, | ||
| {"id": "approach-or-implementation", "pattern": "(?im)^\\s*##?\\s*(Approach|Implementation)\\b", "weight": 1, "description": "Approach or Implementation section present"}, | ||
| {"id": "mermaid-block", "pattern": "(?s)```mermaid[\\s\\S]+?```", "weight": 2, "description": "At least one mermaid diagram block"}, | ||
| {"id": "tradeoffs-section", "pattern": "(?im)^\\s*##?\\s*Trade-?offs?\\b", "weight": 1, "description": "Trade-offs section present"}, | ||
| {"id": "validation-section", "pattern": "(?im)^\\s*##?\\s*Validation\\b", "weight": 1, "description": "Validation section present"}, | ||
| {"id": "validation-real-output", "pattern": "(?im)pytest|apm audit|uv run", "weight": 1, "description": "Validation shows real command output"}, | ||
| {"id": "how-to-test", "pattern": "(?im)^\\s*##?\\s*How to test\\b", "weight": 1, "description": "How to test section present"}, | ||
| {"id": "trailer", "pattern": "Co-authored-by: Copilot <223556219", "weight": 1, "description": "Copilot co-author trailer present"}, | ||
| {"id": "no-marketing-tone", "pattern": "(?i)significantly enhances|best-in-class|game-changing|revolutionary", "weight": -2, "description": "PENALTY: marketing tone detected"} | ||
| ], | ||
| "fixtures": { | ||
| "with_skill": "../fixtures/auth-refactor__with_skill.md", | ||
| "without_skill": "../fixtures/auth-refactor__without_skill.md" | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| { | ||
| "schema_version": 1, | ||
| "id": "dep-bump", | ||
| "summary": "Mechanical dependency upgrade: bump click from 8.1.7 to 8.2.1; lockfile-only changes plus a 1-line type annotation tweak. Tests Trade-offs may be 1-2 bullets and mermaid is optional.", | ||
| "scenario": { | ||
| "branch": "deps/click-8.2.1", | ||
| "base": "main", | ||
| "files_changed": [ | ||
| "M pyproject.toml", | ||
| "M uv.lock", | ||
| "M src/apm_cli/cli.py" | ||
| ], | ||
| "commits": [ | ||
| "chore(deps): bump click 8.1.7 -> 8.2.1", | ||
| "chore(cli): adjust type hint for new click.Context generic" | ||
| ], | ||
| "linked_issue": "(none)", | ||
| "validation_evidence": "uv sync --extra dev -> resolved in 2.1s\nuv run pytest tests/unit -x -> 2418 passed in 38.7s" | ||
| }, | ||
| "rubric": [ | ||
| {"id": "tldr-present", "pattern": "(?im)^\\s*##?\\s*TL;DR\\b", "weight": 1, "description": "TL;DR present"}, | ||
| {"id": "tldr-short", "pattern": "(?is)##?\\s*TL;DR.{1,500}?\\n##?\\s", "weight": 1, "description": "TL;DR very short on a mechanical PR"}, | ||
| {"id": "problem-section", "pattern": "(?im)^\\s*##?\\s*Problem\\b", "weight": 1, "description": "Problem section present"}, | ||
| {"id": "implementation-section", "pattern": "(?im)^\\s*##?\\s*Implementation\\b", "weight": 1, "description": "Implementation section present"}, | ||
| {"id": "tradeoffs-section", "pattern": "(?im)^\\s*##?\\s*Trade-?offs?\\b", "weight": 1, "description": "Trade-offs section present (may be 1-2 bullets)"}, | ||
| {"id": "validation-section", "pattern": "(?im)^\\s*##?\\s*Validation\\b", "weight": 1, "description": "Validation section present"}, | ||
| {"id": "validation-real-output", "pattern": "(?im)pytest|uv sync|2418 passed", "weight": 1, "description": "Real validation output (pytest counts) shown"}, | ||
| {"id": "how-to-test", "pattern": "(?im)^\\s*##?\\s*How to test\\b", "weight": 1, "description": "How to test section present"}, | ||
| {"id": "trailer", "pattern": "Co-authored-by: Copilot <223556219", "weight": 1, "description": "Copilot co-author trailer present"}, | ||
| {"id": "no-marketing-tone", "pattern": "(?i)significantly enhances|best-in-class|game-changing", "weight": -2, "description": "PENALTY: marketing tone detected"}, | ||
| {"id": "no-diff-restate", "pattern": "(?im)^[+-]{3} ", "weight": -1, "description": "PENALTY: diff lines restated in body"} | ||
| ], | ||
| "fixtures": { | ||
| "with_skill": "../fixtures/dep-bump__with_skill.md", | ||
| "without_skill": "../fixtures/dep-bump__without_skill.md" | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,34 @@ | ||
| { | ||
| "schema_version": 1, | ||
| "id": "docs-only", | ||
| "summary": "Markdown-only PR adding a quickstart guide. The skill must keep TL;DR, Problem, Validation, How-to-test sections even though the PR is 'trivial' (per SKILL.md gotcha).", | ||
| "scenario": { | ||
| "branch": "docs/quickstart", | ||
| "base": "main", | ||
| "files_changed": [ | ||
| "A docs/src/content/docs/guides/quickstart.md", | ||
| "M docs/src/content/docs/index.md" | ||
| ], | ||
| "commits": [ | ||
| "docs(quickstart): add 5-minute getting-started guide", | ||
| "docs(index): link quickstart from landing page" | ||
| ], | ||
| "linked_issue": "#640 -- new contributors bounce after install; need a 5-minute happy path", | ||
| "validation_evidence": "npm run --prefix docs build -> built in 4.2s, no broken links\napm audit --ci -> 0 findings" | ||
| }, | ||
| "rubric": [ | ||
| {"id": "tldr-present", "pattern": "(?im)^\\s*##?\\s*TL;DR\\b", "weight": 1, "description": "TL;DR present even on docs-only PR"}, | ||
| {"id": "problem-section", "pattern": "(?im)^\\s*##?\\s*Problem\\b", "weight": 1, "description": "Problem section present even on docs-only PR (per SKILL.md gotcha)"}, | ||
| {"id": "anchored-quote", "pattern": "\\[\"[^\"]{8,}\"\\]\\(https?://", "weight": 2, "description": "At least one verbatim anchored quote"}, | ||
| {"id": "validation-section", "pattern": "(?im)^\\s*##?\\s*Validation\\b", "weight": 1, "description": "Validation section present"}, | ||
| {"id": "validation-build", "pattern": "(?im)npm run|build|broken link", "weight": 1, "description": "Real validation evidence (build/link check) shown"}, | ||
| {"id": "how-to-test", "pattern": "(?im)^\\s*##?\\s*How to test\\b", "weight": 1, "description": "How to test present"}, | ||
| {"id": "trailer", "pattern": "Co-authored-by: Copilot <223556219", "weight": 1, "description": "Copilot co-author trailer present"}, | ||
| {"id": "no-trivial-skip", "pattern": "(?i)the PR is trivial|too small to need", "weight": -2, "description": "PENALTY: skipping sections because PR is small"}, | ||
| {"id": "no-marketing-tone", "pattern": "(?i)significantly enhances|best-in-class|game-changing", "weight": -2, "description": "PENALTY: marketing tone detected"} | ||
| ], | ||
| "fixtures": { | ||
| "with_skill": "../fixtures/docs-only__with_skill.md", | ||
| "without_skill": "../fixtures/docs-only__without_skill.md" | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,62 @@ | ||||||
| { | ||||||
| "schema_version": 1, | ||||||
| "skill": "pr-description-skill", | ||||||
| "skill_path": "../SKILL.md", | ||||||
| "triggers_manifest": "triggers.json", | ||||||
| "content_manifests": [ | ||||||
| "content/auth-refactor.json", | ||||||
| "content/docs-only.json", | ||||||
| "content/dep-bump.json" | ||||||
| ], | ||||||
| "gates": { | ||||||
| "triggers": { | ||||||
| "split": "val", | ||||||
| "should_fire_rate_min": 0.5, | ||||||
| "should_not_fire_rate_max": 0.5 | ||||||
|
||||||
| "should_not_fire_rate_max": 0.5 | |
| "should_not_fire_miss_rate_max": 0.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The README says fixtures "MAY contain Unicode". These fixture files are tracked source files in the repo (and under
.github/in the regenerated mirror), so they still fall under the repo-wide ASCII-only encoding rule (see.github/instructions/encoding.instructions.md). Please update this section to avoid encouraging future non-ASCII fixture content.