ci-analysis skill: restore domain examples from eval regression analysis #124416

Merged
lewing merged 2 commits into dotnet:main from lewing:ci-analysis-eval-regression-fix
Feb 14, 2026

@lewing lewing commented Feb 14, 2026

Summary

Waza eval progression testing (16 runs across 4 skill versions, then 12 more runs validating fixes) revealed the tool-agnostic refactor (#124398) caused a 68% regression in tool calls (25→42) for the build progression task while other tasks were stable or improved.

Root cause: Three domain-specific examples were misclassified as tool schema restatements and removed. They are domain knowledge the agent cannot infer from tool descriptions. The table lists the three restored examples, plus a stop signal added alongside them:

| Restored | Why it matters |
| --- | --- |
| `refs/pull/{PR}/merge` branch pattern + AzDO query params | Agent can't infer the PR merge branch ref format |
| `gh api .../git/commits/{sha} --jq '.parents[0].sha'` + `get_commit` MCP alternative | Merge commit parent extraction pattern |
| log ID 5, line 500+ hints with emphasis | Magic numbers for checkout log location |
| Stop signal in Step 4 | Agent was doing 5-7 extra tool calls after having enough data |
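
As a rough illustration of the first two rows, the sketch below builds the PR merge ref and mirrors the `.parents[0].sha` jq filter in Python. The sample payload is hypothetical and trimmed to the one field the filter reads, and the SHAs are made up; the function names are illustrative and not part of the skill.

```python
import json


def pr_merge_ref(pr_number: int) -> str:
    """Return the merge branch ref for a PR, following refs/pull/{PR}/merge."""
    return f"refs/pull/{pr_number}/merge"


def first_parent_sha(commit_json: str) -> str:
    """Python equivalent of the jq filter '.parents[0].sha'."""
    commit = json.loads(commit_json)
    return commit["parents"][0]["sha"]


# Hypothetical, trimmed commit payload; real responses carry more fields.
sample = json.dumps({
    "sha": "mergecommit000",
    "parents": [
        {"sha": "targethead111"},  # first parent
        {"sha": "prhead222"},      # second parent
    ],
})

print(pr_merge_ref(124416))
print(first_parent_sha(sample))
```

The ref format and the parent index are exactly the details a tool schema does not describe, which is why the PR treats them as domain examples.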

Eval Results (Tool Calls)

Each fix was validated independently with a full 4-task eval run:

| Version | Build Progression | CI Status | Helix | Retry | Total |
| --- | --- | --- | --- | --- | --- |
| 8fdec1f (pre-refactor best) | 25 | 5 | 20 | 5 | 55 |
| 833041c (tool-agnostic, #124398) | 42 📉 | 10 📉 | 21 | 4 | 77 |
| 75f75d3 (restore domain examples) | 36 | 7 | 24 | 4 | 71 |
| a9f5140 (+branch ref hints) | 32 | 6 | 26 | 4 | 68 |
| aa4193b (this PR) | 25 | 6 | 14 | 4 | 49 |
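
The totals and the 68% figure can be re-derived from the per-task counts. This is a small sketch using only the numbers in the table; the variable and function names are illustrative.

```python
# Tool-call counts per task, copied from the table above.
# Column order: Build Progression, CI Status, Helix, Retry.
RUNS = {
    "8fdec1f": (25, 5, 20, 5),
    "833041c": (42, 10, 21, 4),
    "75f75d3": (36, 7, 24, 4),
    "a9f5140": (32, 6, 26, 4),
    "aa4193b": (25, 6, 14, 4),
}


def percent_change(before: int, after: int) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


# Sum each row to get the Total column.
totals = {version: sum(counts) for version, counts in RUNS.items()}

# Build Progression regression introduced by the tool-agnostic refactor.
regression = percent_change(RUNS["8fdec1f"][0], RUNS["833041c"][0])

for version, total in totals.items():
    print(version, total)
print(f"Build Progression regression: {regression:.0f}%")
```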

Eval Results (Duration)

| Version | Build Progression | CI Status | Helix | Retry | Total |
| --- | --- | --- | --- | --- | --- |
| 8fdec1f (pre-refactor best) | 5m20s | 1m28s | 3m13s | 1m03s | 11m04s |
| 833041c (tool-agnostic, #124398) | 7m38s | 1m38s | 3m34s | 0m56s | 13m45s |
| aa4193b (this PR) | 4m25s | 1m08s | 2m28s | 0m45s | 8m46s |
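
To compare durations numerically, the `XmYYs` strings can be converted to seconds. The sketch below does that for the two Total values being compared (pre-refactor best and this PR); `to_seconds` is an illustrative helper, not part of the eval tooling.

```python
import re


def to_seconds(text: str) -> int:
    """Convert a duration like '11m04s' to whole seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


baseline = to_seconds("11m04s")  # 8fdec1f total
final = to_seconds("8m46s")      # aa4193b total
saved = baseline - final

print(f"{saved}s saved ({saved / baseline:.1%} less wall time)")
```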

The final version matches the pre-refactor best on Build Progression (25 tool calls) and beats it overall (49 vs 55 total tool calls, 8m46s vs 11m04s).

Key Insight

Simple tasks (retry, CI status) benefit from less prescriptive guidance — retry improved from 5 to 4 calls. Complex multi-step tasks (build progression) need specific domain examples showing branch ref patterns, field names, and log locations. The rule of thumb: if removing an example leaves the agent unable to accomplish the task efficiently AND the information isn't in any tool description, it's a domain example — keep it.
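
The rule of thumb reads naturally as a two-condition check. The sketch below is only one way to write it down; the `Example` type and its field names are hypothetical and do not appear in the skill files.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """A candidate example in a skill document (hypothetical model)."""
    name: str
    needed_for_efficiency: bool  # removing it leaves the agent inefficient
    in_tool_description: bool    # some tool description already states it


def is_domain_example(example: Example) -> bool:
    """Keep an example only if it is needed AND no tool description covers it."""
    return example.needed_for_efficiency and not example.in_tool_description


candidates = [
    Example("refs/pull/{PR}/merge pattern", True, False),
    Example("restated tool parameter list", False, True),
]

for candidate in candidates:
    verdict = "keep" if is_domain_example(candidate) else "remove"
    print(f"{candidate.name}: {verdict}")
```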

Changes

  • build-progression-analysis.md: Restore key parameters, merge parent extraction example, checkout log hints, stop signal
  • delegation-patterns.md: ALL CAPS emphasis on log ID/line hints in subagent template
  • SKILL.md: Mention refs/pull/{PR}/merge in step 1

…ogression

Waza eval progression testing (16 runs across 4 skill versions) revealed
the tool-agnostic refactor (dotnet#124398) caused a 68% regression in tool
calls (25→42) for the build progression task. Root cause: domain-specific
examples were incorrectly classified as tool schema restatements.

Changes:
- build-progression-analysis.md: restore key AzDO query parameters
  (branchName, queryOrder, top, project) as inline hints
- build-progression-analysis.md: restore gh api merge parent extraction
  example and mention get_commit MCP alternative
- build-progression-analysis.md: restore logId:5 / startLine:500 hints
  with bold emphasis for checkout log extraction
- build-progression-analysis.md: add stop signal — present findings when
  the progression table and transition are identified
- delegation-patterns.md: add bold emphasis on log ID/line hints in
  subagent prompt template
- SKILL.md: mention refs/pull/{PR}/merge branch pattern in step 1

These are domain examples (branch ref formats, field names, log locations,
jq expressions) that agents cannot infer from tool descriptions alone.
Simple tasks (retry) still benefit from less prescriptive guidance.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings, February 14, 2026 01:48
github-actions bot added the needs-area-label label (an area label is needed to ensure this gets routed to the appropriate area owners), Feb 14, 2026
lewing added the area-skills (Agent Skills) label and removed the needs-area-label label, Feb 14, 2026
Copilot AI left a comment

Pull request overview

Restores domain-specific guidance in the ci-analysis skill docs that was removed during the tool-agnostic refactor, aiming to reduce unnecessary tool calls and improve efficiency for complex build progression investigations.

Changes:

  • Reintroduces AzDO build query specifics for refs/pull/{PR}/merge (project/ordering/top) and clarifies where pr.sourceSha lives.
  • Restores a concrete merge-parent extraction example for obtaining target branch HEAD from the merge commit.
  • Reinforces “checkout log ID 5 / line 500+” hints and adds an explicit stop signal once the progression table + transition are identified.
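
The restored hints can be pictured as two small parameter sets, using the names listed in the commit message (`branchName`, `queryOrder`, `top`, `project`, `logId`, `startLine`). This is a sketch under assumptions: the `queryOrder` value, the default `top`, and the project name are placeholders chosen for illustration, and the exact parameter shape depends on the tool being called.

```python
def build_list_query(pr_number: int, project: str, top: int = 20) -> dict:
    """Assemble build-listing parameters for a PR's merge ref (illustrative)."""
    return {
        "project": project,
        "branchName": f"refs/pull/{pr_number}/merge",
        "queryOrder": "queueTimeDescending",  # assumed ordering, not stated in the PR
        "top": top,                           # assumed default
    }


# Checkout log location hints restored by the PR: log ID 5, line 500 onward.
CHECKOUT_LOG_HINT = {"logId": 5, "startLine": 500}

query = build_list_query(124416, project="example-project")  # placeholder project
print(query)
print(CHECKOUT_LOG_HINT)
```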

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| .github/skills/ci-analysis/references/delegation-patterns.md | Adds emphasis to the checkout-log hint in the subagent delegation template. |
| .github/skills/ci-analysis/references/build-progression-analysis.md | Restores key domain examples/parameters for PR build listing, merge-parent extraction, checkout-log extraction, and stopping criteria. |
| .github/skills/ci-analysis/SKILL.md | Updates PR analysis mode description to reference querying AzDO builds on the PR merge ref for full history. |

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@lewing lewing requested a review from steveisok February 14, 2026 01:55
@lewing lewing merged commit 50e7fbb into dotnet:main Feb 14, 2026
18 checks passed
@lewing lewing deleted the ci-analysis-eval-regression-fix branch February 14, 2026 17:12