[prompt-clustering] Copilot Agent Prompt Clustering Analysis — 2026-03-17 #21391
Closed
Replies: 2 comments
🤖 Beep boop! The smoke test agent was here! Running diagnostics on your discussion while simultaneously questioning the meaning of existence. All systems nominal. Carry on, humans! 🚀
This discussion has been marked as outdated by Copilot Agent Prompt Clustering Analysis. A newer discussion is available at Discussion #21587.
Daily NLP-based clustering analysis of Copilot agent task prompts across the last 30 days. TF-IDF vectorization + K-means clustering (k=6, selected by silhouette score) applied to 2,551 PRs with valid extracted prompts.
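A minimal sketch of the pipeline described above (TF-IDF vectorization, K-means, k chosen by silhouette score). The toy `prompts` corpus is a stand-in for the 2,551 extracted PR prompts; the actual analysis lives in `/tmp/gh-aw/analyze-prompts.py`, and the parameter choices here are illustrative assumptions, not the script's real settings.

```python
# Sketch: TF-IDF + K-means with k selected by silhouette score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy corpus standing in for the real extracted prompts.
prompts = [
    "update github actions workflow to add agent review step",
    "resolve issue: fix safe outputs handler for artifact logging",
    "analyze failing job logs, identify root cause, implement fix",
] * 10

# Unigrams and bigrams so phrases like "safe outputs" survive as terms.
vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vec.fit_transform(prompts)

# Pick k by silhouette score over a small candidate range.
best_k, best_score = None, -1.0
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

print(f"selected k={best_k} (silhouette={best_score:.3f})")
```

On the toy corpus this selects k=3, one cluster per distinct prompt; on the real corpus the same loop over a wider range would land on k=6 as reported.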
Summary
Cluster Overview

| Cluster | Theme | PRs | Merge rate |
|---|---|---|---|
| A | Workflow & Feature Development | 903 | 75% |
| B | Issue Resolution via `gh aw` | 892 | 65% |
| C | Safe Outputs Development | 269 | 80% |
| D | Agentic Workflow Management | 210 | 77% |
| E | Open-ended Copilot Tasks | 169 | 62% |
| F | CI Failure Diagnosis & Fixes | 108 | 77% |

Cluster Details — Sample Prompts & Characteristics
A · Workflow & Feature Development (903 PRs, 75% merge)
The largest cluster by volume. Prompts reference GitHub Actions workflows, feature additions, and `review`/`remove` operations. Tasks are often broad feature-implementation and refactoring requests referencing specific workflow files.

Top terms: `reference`, `update`, `github`, `workflow`, `add`, `agent`, `review`, `remove`
Example PRs: #11968, #11357, #16020
Sample prompt:
B · Issue Resolution via `gh aw` (892 PRs, 65% merge)

Tied for the largest cluster. Prompts follow a structured template referencing a GitHub issue body (`(issue_title)`, `(issue_description)`, and `(issue_number)` sections). The agent is asked to resolve issues from the tracker. The lower merge rate (65%) reflects that issue-driven tasks involve more ambiguity and back-and-forth.

Top terms: `issue`, `section`, `workflow`, `details`, `gh`, `aw`, `gh aw`, `resolve`
Example PRs: #13576, #11980, #16277
Sample prompt:
C · Safe Outputs Development (269 PRs, 80% merge) ★ Highest merge rate
Prompts are highly focused and specific — describing precise safe-output handler behaviour, MCP server configuration, and artifact logging. The specificity of these prompts correlates with the highest merge rate (80%) in the dataset.
Top terms: `safe`, `safe outputs`, `outputs`, `safe output`, `output`, `issue`, `agent`, `handler`
Example PRs: #16842, #11120, #18989
Sample prompt:
D · Agentic Workflow Management (210 PRs, 77% merge)
Prompts specifically mention agentic workflows, `.md` workflow files, and engine/compiler concepts. These are configuration-heavy tasks: updating templates, managing workflow lifecycle, and changing compiler behaviour.

Top terms: `agentic`, `agentic workflows`, `workflows`, `workflow`, `md`, `update`, `create`, `engine`
Example PRs: #11360, #13996, #11299
Sample prompt:
E · Open-ended Copilot Tasks (169 PRs, 62% merge) ★ Lowest merge rate
Prompts are the least specific, often just generic task handoffs ("Thanks for asking me to work on this…") or vague feature requests without clear acceptance criteria. The lowest merge rate (62%) signals that underspecified prompts produce lower-quality outcomes.

Top terms: `coding agent`, `coding`, `copilot coding`, `agent`, `copilot`, `set`, `work`, `input`
Example PRs: #20143, #15195, #18028
F · CI Failure Diagnosis & Fixes (108 PRs, 77% merge)
Prompts follow a structured CI-doctor template: provide a Job ID and URL, ask the agent to analyze workflow logs, identify root cause, and implement a fix. Smaller in volume but high success rate.
Top terms: `job`, `fix`, `analyze`, `identify`, `failing`, `implement`, `id`, `url`
Example PRs: #18876, #13949, #13592
Sample prompt:
Full PR Data Table (top 200 by recency)
(Table of representative PRs with prompt excerpts not recovered.) Full table available in clustering-report.md. Showing 26 representative PRs.
Key Findings
- **Two dominant task types account for 70% of all work:** Workflow & Feature Development (35%, 75% merge) and Issue Resolution via `gh aw` (35%, 65% merge). The 10 pp gap in merge rate between these two similarly sized clusters is the biggest actionable signal.
- **Prompt specificity predicts success:** Safe Outputs Development (80% merge) uses detailed, technically precise prompts; Open-ended Copilot Tasks (62% merge) uses vague or boilerplate prompts. This is a clear correlation between prompt specificity and outcome.
- **CI Failure Doctor template is effective:** The structured "Job ID + Job URL + fix instructions" template achieves a 77% merge rate with only 2.9 avg commits, the most efficient pattern in the dataset.
- **Issue-driven tasks are the weakest link:** The 892 PRs in Cluster B have the second-lowest merge rate (65%) despite being the most numerous. The issue-template format introduces ambiguity that the agent struggles with.
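The CI Failure Doctor template described in Cluster F might look roughly like the following. This is an illustrative reconstruction from the description above (Job ID, Job URL, analyze/identify/fix instructions), not the actual template text used in the repository:

```text
Job ID: <failing job id>
Job URL: <link to the failing workflow run>

Analyze the workflow logs for this failing job, identify the root
cause of the failure, and implement a fix in this repository.
```

The value of a template like this is that every field is concrete and verifiable, which matches the finding that specific prompts merge at higher rates.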
Recommendations
- **Improve issue-template prompts (Cluster B, 65% merge):** Add explicit acceptance criteria, expected file paths, and test requirements to issue bodies before dispatching to the agent. This alone could move ~60 PRs from closed/unmerged to merged per 30-day cycle.
- **Adopt the CI Failure Doctor pattern more broadly:** The structured Job ID + analysis template (Cluster F) achieves strong results efficiently. Consider adapting this template for other diagnostic tasks.
- **Audit open-ended Copilot tasks:** Review the 169 PRs in Cluster E before dispatching. Require a minimum prompt length and a checklist of acceptance criteria to avoid vague handoffs that waste agent cycles.
- **Safe Outputs is a model cluster:** Review the prompt patterns in Cluster C as examples of how to write agent tasks — specific output formats, file paths, and behavioral contracts drive the 80% merge rate.
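A quick back-of-envelope check of the "~60 PRs" estimate in the first recommendation, assuming Cluster B's 30-day volume stays flat and its merge rate improves by about 7 pp (closing most of the 10 pp gap to Cluster A):

```python
# Cluster B: 892 PRs at 65% merge. An assumed uplift of 7 pp
# (65% to 72%) yields the report's ~60 additional merged PRs.
cluster_b_prs = 892
uplift = 0.07  # assumed merge-rate improvement
extra_merged = round(cluster_b_prs * uplift)
print(extra_merged)  # 62
```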
References:
- `/tmp/gh-aw/analyze-prompts.py`
- `/tmp/gh-aw/pr-data/clustering-results.json`