Date range analyzed: 2026-04-27 (single-day snapshot, all 138 runs occurred today)
Repository: github/gh-aw
Executive Summary
138 runs across 68 workflows completed today with no escalation-eligible episodes and zero MCP failures or blocked-request episodes at the episode level. The portfolio is broadly healthy. The single operationally notable signal is a transient blocked_requests_increase classification on one AI Moderator run (since resolved to stable on the next run). Cost is highly concentrated: three workflows — [aw] Failure Investigator (6h), Schema Consistency Checker, and Documentation Unbloat — account for $11.02 of the total $13.58 billed (all via Anthropic/Claude; Copilot-engine costs are $0). The primary portfolio observation is that the repository runs a very broad set of workflows (68), many in the general_automation and issue_response domains, with exploratory execution style and high token volume that is not always justified by the domain. The graph lineage is entirely flat (0 edges), meaning no DAG orchestration is being detected; all 138 episodes are standalone.
Observability note: All 104 unknown-engine runs report $0 cost because token billing data is absent for those runs. The total cost figure of $13.58 reflects only the 7 Claude Code runs and a portion of the 5 Codex runs. Effective tokens are therefore used as the primary efficiency proxy for the majority of workflows.
Key Metrics
| Metric | Value |
| --- | --- |
| Date range | 2026-04-27 |
| Workflows analyzed | 68 |
| Runs analyzed | 138 |
| Episodes analyzed | 138 (1:1 with runs; 0 DAG edges) |
| High-confidence episodes | 137 |
| Escalation-eligible episodes | 0 |
| Runs classified risky | 1 (AI Moderator, transient) |
| Runs with medium/high severity assessments | 2 (Agent Persona Explorer, Layout Specification Maintainer — poor_control_node_count=1 each) |
| MCP failure count (all episodes) | 0 |
| Blocked-request count (all episodes) | 0 |
| Total estimated cost (Anthropic/Codex only) | $13.58 |
| Total effective tokens | 15.1M |
| Total action minutes | 382 min |
| Engine breakdown | Copilot CLI: 22, Claude Code: 7, Codex: 5, unknown: 104 |
| Workflows with overkill_for_agentic pattern | ~17 (exploratory style in triage/issue_response/general_automation domains) |
| Workflows with repeated latest_success fallback | 0 |
Highest Risk Episodes
No episodes are escalation-eligible. The single risk signal is:
AI Moderator — 1 of 5 runs today classified risky (blocked_requests_increase). The following run reverted to stable (cohort_match baseline). This is a transient fluctuation, not a regression pattern. No action required.
Two episodes show poor_control_node_count = 1:
Agent Persona Explorer (1 run, exploratory, issue_response domain)
Layout Specification Maintainer (1 run, exploratory, repo_maintenance domain)
Both are isolated occurrences. Neither crosses the 14-day escalation threshold.
Episode Regressions
No repeated regressions detected. The 138-episode sample is a single day, limiting regression visibility. Key observations:
Visual Regression Checker: 4+1 errors across 2 runs today — highest error rate in the repository. Likely an environment or dependency issue rather than an agentic control regression.
Smoke CI: 4 errors in 1 run. Consistent with known infrastructure smoke failures.
[aw] Failure Investigator (6h): 1 error in one run, high token/cost profile. Functioning as designed (long-running research agent).
Visual Diagnostics
1. Episode Risk-Cost Frontier
Decision: Schema Consistency Checker and [aw] Failure Investigator dominate the token frontier with zero risk signal — high cost but justified for their research/validation domains.
Why it matters: The frontier reveals no workflows combining both high cost AND high risk, which is the healthiest possible shape. Cost optimization (not risk mitigation) is the primary lever available. Note: Copilot-engine effective tokens appear as $0 in billing but do consume quota.
2. Workflow Stability Matrix
Decision: AI Moderator is the only repeat offender on risky_run_rate; the matrix is otherwise uniformly clean, indicating no chronic control problems across the portfolio.
Why it matters: The repository does not have broad instability — it has one workflow with a transient signal and two with isolated poor-control events. The dominant instability driver is risky_run_rate concentrated in AI Moderator, which self-corrected.
3. Repository Portfolio Map
Decision: High-token workflows (Schema Consistency Checker, Failure Investigator, Documentation Unbloat, Go Fan) belong in optimize; the large cluster of low-token, high-frequency workflows belongs in keep; smoke/test workflows belong in simplify.
Why it matters: The repository has a healthy core (keep quadrant) carrying most of the run volume at low cost, with a small set of expensive-but-valuable research agents in optimize. The review quadrant contains candidates for right-sizing or deterministic replacement.
4. Workflow Overlap Matrix
Decision: Contribution Check and Schema Consistency Checker show moderate overlap with each other and with [aw] Failure Investigator via shared general_automation/exploratory behavior cluster — worth reviewing for potential consolidation or pre-step extraction.
Why it matters: The overlap is behavior-cluster-based, not confirmed by workflow definitions. It is suggestive rather than conclusive. Consolidation would require confirming trigger and scope alignment.
Portfolio Opportunities
Note: Domain confidence is moderate — 56/138 runs (41%) fall into general_automation, and 54/138 (39%) into issue_response, suggesting the domain classifier is collapsing diverse workflows. Portfolio comparisons below use behavior clusters as a fallback grouping.
Optimize (high-token, high-value — consider right-sizing):
[aw] Failure Investigator (6h) — $4.59, 6.9M tokens, exploratory/research. High value for its domain; 2 runs today at full depth. Consider whether a tighter tool scope or pre-summarization step could reduce tokens without losing coverage.
Schema Consistency Checker — $3.69, 8.1M tokens, single run, exploratory/general_automation. At 8M tokens for one run this is the most token-intensive workflow. Evaluate whether a deterministic schema diff pre-step could reduce agent scope.
Documentation Unbloat — $2.73, 4.8M tokens, 3 runs, directed/issue_response. Consistent multi-run use; cost is justified if documentation quality is tracked. One error run today worth monitoring.
Simplify / deterministic candidates (lean + directed + narrow domain):
/cloclo, Scout, Q, Archie — 9-10 runs each, 0 tokens (Copilot engine, no billing data), low action minutes, directed style. These run frequently at low overhead and appear narrow-scope. If they are reading/aggregating, they may be partially reducible to deterministic pre-steps.
Auto-Triage Issues — issue_response domain, 11 action minutes, directed. Strong candidate for deterministic label matching + deterministic routing with a small model for edge cases.
Review (potentially overlapping or weakly justified):
Daily CLI Performance Agent and Daily CLI Tools Exploratory Tester — both daily scheduled, general_automation, exploratory style, high token count. Overlap in name family and schedule family; worth confirming whether they cover distinct dimensions or could be merged.
Smoke * family (8 workflows) — consistent, narrow, deterministic-grade tasks wrapped in agent shells. These are likely infrastructure tests; if they only check pass/fail they may not need agentic execution.
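To illustrate the "deterministic label matching + small model for edge cases" pattern suggested for Auto-Triage Issues above, a pre-step could evaluate a rule table before any model call and escalate only unmatched issues. This is a hypothetical sketch, not gh-aw's actual triage logic; the `RULES` patterns and the `triage` function are invented for illustration:

```python
import re

# Hypothetical rule table: regex pattern -> label. In a real setup the
# rules would live in the repository and be reviewed like any config.
RULES = [
    (re.compile(r"\bpanic\b|\bstack trace\b", re.I), "bug"),
    (re.compile(r"\bdocs?\b|\breadme\b", re.I), "documentation"),
    (re.compile(r"\bfeature request\b|\bwould be nice\b", re.I), "enhancement"),
]

def triage(title: str, body: str = "") -> list[str]:
    """Return deterministic labels for an issue.

    An empty result means no rule matched; only those issues would be
    escalated to a small model, keeping agentic spend on edge cases.
    """
    text = f"{title}\n{body}"
    labels = {label for pattern, label in RULES if pattern.search(text)}
    return sorted(labels)
```

Routing on `triage(...) == []` keeps the common cases entirely deterministic.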
View full workflow inventory (all 68 workflows)
| Workflow | Runs | Tokens | Cost | Domain | Style | Errors |
| --- | --- | --- | --- | --- | --- | --- |
| [aw] Failure Investigator (6h) | 2 | 6,895,082 | $4.59 | research | exploratory | 1 |
| Schema Consistency Checker | 1 | 8,121,019 | $3.69 | general_automation | exploratory | 0 |
| Documentation Unbloat | 3 | 4,838,793 | $2.73 | issue_response | directed | 1 |
| Go Fan | 1 | 2,009,806 | $1.60 | general_automation | exploratory | 0 |
| Contribution Check | 2 | 2,112,884 | $0 | general_automation | exploratory | 0 |
| Layout Specification Maintainer | 1 | 1,574,855 | $0 | repo_maintenance | exploratory | 0 |
| Copilot PR Prompt Pattern Analysis | 1 | 981,721 | $0 | research | exploratory | 0 |
| jsweep - JavaScript Unbloater | 1 | 974,246 | $0 | code_fix | exploratory | 0 |
| Daily CLI Performance Agent | 1 | 759,592 | $0 | general_automation | exploratory | 0 |
| Daily CLI Tools Exploratory Tester | 1 | 721,404 | $0 | general_automation | exploratory | 1 |
| Agent Persona Explorer | 1 | 660,888 | $0 | issue_response | exploratory | 0 |
| GPL Dependency Cleaner (gpclean) | 1 | 656,472 | $0 | general_automation | exploratory | 0 |
| CLI Version Checker | 1 | 454,395 | $0.57 | general_automation | exploratory | 0 |
| Agent Performance Analyzer | 1 | 447,592 | $0 | research | adaptive | 0 |
| Code Simplifier | 1 | 427,198 | $0 | code_fix | exploratory | 0 |
| Test Quality Sentinel | 1 | 176,802 | $0 | general_automation | directed | 0 |
| Issue Monster | 6 | 224,646 | $0 | issue_response | directed | 0 |
| AI Moderator | 5 | 0 | $0 | general_automation | directed | 0 |
| Design Decision Gate 🏗️ | 2 | 242,741 | $0.39 | general_automation | directed | 0 |
| Scout | 9 | 0 | $0 | issue_response | directed | 0 |
| /cloclo | 10 | 0 | $0 | issue_response | directed | 0 |
| Q | 9 | 0 | $0 | issue_response | directed | 0 |
| Archie | 8 | 0 | $0 | issue_response | directed | 0 |
| (remaining 45 workflows) | varied | varied | $0 | varied | directed | varied |
Recommended Actions
No escalation required. Zero episodes are escalation-eligible. The AI Moderator blocked_requests_increase signal self-corrected.
Investigate Visual Regression Checker errors (4 errors today, 2 runs). Not an agentic control problem — likely an environment dependency. Run a targeted audit if errors persist tomorrow.
Review Schema Consistency Checker token footprint. At 8.1M tokens for a single run, this is the highest per-run token cost in the repository. A deterministic schema-diff pre-step could significantly reduce agent scope.
Evaluate Daily CLI Performance Agent and Daily CLI Tools Exploratory Tester overlap. Both are daily, exploratory, general_automation. Confirm they cover distinct dimensions before the next schedule cycle.
Smoke family right-sizing. 8 Smoke workflows running at low token/low error rates. If they are pure pass/fail infrastructure checks, consider replacing agentic execution with deterministic CI steps.
Add DAG lineage instrumentation. 0 edges detected means no orchestrator→worker relationships are being captured. If any workflows delegate to others (e.g., Failure Investigator spawning sub-agents), enabling lineage tracking will improve future observability.
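The deterministic schema-diff pre-step recommended for Schema Consistency Checker could be sketched as below, so the agent receives only the changed keys rather than two full schemas. This is a hypothetical illustration, not existing gh-aw tooling; `diff_schemas` and the dot-path flattening are assumptions, and real JSON Schemas would need richer handling (arrays, `$ref` resolution):

```python
import json

def diff_schemas(old_path: str, new_path: str) -> dict:
    """Compare two JSON files field-by-field and report differences.

    Treats nested dicts as dot-separated paths; any non-dict value
    (including lists) is treated as a leaf for simplicity.
    """
    with open(old_path) as f:
        old = json.load(f)
    with open(new_path) as f:
        new = json.load(f)

    def flatten(obj, prefix=""):
        # Collapse nested dicts into {"a.b.c": leaf_value, ...}
        items = {}
        if isinstance(obj, dict):
            for key, value in obj.items():
                items.update(flatten(value, f"{prefix}{key}."))
        else:
            items[prefix.rstrip(".")] = obj
        return items

    a, b = flatten(old), flatten(new)
    return {
        "added": sorted(set(b) - set(a)),
        "removed": sorted(set(a) - set(b)),
        "changed": sorted(k for k in set(a) & set(b) if a[k] != b[k]),
    }
```

Feeding only this diff to the agent would bound its input to the delta, which is where the 8.1M-token single run suggests most of the savings lie.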