Date range analyzed: 2026-04-27 (single-day snapshot, all 138 runs occurred today)
Repository: github/gh-aw
Executive Summary
138 runs across 68 workflows completed today with no escalation-eligible episodes and zero MCP failures or blocked-request episodes at the episode level. The portfolio is broadly healthy. The single operationally notable signal is a transient blocked_requests_increase classification on one AI Moderator run (since resolved to stable on the next run). Cost is highly concentrated: three workflows — [aw] Failure Investigator (6h), Schema Consistency Checker, and Documentation Unbloat — account for $11.02 of the total $13.58 billed (all via Anthropic/Claude; Copilot-engine costs are $0). The primary portfolio observation is that the repository runs a very broad set of workflows (68), many in the general_automation and issue_response domains, with exploratory execution style and high token volume that is not always justified by the domain. The graph lineage is entirely flat (0 edges), meaning no DAG orchestration is being detected; all 138 episodes are standalone.
Observability note: All 104 unknown-engine runs report $0 cost because token billing data is absent for those runs. The total cost figure of $13.58 reflects only the 7 Claude Code runs and a portion of the 5 Codex runs. Effective tokens are therefore used as the primary efficiency proxy for the majority of workflows.
Key Metrics
| Metric | Value |
| --- | --- |
| Date range | 2026-04-27 |
| Workflows analyzed | 68 |
| Runs analyzed | 138 |
| Episodes analyzed | 138 (1:1 with runs; 0 DAG edges) |
| High-confidence episodes | 137 |
| Escalation-eligible episodes | 0 |
| Runs classified risky | 1 (AI Moderator, transient) |
| Runs with medium/high severity assessments | 2 (Agent Persona Explorer, Layout Specification Maintainer — poor_control_node_count=1 each) |
| MCP failure count (all episodes) | 0 |
| Blocked-request count (all episodes) | 0 |
| Total estimated cost (Anthropic/Codex only) | $13.58 |
| Total effective tokens | 15.1M |
| Total action minutes | 382 min |
| Engine breakdown | Copilot CLI: 22, Claude Code: 7, Codex: 5, unknown: 104 |
| Workflows with overkill_for_agentic pattern | ~17 (exploratory style in triage/issue_response/general_automation domains) |
| Workflows with repeated latest_success fallback | 0 |
Highest Risk Episodes
No episodes are escalation-eligible. The single risk signal is:
AI Moderator — 1 of 5 runs today classified risky (blocked_requests_increase). The following run reverted to stable (cohort_match baseline). This is a transient fluctuation, not a regression pattern. No action required.
Two episodes show poor_control_node_count = 1:
Agent Persona Explorer (1 run, exploratory, issue_response domain)
Layout Specification Maintainer (1 run, exploratory, repo_maintenance domain)
Both are isolated occurrences. Neither crosses the 14-day escalation threshold.
Episode Regressions
No repeated regressions detected. The 138-episode sample is a single day, limiting regression visibility. Key observations:
Visual Regression Checker: 4+1 errors across 2 runs today — highest error rate in the repository. Likely an environment or dependency issue rather than an agentic control regression.
Smoke CI: 4 errors in 1 run. Consistent with known infrastructure smoke failures.
[aw] Failure Investigator (6h): 1 error in one run, high token/cost profile. Functioning as designed (long-running research agent).
Visual Diagnostics
1. Episode Risk-Cost Frontier
Decision: Schema Consistency Checker and [aw] Failure Investigator dominate the token frontier with zero risk signal — high cost but justified for their research/validation domains.
Why it matters: The frontier reveals no workflows combining both high cost AND high risk, which is the healthiest possible shape. Cost optimization (not risk mitigation) is the primary lever available. Note: Copilot-engine effective tokens appear as $0 in billing but do consume quota.
2. Workflow Stability Matrix
Decision: AI Moderator is the only repeat offender on risky_run_rate; the matrix is otherwise uniformly clean, indicating no chronic control problems across the portfolio.
Why it matters: The repository does not have broad instability — it has one workflow with a transient signal and two with isolated poor-control events. The dominant instability driver is risky_run_rate concentrated in AI Moderator, which self-corrected.
3. Repository Portfolio Map
Decision: High-token workflows (Schema Consistency Checker, Failure Investigator, Documentation Unbloat, Go Fan) belong in optimize; the large cluster of low-token, high-frequency workflows belongs in keep; smoke/test workflows belong in simplify.
Why it matters: The repository has a healthy core (keep quadrant) carrying most of the run volume at low cost, with a small set of expensive-but-valuable research agents in optimize. The review quadrant contains candidates for right-sizing or deterministic replacement.
4. Workflow Overlap Matrix
Decision: Contribution Check and Schema Consistency Checker show moderate overlap with each other and with [aw] Failure Investigator via shared general_automation/exploratory behavior cluster — worth reviewing for potential consolidation or pre-step extraction.
Why it matters: The overlap is behavior-cluster-based, not confirmed by workflow definitions. It is suggestive rather than conclusive. Consolidation would require confirming trigger and scope alignment.
Portfolio Opportunities
Note: Domain confidence is moderate — 56/138 runs (41%) fall into general_automation, and 54/138 (39%) into issue_response, suggesting the domain classifier is collapsing diverse workflows. Portfolio comparisons below use behavior clusters as a fallback grouping.
Optimize (high-token, high-value — consider right-sizing):
[aw] Failure Investigator (6h) — $4.59, 6.9M tokens, exploratory/research. High value for its domain; 2 runs today at full depth. Consider whether a tighter tool scope or pre-summarization step could reduce tokens without losing coverage.
Schema Consistency Checker — $3.69, 8.1M tokens, single run, exploratory/general_automation. At 8M tokens for one run this is the most token-intensive workflow. Evaluate whether a deterministic schema diff pre-step could reduce agent scope.
Documentation Unbloat — $2.73, 4.8M tokens, 3 runs, directed/issue_response. Consistent multi-run use; cost is justified if documentation quality is tracked. One error run today worth monitoring.
Simplify / deterministic candidates (lean + directed + narrow domain):
/cloclo, Scout, Q, Archie — 9-10 runs each, 0 tokens (Copilot engine, no billing data), low action minutes, directed style. These run frequently at low overhead and appear narrow-scope. If they are reading/aggregating, they may be partially reducible to deterministic pre-steps.
Auto-Triage Issues — issue_response domain, 11 action minutes, directed. Strong candidate for deterministic label matching + deterministic routing with a small model for edge cases.
Review (potentially overlapping or weakly justified):
Daily CLI Performance Agent and Daily CLI Tools Exploratory Tester — both daily scheduled, general_automation, exploratory style, high token count. Overlap in name family and schedule family; worth confirming whether they cover distinct dimensions or could be merged.
Smoke * family (8 workflows) — consistent, narrow, deterministic-grade tasks wrapped in agent shells. These are likely infrastructure tests; if they only check pass/fail they may not need agentic execution.
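To illustrate the "deterministic label matching + small model for edge cases" pattern suggested for Auto-Triage Issues above, a pre-step could evaluate a rule table before any model call and escalate only unmatched issues. This is a hypothetical sketch, not gh-aw's actual triage logic; the `RULES` patterns and the `triage` function are invented for illustration:

```python
import re

# Hypothetical rule table: regex pattern -> label. In a real setup the
# rules would live in the repository and be reviewed like any config.
RULES = [
    (re.compile(r"\bpanic\b|\bstack trace\b", re.I), "bug"),
    (re.compile(r"\bdocs?\b|\breadme\b", re.I), "documentation"),
    (re.compile(r"\bfeature request\b|\bwould be nice\b", re.I), "enhancement"),
]

def triage(title: str, body: str = "") -> list[str]:
    """Return deterministic labels for an issue.

    An empty result means no rule matched; only those issues would be
    escalated to a small model, keeping agentic spend on edge cases.
    """
    text = f"{title}\n{body}"
    labels = {label for pattern, label in RULES if pattern.search(text)}
    return sorted(labels)
```

Routing on `triage(...) == []` keeps the common cases entirely deterministic.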
View full workflow inventory (all 68 workflows)
| Workflow | Runs | Tokens | Cost | Domain | Style | Errors |
| --- | --- | --- | --- | --- | --- | --- |
| [aw] Failure Investigator (6h) | 2 | 6,895,082 | $4.59 | research | exploratory | 1 |
| Schema Consistency Checker | 1 | 8,121,019 | $3.69 | general_automation | exploratory | 0 |
| Documentation Unbloat | 3 | 4,838,793 | $2.73 | issue_response | directed | 1 |
| Go Fan | 1 | 2,009,806 | $1.60 | general_automation | exploratory | 0 |
| Contribution Check | 2 | 2,112,884 | $0 | general_automation | exploratory | 0 |
| Layout Specification Maintainer | 1 | 1,574,855 | $0 | repo_maintenance | exploratory | 0 |
| Copilot PR Prompt Pattern Analysis | 1 | 981,721 | $0 | research | exploratory | 0 |
| jsweep - JavaScript Unbloater | 1 | 974,246 | $0 | code_fix | exploratory | 0 |
| Daily CLI Performance Agent | 1 | 759,592 | $0 | general_automation | exploratory | 0 |
| Daily CLI Tools Exploratory Tester | 1 | 721,404 | $0 | general_automation | exploratory | 1 |
| Agent Persona Explorer | 1 | 660,888 | $0 | issue_response | exploratory | 0 |
| GPL Dependency Cleaner (gpclean) | 1 | 656,472 | $0 | general_automation | exploratory | 0 |
| CLI Version Checker | 1 | 454,395 | $0.57 | general_automation | exploratory | 0 |
| Agent Performance Analyzer | 1 | 447,592 | $0 | research | adaptive | 0 |
| Code Simplifier | 1 | 427,198 | $0 | code_fix | exploratory | 0 |
| Test Quality Sentinel | 1 | 176,802 | $0 | general_automation | directed | 0 |
| Issue Monster | 6 | 224,646 | $0 | issue_response | directed | 0 |
| AI Moderator | 5 | 0 | $0 | general_automation | directed | 0 |
| Design Decision Gate 🏗️ | 2 | 242,741 | $0.39 | general_automation | directed | 0 |
| Scout | 9 | 0 | $0 | issue_response | directed | 0 |
| /cloclo | 10 | 0 | $0 | issue_response | directed | 0 |
| Q | 9 | 0 | $0 | issue_response | directed | 0 |
| Archie | 8 | 0 | $0 | issue_response | directed | 0 |
| (remaining 45 workflows) | varied | varied | $0 | varied | directed | varied |
Recommended Actions
No escalation required. Zero episodes are escalation-eligible. The AI Moderator blocked_requests_increase signal self-corrected.
Investigate Visual Regression Checker errors (4 errors today, 2 runs). Not an agentic control problem — likely an environment dependency. Run a targeted audit if errors persist tomorrow.
Review Schema Consistency Checker token footprint. At 8.1M tokens for a single run, this is the highest per-run token cost in the repository. A deterministic schema-diff pre-step could significantly reduce agent scope.
Evaluate Daily CLI Performance Agent and Daily CLI Tools Exploratory Tester overlap. Both are daily, exploratory, general_automation. Confirm they cover distinct dimensions before the next schedule cycle.
Smoke family right-sizing. 8 Smoke workflows running at low token/low error rates. If they are pure pass/fail infrastructure checks, consider replacing agentic execution with deterministic CI steps.
Add DAG lineage instrumentation. 0 edges detected means no orchestrator→worker relationships are being captured. If any workflows delegate to others (e.g., Failure Investigator spawning sub-agents), enabling lineage tracking will improve future observability.
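The deterministic schema-diff pre-step recommended for Schema Consistency Checker could be sketched as below, so the agent receives only the changed keys rather than two full schemas. This is a hypothetical illustration, not existing gh-aw tooling; `diff_schemas` and the dot-path flattening are assumptions, and real JSON Schemas would need richer handling (arrays, `$ref` resolution):

```python
import json

def diff_schemas(old_path: str, new_path: str) -> dict:
    """Compare two JSON files field-by-field and report differences.

    Treats nested dicts as dot-separated paths; any non-dict value
    (including lists) is treated as a leaf for simplicity.
    """
    with open(old_path) as f:
        old = json.load(f)
    with open(new_path) as f:
        new = json.load(f)

    def flatten(obj, prefix=""):
        # Collapse nested dicts into {"a.b.c": leaf_value, ...}
        items = {}
        if isinstance(obj, dict):
            for key, value in obj.items():
                items.update(flatten(value, f"{prefix}{key}."))
        else:
            items[prefix.rstrip(".")] = obj
        return items

    a, b = flatten(old), flatten(new)
    return {
        "added": sorted(set(b) - set(a)),
        "removed": sorted(set(a) - set(b)),
        "changed": sorted(k for k in set(a) & set(b) if a[k] != b[k]),
    }
```

Feeding only this diff to the agent would bound its input to the delta, which is where the 8.1M-token single run suggests most of the savings lie.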