[observability] Agentic Observability Report — 2026-03-24 to 2026-04-23 #28011
Closed
Replies: 1 comment
This discussion has been marked as outdated by Agentic Observability Kit. A newer discussion is available at Discussion #28682.
Date range analyzed: 2026-03-24 → 2026-04-23 (30 days)
Workflow run: #24822586404
Executive Summary
46 runs analyzed across 21 distinct workflows. All 46 episodes are standalone (no multi-run DAGs detected; `edges[]` is empty). Episode confidence is uniformly high. Total spend is ~$6.04 in estimated cost (where available) and 231 action-minutes.

The primary concerns are concentrated in the Smoke test family and a handful of scheduled heavy workflows:
- `resource_heavy_for_domain` (high severity), `poor_agentic_control` (medium severity), and `write_heavy` actuation — an unusual posture for a smoke test.
- A research-domain workflow with `resource_heavy_for_domain` (high severity) and a partially reducible pre-step opportunity.
- `turns_increase`/`turns_decrease` drift — no escalation threshold crossed.

No episode has `escalation_eligible == true`. No escalation issue is created.

Key Metrics
- `cls_label == risky`
- `resource_heavy_for_domain` (any severity)
- `poor_agentic_control` (medium+)
- `partially_reducible`
- `overkill_for_agentic`
- `model_downgrade_available`

Highest Risk Episodes
- `resource_heavy` + `poor_agentic_control` (med) + `write_heavy` actuation
- `resource_heavy` + `poor_agentic_control` (med) + `write_heavy` actuation
- `resource_heavy`, `selective_write`
- `resource_heavy` (high), `partially_reducible`

Smoke Claude / Smoke Copilot are the most concerning: these are smoke tests that should be lean and narrow, but both showed `write_heavy` actuation (safe outputs: PR comments, reviews, code pushes, issue creation) and high resource consumption. The `poor_agentic_control` assessment indicates the control-loop quality is below expectations for a smoke-test domain.

Note: No workflow cleared the escalation thresholds (2+ risky-labeled runs, 2+ new MCP failures, 2+ blocked-request increases, or 2+ high-severity resource/control assessments for the same workflow within 14 days). Smoke Claude has 2 `resource_heavy_for_domain` high assessments and 2 `poor_agentic_control` medium assessments — borderline. These are cross-engine comparison smoke tests, so some write behavior may be intentional, but the posture warrants owner review.

Episode Regressions
Test Quality Sentinel shows execution drift: 4–14 turns across 8 runs (avg 6.9), with several `turns_increase` and `turns_decrease` classifications against cohort baselines. This indicates changing task shape or prompt instability rather than a consistent failure mode. Severity is low to medium.

Design Decision Gate shows similar turn variability (3 `turns_increase` + 3 `turns_decrease` vs. cohort) but is stable in cost and control quality. Drift here is benign.

Smoke CI had one `turns_increase` run (24819282263) against a stable cohort — minor.

Visual Diagnostics
1. Episode Risk-Cost Frontier
Decision: Smoke Claude, Failure Investigator, and Smoke Copilot occupy the Pareto frontier — expensive and risky relative to the repository.
Why it matters:
`estimated_cost` is sparse (zeroed for most workflows), so action-minutes serves as the cost proxy. Smoke Claude and Smoke Copilot combine moderate cost with the highest risk scores, driven by write-heavy actuation and poor-control assessments. The Failure Investigator is cost-heavy but lower-risk, making it an optimization candidate rather than an immediate control concern.

2. Workflow Stability Matrix
Decision: Smoke Claude and Smoke Copilot are the least stable workflows; the broad majority of the portfolio is clean across all six stability dimensions.
Why it matters: Resource-heavy rate and poor-control rate are concentrated in the Smoke engine family, not spread across unrelated workflows. The repository does not have broad instability — it has two point sources that need attention. All scheduled and PR-triggered operational workflows (CI, Quality Sentinel, Design Gate) read as stable or low-noise.
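The matrix itself is just per-workflow rates over a handful of boolean dimensions. A minimal sketch of that aggregation in Python, assuming a simplified episode record (the `Episode` fields and the four dimensions shown here are illustrative stand-ins, not the kit's actual schema):

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical episode record; field names are illustrative, not the
# Agentic Observability Kit's real schema.
@dataclass
class Episode:
    workflow: str
    risky: bool = False
    resource_heavy: bool = False
    poor_control: bool = False
    failed: bool = False

DIMENSIONS = ("risky", "resource_heavy", "poor_control", "failed")

def stability_matrix(episodes):
    """Per-workflow rate for each stability dimension (0.0 = clean)."""
    counts = defaultdict(lambda: {"runs": 0, **{d: 0 for d in DIMENSIONS}})
    for ep in episodes:
        row = counts[ep.workflow]
        row["runs"] += 1
        for d in DIMENSIONS:
            row[d] += bool(getattr(ep, d))
    # Report rates rather than raw counts so workflows with different
    # run volumes remain comparable.
    return {wf: {d: row[d] / row["runs"] for d in DIMENSIONS} | {"runs": row["runs"]}
            for wf, row in counts.items()}
```

Under this shape, "two point sources" falls out directly: only the workflows whose rates are nonzero across several dimensions need attention.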
3. Repository Portfolio Map
Decision: Design Decision Gate, Test Quality Sentinel, and Smoke CI belong in optimize (high value, high run count); Smoke Claude and Smoke Copilot belong in review (high cost, low value proxy given instability).
Why it matters: The portfolio is healthy in its core PR-automation cluster. The
`review` quadrant contains workflows that are either resource-heavy smoke tests (Smoke Claude/Copilot), infrequently run heavy schedulers (Failure Investigator, Daily CLI Tester, Code Simplifier, Daily CLI Performance Agent), or one-time non-recurring runs. The `simplify` quadrant (Auto-Triage, PR Triage Agent, Contribution Check) contains workflows that may benefit from model downgrades or deterministic pre-filtering.

4. Workflow Overlap Matrix
Decision: The Smoke engine family (Claude, Copilot, Codex, OpenCode, Crush, Gemini) forms the strongest overlap cluster; Design Decision Gate and Test Quality Sentinel show moderate co-occurrence but serve distinct purposes.
Why it matters: Overlap within the Smoke family is intentional and structural — they are cross-engine regression tests. The overlap between Design Gate and Test Quality Sentinel is weak (different domains, different engines) and does not suggest consolidation. No unintentional duplications are detected in this window.
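One common way to quantify this kind of overlap cluster is Jaccard similarity over per-workflow feature sets (for example, tools invoked or paths touched). The feature-set input is an assumption for illustration, not necessarily how the kit computes its matrix:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two feature sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def overlap_matrix(features: dict) -> dict:
    """Pairwise overlap between workflows, keyed by sorted name pairs."""
    names = sorted(features)
    return {(x, y): jaccard(features[x], features[y])
            for i, x in enumerate(names) for y in names[i + 1:]}
```

A score near 1.0 within the Smoke family and near 0.0 between Design Gate and Test Quality Sentinel would reproduce the report's conclusion: structural overlap in one cluster, no consolidation signal elsewhere.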
Portfolio Opportunities
Portfolio details by quadrant
Review (high cost, low value proxy):
The `partially_reducible` signal suggests structured log pre-filtering (deterministic) could reduce token overhead before the agentic layer.

Simplify (low cost, lower value proxy):
- `overkill_for_agentic` (low severity). A deterministic label-matching rule or a smaller model could handle this task at lower overhead.
- `partially_reducible` + `model_downgrade_available`. Good candidate for a lighter model; triage tasks rarely need full reasoning depth.
- `resource_heavy_for_domain` (high severity) despite being a scheduled check. If it fires infrequently, this is acceptable; if frequent, a focused deterministic check would outperform it.

Optimize (high value, high run count):
Recommended Actions
Smoke Claude / Smoke Copilot — Review the write-capable tool list. Both workflows exhibit `write_heavy` actuation and `poor_agentic_control` signals. If write actions are intentional test coverage, add explicit scope guards. If not, remove write-capable tools from the smoke test configuration. (Priority: medium — 2 runs each, patterns consistent)

Smoke Crush — Investigate the 100% failure rate. 2/2 runs failed. Check whether the Crush engine endpoint is misconfigured or whether a schema change broke the smoke test. (Priority: high — reliability hotspot)

Daily CLI Tools Exploratory Tester — Audit the network allowlist. A 49% blocked-request rate suggests the workflow is routinely attempting out-of-scope domains. Either narrow the tool's exploration directives or expand the allowlist deliberately. (Priority: medium)
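A blocked-request rate like the 49% figure above can be recomputed deterministically from egress logs before deciding which way to resolve it. A sketch, assuming a hypothetical `(domain, allowed)` log shape rather than any real firewall format:

```python
def blocked_request_rate(requests):
    """Fraction of egress requests denied by the allowlist.

    `requests` is an iterable of (domain, allowed) pairs; this shape is
    hypothetical and would really be parsed from proxy/firewall logs.
    Returns (rate, per-domain blocked counts sorted descending).
    """
    total = blocked = 0
    blocked_domains = {}
    for domain, allowed in requests:
        total += 1
        if not allowed:
            blocked += 1
            blocked_domains[domain] = blocked_domains.get(domain, 0) + 1
    rate = blocked / total if total else 0.0
    # The top offenders tell you whether to narrow the exploration
    # directives (many scattered domains) or widen the allowlist
    # (a few legitimate domains dominating the blocks).
    top = sorted(blocked_domains.items(), key=lambda kv: -kv[1])
    return rate, top
```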
[aw] Failure Investigator (6h) — Add a deterministic log-download and pre-filter step before the agentic layer to reduce token load. The `partially_reducible` signal at a $1.89/run cost makes this the highest-value optimization target in the repository. (Priority: low — single run in 30 days)

Test Quality Sentinel — Investigate prompt or task-shape changes causing 4–14 turn variance. A stable prompt with consistent input would yield more reliable cohort baselines. (Priority: low — no control failures)
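The `turns_increase`/`turns_decrease` labels used throughout this report describe a run's turn count relative to its cohort baseline. A minimal sketch of one plausible classification rule; the z-score form and the 1.5 threshold are assumptions for illustration, not the kit's documented rule:

```python
import statistics

def classify_turn_drift(turns, cohort, z_threshold=1.5):
    """Label a run's turn count against its cohort baseline.

    Returns 'turns_increase', 'turns_decrease', or 'stable'.
    `cohort` is the list of turn counts from prior comparable runs.
    """
    if len(cohort) < 2:
        return "stable"  # not enough history to establish a baseline
    mean = statistics.mean(cohort)
    stdev = statistics.stdev(cohort)
    if stdev == 0:
        # Perfectly flat cohort: any deviation is drift.
        if turns == mean:
            return "stable"
        return "turns_increase" if turns > mean else "turns_decrease"
    z = (turns - mean) / stdev
    if z > z_threshold:
        return "turns_increase"
    if z < -z_threshold:
        return "turns_decrease"
    return "stable"
```

Under a rule like this, the Sentinel's 4–14 turn spread would produce exactly the mixed increase/decrease labels seen here: a noisy baseline flags both tails without indicating a consistent failure mode.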
Per-workflow detail
Observability note: `edges[]` is empty across all 46 runs — no cross-run lineage was established in this 30-day window. All episodes are treated as standalone by the deterministic model, which is consistent with the repository's primarily PR-event and schedule-triggered architecture. No multi-run DAG chains were missed.

Cost note: `estimated_cost` is zero for the majority of workflows (Copilot engine costs are not surfaced). All efficiency comparisons use `action_minutes` as the primary cost proxy. Claude-engine workflows ($0.21–$1.89/run) are the only ones with non-zero estimated cost.

References: