[observability] Agentic Observability Report — 2026-04-21 #27636

2026-04-21T17:14:19Z

github-actions[bot]
Bot Apr 21, 2026

Date range analyzed: 2026-03-22 → 2026-04-21 (30-day window requested). All 49 sampled runs occurred on 2026-04-21, a single-day batch likely triggered by a CI push event that ran the workflow suite. Trend comparisons are therefore within-day only.

Executive Summary

49 runs across 15 workflows were analyzed. No episodes were flagged escalation_eligible by the deterministic model, but two workflows, Smoke Claude and Smoke Copilot, crossed the manual escalation thresholds. Both received high-severity resource_heavy_for_domain assessments on all 3 of their runs in the 14-day window, alongside medium-severity poor_agentic_control assessments (all 3 Smoke Copilot runs, 1 Smoke Claude run).
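The manual escalation check described above can be sketched roughly as follows. This is an illustration only: the function name, the tuple layout, and the `min_high=2` cutoff are assumptions, since the report does not state the actual threshold values.

```python
from datetime import date, timedelta


def crosses_manual_threshold(assessments, today, window_days=14, min_high=2):
    """Return True if a workflow has repeated HIGH assessments in the window.

    `assessments` is a list of (run_date, kind, severity) tuples.
    `min_high=2` is an assumed cutoff; the report does not state the
    actual threshold values.
    """
    cutoff = today - timedelta(days=window_days)
    high_in_window = [
        a for a in assessments if a[0] >= cutoff and a[2] == "high"
    ]
    return len(high_in_window) >= min_high


# Smoke Copilot as described in the report: 3 runs on 2026-04-21, each with
# a high-severity resource_heavy_for_domain assessment.
run_day = date(2026, 4, 21)
smoke_copilot = [(run_day, "resource_heavy_for_domain", "high")] * 3
print(crosses_manual_threshold(smoke_copilot, today=run_day))
```

Counting only high-severity assessments keeps the check simple; a fuller version would presumably also weigh the medium-severity poor_agentic_control signal.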

The portfolio is dominated by smoke/CI validation workflows that are cheap, stable, and justified. The operational concerns concentrate in three areas: (1) Smoke Claude and Smoke Copilot are consistently expensive and poorly controlled for their stated purpose; (2) The Daily Repository Chronicle has the highest single-run token volume in the dataset (2.22M effective tokens) along with 40 blocked requests; and (3) Slide Deck Maintainer shows resource-heavy, exploratory behavior with zero write actions, making it a strong candidate for model downgrade or deterministic simplification.

The repository has no risky-labeled runs and no MCP failures. The overall safety posture is clean.

Key Metrics

| Metric | Value |
| --- | --- |
| Workflows analyzed | 15 |
| Runs analyzed | 49 |
| Episodes analyzed | 49 (all standalone) |
| High-confidence episodes | 49 / 49 |
| Runs labeled `risky` | 0 |
| Runs with medium/high agentic assessment | 9 |
| Escalation-eligible episodes | 0 (deterministic model) |
| Workflows crossing manual escalation threshold | 2 (Smoke Claude, Smoke Copilot) |
| Total estimated cost (USD) | $3.50 |
| Total effective tokens | 9.2M |
| Total action minutes | 227 |
| Total errors | 21 (mostly smoke-test engine failures) |
| MCP failures | 0 |
| Engines active | copilot (25), codex (7), claude (6), gemini (4), crush (3), opencode (3) |

Cost note: estimated_cost is zero for copilot, codex, gemini, crush, and opencode runs in this sample. Charts use effective tokens / 1M as a cost proxy for those workflows. Claude runs report USD cost directly.
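The fallback in the cost note can be sketched as a small helper. This is a minimal illustration, not the kit's implementation: the function name and the `estimated_cost` / `effective_tokens` dictionary keys are assumptions.

```python
def cost_proxy(run: dict) -> float:
    """Return a comparable cost figure for one run.

    Uses the reported USD cost when it is non-zero (Claude runs in this
    sample); otherwise falls back to effective tokens / 1M, as the cost
    note describes. The dictionary keys are hypothetical.
    """
    usd = run.get("estimated_cost", 0.0)
    if usd > 0:
        return usd
    return run.get("effective_tokens", 0) / 1_000_000


# The Chronicle run reports no USD cost, so its proxy is 2.22M tokens / 1M.
chronicle = {"estimated_cost": 0.0, "effective_tokens": 2_220_000}
# A Claude run at the top of the reported $0.65-$0.94 range.
claude_run = {"estimated_cost": 0.94}
print(cost_proxy(chronicle))
print(cost_proxy(claude_run))
```

The two branches return different units (USD versus millions of tokens), so the result is only suitable for ranking and charting, not for summing into a dollar total.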

Highest Risk Episodes

No episodes are escalation_eligible per the deterministic model. The three highest-composite-risk episodes are:

| Episode | Risk Score | Blocked Reqs | Resource Heavy | Cost Proxy |
| --- | --- | --- | --- | --- |
| The Daily Repository Chronicle | 40.0 | 40 | 1 | 2.22M tok |
| Agent Container Smoke Test ×3 | 4.0 ea | 4 ea | 0 | 0.48M tok |
| Smoke Gemini / Codex / OpenCode | 1.0 ea | 1 ea | 0 | ~0 |

The Daily Repository Chronicle stands alone at the top by blocked-request count (40), though it has zero risky or poor-control nodes. The blocked requests are consistent with expected firewall behavior for a read-heavy general automation workflow.

Episode Regressions

All 49 episodes are kind: standalone (no DAG chains or delegated workers detected). No edge lineage was present in the data.

Behavioral regressions observed within workflow cohorts:

| Workflow | Classification | Pattern |
| --- | --- | --- |
| Smoke Claude | `changed` (all 3 runs) | Consistent `resource_heavy` HIGH + `partially_reducible` MEDIUM across all runs; one run adds `poor_agentic_control` MEDIUM |
| Smoke Copilot | `changed` (2/3 runs), `stable` (1) | Consistent `resource_heavy` HIGH + `poor_agentic_control` MEDIUM across all 3 runs |
| Design Decision Gate 🏗️ | `changed` (all 3 runs) | Behavior delta detected vs baseline; cost stable ~$0.30-0.37/run; no safety concern |
| Test Quality Sentinel | `changed` (1/3 runs) | Isolated change, other 2 stable |

The Smoke Claude and Smoke Copilot regressions are the primary operational finding. Both workflows are persistently expensive and weakly controlled relative to their smoke-test purpose.

Visual Diagnostics

1. Episode Risk-Cost Frontier

Decision: Smoke Copilot, Smoke Claude, and The Daily Repository Chronicle dominate the frontier — expensive and/or high-risk, requiring optimization before the next batch.

Why it matters: Smoke Claude and Smoke Copilot are structurally write-heavy and exploratory for a validation workflow; their cost is driven by inference depth, not task complexity. The Daily Repository Chronicle's risk score is entirely blocked-request driven — firewall behavior, not agent misbehavior.

2. Workflow Stability Matrix

Decision: Smoke Copilot and Slide Deck Maintainer are the most unstable workflows; Smoke CI and most engine-specific smokes are flat-stable.

Why it matters: The instability is concentrated, not broad — 2–3 workflows account for almost all the resource-heavy and poor-control signal. The rest of the portfolio is well-behaved.

3. Repository Portfolio Map

Decision: Smoke CI, Smoke Codex, Smoke Gemini, Smoke Crush, Smoke OpenCode belong in keep; Design Decision Gate and Test Quality Sentinel in optimize; Smoke Copilot, Smoke Claude, Slide Deck Maintainer, and The Daily Repository Chronicle in review.

Why it matters: The review quadrant (high cost, low value proxy) contains the two escalation-threshold workflows plus two additional heavy consumers. None of them are justified by their current value signal at this cost level.
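The keep / optimize / review split can be approximated with a simple rule on cost and value. The sketch below is an assumption-laden illustration: the function name, the "weak"/"good" value labels, and both cutoffs are hypothetical and were chosen only so the examples land in the buckets the report assigns.

```python
def portfolio_bucket(cost_proxy, value_signal, low_cost=0.25, high_cost=1.5):
    """Place a workflow in keep / optimize / review.

    `value_signal` is "weak" or "good". Both cutoffs are assumed values
    chosen so the examples below match the report's buckets.
    """
    if cost_proxy <= low_cost:
        return "keep"
    if cost_proxy >= high_cost and value_signal == "weak":
        return "review"
    return "optimize"


# Cost proxies from the report; value labels follow its bucket descriptions.
print(portfolio_bucket(0.0, "good"))   # Smoke CI
print(portfolio_bucket(0.57, "good"))  # Test Quality Sentinel
print(portfolio_bucket(3.37, "weak"))  # Smoke Copilot
```

A two-threshold rule like this is easy to audit, but it ignores stability and repeat use, which the report also weighs when placing workflows.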

4. Workflow Overlap Matrix

Decision: The smoke-test family (Smoke CI, Smoke Claude, Smoke Codex, Smoke Copilot, etc.) has high intra-cluster overlap — they share trigger, task domain, and behavioral fingerprint — but are likely intentional per-engine coverage, not redundancy.

Why it matters: The overlap is real but structurally motivated. Consolidation would reduce per-engine visibility. The more actionable question is whether Smoke Claude and Smoke Copilot need to be so heavyweight within their engine smoke slot.

Portfolio Opportunities

Repository Portfolio Detail

keep — cheap, stable, high repeat use:

Smoke CI (13 runs, $0 cost, all stable general_automation) — ideal candidate for deterministic replacement or minimal-model configuration. High repeat use justifies retention.
Smoke Codex / Gemini / Crush / OpenCode — similar profile; lean and directed.
Changeset Generator — cheap, lean. One error per run (engine-level, not agent misbehavior).

optimize — moderate cost, good value:

Design Decision Gate 🏗️ — $0.30-0.37/run, 3 runs, all changed vs baseline. Cost is justified for a code-review/architecture gate. Consider cohort-match review to understand why it always diffs from baseline.
Test Quality Sentinel — 0.57M token proxy, read_only, stable in 2 of 3 runs. Good value, but worth a model-downgrade check.
Agent Container Smoke Test — 0.48M token proxy per episode, 4 blocked requests/run. Stable and justified as integration validation.

review — high cost, weak current value signal:

Smoke Copilot — 3.37M token proxy across 3 runs (highest workflow total in the portfolio), write_heavy, exploratory, triage domain, all runs flagged resource_heavy HIGH + poor_agentic_control MEDIUM. Smoke tests should not be write-heavy or exploratory. Strongest simplification candidate.
Smoke Claude — 2.46M token proxy ($2.46 USD total), write_heavy, exploratory, 3 runs. Same structural problem as Smoke Copilot — heavyweight behavior for a smoke validation task.
The Daily Repository Chronicle — 2.22M token proxy, 40 blocked requests/run, resource_heavy HIGH. Read-only general automation with heavy token footprint. The 40 blocked requests suggest it probes many domains; partially reducible (50% data-gathering turns).
Slide Deck Maintainer — 1.72M token proxy, 30 turns, exploratory, read_only, resource_heavy HIGH + poor_agentic_control MEDIUM + model_downgrade_available LOW. Single run; heavy for repo maintenance. Strong model-downgrade candidate.

Stale/overlap candidates:

The workflows in the smoke cluster (7+) all trigger on push with identical behavioral fingerprints. If per-engine coverage is the goal, the cluster is justified. If not, 3-4 representative engine tests would cover the same failure surface.

Recommended Actions

| Priority | Workflow | Action |
| --- | --- | --- |
| 🔴 High | Smoke Copilot | Reduce write-heavy/exploratory behavior. Add turn limits or tighten instructions. 3/3 runs flagged resource-heavy HIGH + poor-control MEDIUM. |
| 🔴 High | Smoke Claude | Reduce write-heavy/exploratory behavior. Add turn limits or tighten instructions. 3/3 runs flagged resource-heavy HIGH; 1 run also flagged poor-control MEDIUM. The partially_reducible signal (62–69% data-gathering) suggests pre-steps can offload context collection. |
| 🟠 Medium | Slide Deck Maintainer | Try `model_downgrade`: `engine.model: claude-haiku-4-5` or `gpt-4.1-mini`. 30 turns, read-only, no writes: this does not need a frontier model. |
| 🟠 Medium | The Daily Repository Chronicle | Review the 40 blocked requests per run. If those domains are expected, add them to the allowlist. If not, tighten network scope. Consider moving the 50% data-gathering share to deterministic pre-steps. |
| 🟡 Low | Issue Monster | Apply `model_downgrade` recommendation. Low-severity, single run, but the signal is clear for `issue_response` + `read_only` + `moderate` profile. |
| 🟡 Low | Design Decision Gate 🏗️ | Investigate why all 3 runs classify as `changed` vs baseline. No safety concern, but persistent drift from cohort match suggests prompt or task-shape instability. |
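For the blocked-request review recommended for The Daily Repository Chronicle, one way to split the blocked domains into "allowlist" and "investigate" groups is sketched below. The function name, the input shape (a flat list of blocked domain strings), and the example domains are all hypothetical; the report does not list the actual blocked domains or the firewall log format.

```python
from collections import Counter


def triage_blocked(blocked_domains, expected_domains):
    """Split blocked-request counts into allowlist candidates and unknowns.

    `blocked_domains` has one entry per blocked request.
    `expected_domains` is the set of domains the workflow is meant to reach.
    """
    counts = Counter(blocked_domains)
    allowlist_candidates = {
        d: n for d, n in counts.items() if d in expected_domains
    }
    needs_review = {
        d: n for d, n in counts.items() if d not in expected_domains
    }
    return allowlist_candidates, needs_review


# Placeholder domains for illustration only.
blocked = (
    ["api.example.com"] * 3 + ["cdn.example.net"] * 2 + ["tracker.example.org"]
)
expected = {"api.example.com", "cdn.example.net"}
allow, review = triage_blocked(blocked, expected)
print(allow)
print(review)
```

Domains in the first group are candidates for the allowlist; anything in the second group points to tightening the network scope instead.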

Deterministic replacement candidates (consider removing agentic layer entirely):

Smoke CI — 13 runs, all directed + read_only + lean + general_automation. A deterministic shell script or composite action would be faster and cheaper.
Changeset Generator — same profile, with consistent 1 error/run (engine failing, not agent logic).
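As a rough idea of what a deterministic replacement could look like, the sketch below runs a fixed list of checks and reports pass/fail with no agent involved. The check names and commands are placeholders; the report does not say what Smoke CI actually verifies.

```python
import subprocess
import sys


def run_smoke_checks(checks):
    """Run each (name, argv) check and return {name: passed}.

    A check passes when its command exits with status 0.
    """
    results = {}
    for name, argv in checks:
        proc = subprocess.run(argv, capture_output=True, text=True)
        results[name] = proc.returncode == 0
    return results


# Placeholder checks; a real smoke suite would call the project's own tools.
CHECKS = [
    ("interpreter_starts", [sys.executable, "-c", "print('ok')"]),
    ("json_module_imports", [sys.executable, "-c", "import json"]),
]

results = run_smoke_checks(CHECKS)
print(results)
```

Because the checks are fixed and exit-code driven, the same run always produces the same result, which is the property the deterministic-replacement recommendation is after.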

Per-Workflow Run Detail

Smoke Claude (3 runs, all 2026-04-21)

Runs: §24732851631, §24733695961, §24735013438
Domain: Code Fix / Triage | Engine: Claude Code | Cost: $0.65–$0.94/run
All runs: resource_heavy_for_domain HIGH (turns 29–39, tool_types 24–26, write_actions 20–24)
All runs: partially_reducible MEDIUM (62–69% data-gathering turns)
One run: poor_agentic_control MEDIUM (friction=0, exploratory, write_heavy)
Classification: changed vs baseline on all 3 runs

Smoke Copilot (3 runs, all 2026-04-21)

Runs: §24732851576, §24733696225, §24735013467
Domain: Triage | Engine: GitHub Copilot CLI | Tokens: ~1M–1.25M effective/run
All runs: resource_heavy_for_domain HIGH (turns 15–20, write_actions 28–30, duration 10–14m)
All runs: poor_agentic_control MEDIUM (friction=0, exploratory, write_heavy)
Classification: changed (2 runs) / stable (1 run)

The Daily Repository Chronicle (1 run)

Run: §24732069660
Domain: General Automation | Engine: GitHub Copilot CLI | Tokens: 2.22M effective
resource_heavy_for_domain HIGH (turns=32, duration=10m56s, write_actions=0)
partially_reducible LOW (50% data-gathering, agentic_fraction=0.50)
40 blocked requests — highest in dataset
Classification: no baseline (first or infrequent run)
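The data-gathering percentage behind the partially_reducible signal can be computed from per-turn labels, as in the sketch below. The label names and the turn list are made up for illustration, and treating agentic_fraction as one minus the data-gathering share is an assumption that merely agrees with the single pair of values reported for this run.

```python
def data_gathering_fraction(turn_labels):
    """Share of turns labeled "data_gathering" (0.0 for an empty list)."""
    if not turn_labels:
        return 0.0
    gathering = sum(1 for label in turn_labels if label == "data_gathering")
    return gathering / len(turn_labels)


# The Chronicle run: 32 turns, 50% data-gathering per the report.
# The per-turn labels themselves are invented for this example.
turns = ["data_gathering"] * 16 + ["agentic"] * 16
fraction = data_gathering_fraction(turns)
agentic_fraction = 1.0 - fraction  # assumed relationship, see lead-in
print(fraction, agentic_fraction)
```

Turns counted as data-gathering are the ones a deterministic pre-step could take over, which is why a higher share makes a workflow more reducible.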

Slide Deck Maintainer (1 run)

Run: §24734277540
Domain: Repo Maintenance | Engine: GitHub Copilot CLI | Tokens: 1.72M effective
resource_heavy_for_domain HIGH (turns=30, duration=8m42s, write_actions=0)
poor_agentic_control MEDIUM | partially_reducible LOW | model_downgrade_available LOW
Classification: no baseline

References:

§24735013438 — Smoke Claude (most recent, changed + resource_heavy + poor_control)
§24735013467 — Smoke Copilot (most recent, changed + resource_heavy + poor_control)
§24732069660 — The Daily Repository Chronicle (40 blocked requests, resource_heavy)

Generated by Agentic Observability Kit · ● 1.9M · ◷

expires on Apr 28, 2026, 5:14 PM UTC

2026-04-23T07:36:06Z

github-actions[bot]
Bot Apr 23, 2026
Author

This discussion has been marked as outdated by Agentic Observability Kit.

A newer discussion is available at Discussion #28011.
