[observability] Agentic Workflow Observability Report — 2026-04-20 #27292

2026-04-20T09:07:43Z

github-actions[bot]
Bot Apr 20, 2026

The observability window covers 2026-04-20 02:20Z – 09:00Z (~7 hours). A 14-day window was requested, but data collection timed out after 50 runs, so only this window is covered. A continuation cursor is available for deeper historical analysis. All episodes are standalone with no lineage edges; no orchestrated DAGs were detected in this snapshot.

Executive Summary

50 runs across 35 workflows were analyzed. No runs were classified risky; 4 were classified changed (all due to turns_increase). The dominant operational signal is pervasive resource heaviness: 16 of 50 runs (32%) carry a high-severity resource_heavy_for_domain assessment, distributed across 15 distinct workflows. Three workflows show poor_agentic_control. One workflow, Design Decision Gate 🏗️, crosses the escalation threshold with 2 consecutive runs showing high resource heaviness and increasing turn counts.

A separate concern: Daily CLI Tools Exploratory Tester logged 42 blocked network requests in a single run, and Daily CLI Performance Agent logged 19. These blocked-request counts stand out and warrant access-policy review.

Key Metrics

Metric	Value
Workflows analyzed	35
Runs analyzed	50
Episodes analyzed	50 (all standalone, 0 edges)
High-confidence episodes	50
Runs classified `risky`	0
Runs classified `changed`	4
Runs with high-severity `resource_heavy_for_domain`	16
Runs with medium/high `poor_agentic_control`	3
Runs with `partially_reducible` (medium+)	2
Total tokens consumed	~28.8M
Total estimated cost	$8.54
Total action minutes	301 min
Engines	copilot 28, claude 9, codex 7, none 6
Escalation-eligible episodes (model)	0
Workflows crossing fallback threshold	1 (Design Decision Gate 🏗️)

Highest Risk Episodes

Design Decision Gate 🏗️ — Issue Response / General Automation, claude engine

This is the only workflow crossing an escalation threshold. Across 3 runs today, turns escalated from 4 → 8 → 10 with cohort-matched baselines confirming turns_increase on the last two. Both later runs carry high resource_heavy_for_domain and the most recent also carries medium poor_agentic_control (exploratory execution, broad tool usage, 9 tool types). The agentic fraction is ~50%, meaning half the work could be moved to deterministic pre-steps.

Run	Status	Turns	Assessment
§24648171195	✅	4	baseline (no assessment)
§24649877750	✅ changed	8	`resource_heavy_for_domain` HIGH
§24650414222	✅ changed	10	`resource_heavy_for_domain` HIGH + `poor_agentic_control` MEDIUM
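
The escalation rule described above (two consecutive runs with high resource heaviness and rising turn counts) can be sketched as follows. This is only an illustration, not the kit's actual implementation: the `Run` dataclass, the `crosses_escalation_threshold` function, and the `min_consecutive=2` default are assumptions chosen to match the wording of this report.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    run_id: int
    turns: int
    # Assessment name -> severity, e.g. {"resource_heavy_for_domain": "high"}
    assessments: dict = field(default_factory=dict)


def crosses_escalation_threshold(runs, min_consecutive=2):
    """Return True if the most recent `min_consecutive` runs each carry a high
    resource_heavy_for_domain assessment and used more turns than the run
    before them. `runs` must be ordered oldest to newest."""
    if len(runs) < min_consecutive + 1:
        return False  # need an earlier run to compare turn counts against
    streak = 0
    for prev, cur in zip(runs, runs[1:]):
        heavy = cur.assessments.get("resource_heavy_for_domain") == "high"
        if heavy and cur.turns > prev.turns:
            streak += 1
        else:
            streak = 0  # the pattern must be unbroken up to the latest run
    return streak >= min_consecutive


# The three Design Decision Gate runs from the table above.
design_decision_gate = [
    Run(24648171195, 4),
    Run(24649877750, 8, {"resource_heavy_for_domain": "high"}),
    Run(24650414222, 10, {"resource_heavy_for_domain": "high",
                          "poor_agentic_control": "medium"}),
]
print(crosses_escalation_threshold(design_decision_gate))  # True
```

With only the first two runs the function returns False, which matches the idea that a single heavy run is not enough to escalate.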

AI Moderator — codex engine, 6/6 runs blocked

Every single run in the window has at least 1 blocked request. One run additionally carries resource_heavy_for_domain medium. The consistent blocking pattern (not an occasional spike) suggests the workflow is routinely attempting access that the firewall disallows. This is a design-level issue worth reviewing.
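
One way to separate a consistent blocking pattern from an occasional spike is to look at how many runs had any blocked request at all, rather than at the total count. The sketch below assumes a hypothetical `blocking_pattern` helper and hypothetical labels (`systematic`, `intermittent`); the per-run counts are illustrative, since the report only states that each run had at least 1 blocked request.

```python
def blocking_pattern(blocked_per_run):
    """Classify a workflow's blocked-request history.

    `blocked_per_run` holds one blocked-request count per run.
    """
    if not blocked_per_run:
        return "no_data"
    blocked_runs = sum(1 for count in blocked_per_run if count > 0)
    if blocked_runs == 0:
        return "clean"
    if blocked_runs == len(blocked_per_run):
        return "systematic"    # every run was blocked at least once
    return "intermittent"      # only some runs were blocked


# Illustrative counts for the 6 AI Moderator runs (assumed: 1 per run).
ai_moderator = [1, 1, 1, 1, 1, 1]
print(blocking_pattern(ai_moderator))  # systematic
```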

Daily CLI Tools Exploratory Tester — copilot engine

Single run §24650053759 logged 42 blocked requests and 28 turns. Assessment: high resource_heavy_for_domain + medium partially_reducible (92% of turns could move to deterministic steps). This run has a minimal agentic fraction (0.07): the task is largely read-only data gathering presented as an agentic workflow.
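
To make the agentic fraction concrete, the sketch below splits a run's turns into an agentic part and a part that could move to deterministic steps. It assumes the agentic fraction is the share of turns that need inference, which is how this report uses the term; the `split_turns` helper is hypothetical.

```python
def split_turns(total_turns, agentic_fraction):
    """Estimate how many turns need inference and how many do not."""
    agentic = round(total_turns * agentic_fraction)
    return agentic, total_turns - agentic


agentic, deterministic = split_turns(28, 0.07)
print(agentic, deterministic)               # 2 26
print(f"{deterministic / 28:.1%}")          # 92.9%
```

Under that assumption, roughly 2 of the 28 turns need an agent, in line with the reducible share reported for this run.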

Episode Regressions

Workflow	Signal	Detail
Design Decision Gate 🏗️	`turns_increase` × 2	Turns 4→8→10 across 3 runs, cohort-matched
Auto-Triage Issues	`turns_increase` × 1	One run vs. stable baseline
Test Quality Sentinel	`turns_increase` × 1	One run vs. stable baseline
Daily CLI Performance Agent	19 blocked requests	High network friction
Contribution Check	13 blocked requests	High network friction

Recommended Actions

Design Decision Gate 🏗️: Tighten the instructions and reduce tool breadth from the current 9 types. Move the roughly 50% of work that is data gathering into deterministic pre-agent frontmatter steps. Consider a smaller model (claude-haiku-4-5). An escalation issue has been filed.
Daily CLI Tools Exploratory Tester: With 92% data-gathering turns and 42 blocked requests, this workflow is over-engineered for its task. Restructure it as a deterministic script with a thin agentic post-step, and review which domains are being blocked.
Daily CLI Performance Agent: 19 blocked requests in a single run indicate network policy friction. Review network.allowed in the workflow frontmatter.
AI Moderator: Investigate why every run produces a blocked request. The 6/6 blocking rate is not noise; it is a systematic access mismatch. Review what the agent is trying to reach and whether that access should be allowed.
Documentation Unbloat / Layout Specification Maintainer / jsweep: These are the top token consumers (4.9M, 3.5M, 2.4M tokens). Each has only a single run in this window, but all are worth monitoring for cost efficiency. Layout Specification Maintainer also shows poor_agentic_control.

Per-Workflow Resource Profile (all 35 workflows)

Workflow	Engine	Runs	Tokens	Cost	Resource Heavy	Poor Control	Blocked
Documentation Unbloat	claude	1	4.9M	$2.40	HIGH	—	1
Layout Specification Maintainer	copilot	1	3.5M	$0.00	HIGH	MEDIUM	—
jsweep - JavaScript Unbloater	copilot	1	2.4M	$0.00	HIGH	—	1
Code Simplifier	copilot	1	2.4M	$0.00	HIGH	—	—
Agent Performance Analyzer	copilot	1	2.1M	$0.00	HIGH	—	—
Contribution Check	copilot	1	1.8M	$0.00	HIGH	—	13
Daily CLI Tools Exploratory Tester	copilot	1	1.5M	$0.00	HIGH	—	42
Agent Persona Explorer	copilot	1	1.4M	$0.00	HIGH	MEDIUM	—
Schema Consistency Checker	claude	1	1.0M	$1.40	HIGH	—	—
Go Fan	claude	1	1.1M	$1.01	HIGH	—	—
Daily CLI Performance Agent	copilot	1	1.0M	$0.00	HIGH	—	19
CLI Version Checker	claude	1	862K	$0.97	HIGH	—	—
Copilot PR Prompt Pattern Analysis	copilot	1	680K	$0.00	HIGH	—	—
Design Decision Gate 🏗️	claude	3	993K	$1.37	HIGH×2	MEDIUM×1	—
[aw] Failure Investigator (6h)	claude	1	304K	$1.39	HIGH	—	—
Auto-Triage Issues	copilot	5	570K	$0.00	—	—	—
Test Quality Sentinel	copilot	3	753K	$0.00	—	—	—
AI Moderator	codex	6	0	$0.00	MEDIUM×1	—	6
Issue Monster	copilot	2	387K	$0.00	—	—	—
Smoke CI	copilot	2	0	$0.00	—	—	—
GPL Dependency Cleaner	copilot	1	493K	$0.00	—	—	—
GitHub Remote MCP Auth Test	copilot	1	100K	$0.00	—	—	—
PR Triage Agent	copilot	1	344K	$0.00	—	—	—
Daily Hippo Learn	copilot	1	209K	$0.00	—	—	—
Bot Detection	copilot	1	76K	$0.00	—	—	—
Schema Feature Coverage Checker	codex	1	0	$0.00	—	—	1
Daily AstroStyleLite Spellcheck	claude	1	0	$0.00	—	—	—
Scout / Q / /cloclo / Grumpy / Security / PR Nitpick	none	6	0	$0.00	—	—	—
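
For readers who want to re-rank or aggregate the table above, the sketch below parses the abbreviated token counts and sums cost per engine. It uses a small subset of rows copied from the table; the `parse_tokens` helper and the tuple layout are assumptions made for this example, not part of the kit.

```python
from collections import defaultdict


def parse_tokens(text):
    """Convert a token count such as '4.9M', '862K', or '0' to an integer."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    suffix = text[-1].upper()
    if suffix in multipliers:
        return round(float(text[:-1]) * multipliers[suffix])
    return int(text)


# (workflow, engine, tokens, cost in USD): a subset of the table rows.
rows = [
    ("Documentation Unbloat", "claude", "4.9M", 2.40),
    ("Layout Specification Maintainer", "copilot", "3.5M", 0.00),
    ("jsweep - JavaScript Unbloater", "copilot", "2.4M", 0.00),
    ("Schema Consistency Checker", "claude", "1.0M", 1.40),
    ("CLI Version Checker", "claude", "862K", 0.97),
    ("AI Moderator", "codex", "0", 0.00),
]

# Rank workflows by token consumption, highest first.
top_three = sorted(rows, key=lambda r: parse_tokens(r[2]), reverse=True)[:3]
print([name for name, _, _, _ in top_three])

# Sum estimated cost per engine.
cost_by_engine = defaultdict(float)
for _, engine, _, cost in rows:
    cost_by_engine[engine] += cost
print({engine: round(total, 2) for engine, total in cost_by_engine.items()})
```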

Behavioral Fingerprint Breakdown

Profile	Count	Notes
directed / read_only / lean	16	Healthy baseline — no action needed
exploratory / read_only / heavy	12	Primary risk tier — most resource_heavy flags here
directed / read_only / moderate	6	Normal range
exploratory / selective_write / heavy	4	Highest-risk tier — write posture + heavy cost
adaptive / read_only / moderate	3	Normal range
exploratory / read_only / moderate	2	Monitor
directed / read_only / heavy	1	Occasional
unknown	6	Runs with no fingerprint data

The exploratory / selective_write / heavy cohort (4 runs) is the highest-priority group. These runs take broad exploratory action and also write output, so it is worth checking whether the write posture is intentional for each.
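
The breakdown above can be tallied and filtered with a few lines of Python. The counts are copied from the table; the `is_highest_risk` rule (exploratory, not read-only, heavy) is an assumption that mirrors the report's description of the highest-risk tier rather than the kit's own logic.

```python
from collections import Counter

# (execution style, write posture, resource weight) -> number of runs
profile_counts = Counter({
    ("directed", "read_only", "lean"): 16,
    ("exploratory", "read_only", "heavy"): 12,
    ("directed", "read_only", "moderate"): 6,
    ("exploratory", "selective_write", "heavy"): 4,
    ("adaptive", "read_only", "moderate"): 3,
    ("exploratory", "read_only", "moderate"): 2,
    ("directed", "read_only", "heavy"): 1,
})
unknown_runs = 6  # runs with no fingerprint data


def is_highest_risk(profile):
    """Exploratory runs that write output and are resource heavy."""
    style, write_posture, weight = profile
    return (style == "exploratory"
            and write_posture != "read_only"
            and weight == "heavy")


fingerprinted = sum(profile_counts.values())
flagged = sum(n for profile, n in profile_counts.items()
              if is_highest_risk(profile))

print(fingerprinted + unknown_runs)  # 50
print(flagged)                       # 4
```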

Optimization Candidates (low-urgency)

These workflows are consistently lean, directed, and narrow — good candidates for deterministic automation if they require no real inference:

Auto-Triage Issues — 5 runs, all stable, directed/read_only/lean. If the triage logic is rule-based, this could be a deterministic GitHub Actions step.
Smoke CI — 2 runs, lean, no assessments. Appears well-controlled.
Bot Detection — 1 run, lean. If detection rules are static, consider deterministic automation.
Daily AstroStyleLite Markdown Spellcheck — 1 run, 0 tokens, lean. Likely a wrapper — verify it's doing useful work.

Workflows always using latest_success fallback (no cohort match possible yet): Most single-run workflows fall here by definition. As they accumulate history, baseline quality will improve.

Data Collection Notes

The 14-day window returned only 50 runs (all from 2026-04-20). The logs tool hit a timeout after ~2 minutes. A continuation cursor (before_run_id: 24645419031, start_date: 2026-04-06) is available for deeper historical analysis.
All 50 episodes are standalone — no edges[] were populated, meaning no multi-workflow DAGs were detected. This could reflect the true topology (all workflows run independently) or a lineage detection gap if any workflows triggered others.
6 runs have no engine attribution (none), likely slash-command or manual triggers with incomplete metadata.
Agentic fraction data was available for most runs, which supports the accuracy of the partially_reducible assessments.
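
Resuming from the continuation cursor mentioned above generally follows a simple pagination loop. The sketch below is generic: `fetch_page`, the `next_cursor` key, and the page shape are hypothetical stand-ins for whatever the logs tool actually returns, and the run IDs in the fake pages are placeholders.

```python
def collect_runs(fetch_page, cursor, max_pages=20):
    """Follow continuation cursors until they run out or max_pages is hit."""
    runs = []
    for _ in range(max_pages):
        page = fetch_page(cursor)
        runs.extend(page["runs"])
        cursor = page.get("next_cursor")
        if cursor is None:
            break  # no more history to fetch
    return runs


def fake_fetch_page(cursor):
    """Stand-in for the real logs call; returns two pages of placeholder IDs."""
    if cursor["before_run_id"] == 24645419031:
        return {
            "runs": [3, 2],
            "next_cursor": {"before_run_id": 2,
                            "start_date": cursor["start_date"]},
        }
    return {"runs": [1], "next_cursor": None}


# Cursor values taken from the note above.
start = {"before_run_id": 24645419031, "start_date": "2026-04-06"}
print(collect_runs(fake_fetch_page, start))  # [3, 2, 1]
```

Capping the loop with `max_pages` keeps a single collection pass bounded, which matters when each page request can itself time out.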

References:

§24650414222: Design Decision Gate 🏗️ (latest changed run)
§24650053759: Daily CLI Tools Exploratory Tester (42 blocked)
§24650923335: Daily CLI Performance Agent (19 blocked)

Generated by Agentic Observability Kit · ● 887.4K

expires on Apr 27, 2026, 9:07 AM UTC

2026-04-21T17:14:20Z

github-actions[bot]
Bot Apr 21, 2026
Author

This discussion has been marked as outdated by Agentic Observability Kit.

A newer discussion is available at Discussion #27636.
