📊 Executive Summary
No risky runs, no escalation-eligible episodes. Repository is operating within safe operational bounds.
📈 Visual Diagnostics
1. Token Usage by Workflow
Decision: Test Quality Sentinel and Contribution Check lead at 6.2M tokens each; every workflow is below the 30% dominance threshold, indicating a healthy distribution.
2. Historical Token Trend
Decision: 18 daily data points are available; the token trend is rising week-over-week as more workflows are added, but per-run averages remain stable.
3. Episode Risk–Cost Frontier
Decision: Agentic Optimization Kit and Agentic Observability Kit sit at the frontier — highest token cost AND highest composite risk scores driven by 50 blocked requests and resource-heavy/poor-control assessments. Why it matters: Cost and risk are co-located in the same two workflows; addressing either one also reduces the other. No escalation threshold has been crossed, but both are strong optimization candidates.
4. Workflow Stability Matrix
Decision: Instability is concentrated in single-run workflows (Agent Persona Explorer, Agentic Observability Kit, Draft PR Cleanup) — these show high resource-heavy rates but zero risky or MCP-failure signals. Why it matters: Single-run instability scores are noisy; only Test Quality Sentinel (20 runs, consistently partially_reducible) provides statistically reliable signal for action.
5. Repository Portfolio Map
Decision: 32 workflows in simplify, 36 in review, 5 in keep, 1 in optimize (Test Quality Sentinel). Most single-run workflows land in review due to insufficient run history. Why it matters: The dominant portfolio tradeoff is run-frequency vs. token cost: low-frequency expensive workflows are the primary levers; high-frequency cheap ones (Smoke CI, small triage agents) are already right-sized.
🚨 Escalation Targets
No escalation thresholds crossed. Zero risky run classifications, zero escalation-eligible episodes, zero MCP failures, zero new blocked-request increases across all 141 episodes. Repository is clean.
🎯 Optimization Target: Agentic Observability Kit
Why selected: Highest total tokens among non-recently-optimized, non-self-referential workflows (1,986,286 tokens in 1 run). Carries 4 distinct agentic assessments: resource_heavy_for_domain (high), poor_agentic_control (medium), partially_reducible (medium), model_downgrade_available (low).
💡 5 Actionable Prompts
🔧 Prompt 1 — Optimization (Agentic Observability Kit, highest ROI)
Optimize the workflow file `.github/workflows/agentic-observability-kit.md` to reduce token usage by ~600,000 tokens/run.
The run (ID: 24986097479) used 1,986,286 tokens in 33 turns with cache efficiency of 49%. The `agentic_fraction` is only 0.06, meaning 94% of work is deterministic data collection.
Make these changes:
1. Add a `pre-steps:` section that runs `agenticworkflows logs --count 200 --start_date -30d > /tmp/gh-aw/agent/logs30d.json` before the agent starts. This eliminates the two in-agent `logs` MCP calls, saving ~400,000 tokens.
2. Update the agent prompt to reference `/tmp/gh-aw/agent/logs30d.json` instead of calling `logs` directly.
3. Reduce `count` from 400 to 200 — the repository has ~140 runs/week, so 200 covers the 30-day window adequately.
Expected savings: ~600,000 tokens/run (~30% reduction).
Evidence: run 24986097479 shows 2 `logs` MCP tool calls consuming the bulk of context.
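The change in steps 1 and 2 could look roughly like this in the workflow frontmatter — a minimal sketch, assuming gh-aw accepts a `pre-steps:` list of shell steps as described in this report (verify the exact key name against your installed gh-aw version):

```yaml
# Sketch only: pre-step for agentic-observability-kit.md frontmatter.
# The `pre-steps:` key is assumed from this report's recommendation.
pre-steps:
  - name: Pre-fetch 30-day logs
    run: |
      mkdir -p /tmp/gh-aw/agent
      agenticworkflows logs --count 200 --start_date -30d > /tmp/gh-aw/agent/logs30d.json
```

The agent prompt then points at `/tmp/gh-aw/agent/logs30d.json` instead of issuing `logs` MCP calls at runtime.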
🛡️ Prompt 2 — Stability Fix (Test Quality Sentinel, repeat partially_reducible)
Fix the `partially_reducible` pattern in `.github/workflows/test-quality-sentinel.md`. This workflow has 10 out of 20 runs flagged `partially_reducible` with `agentic_fraction` below 0.5.
The pattern: the agent is spending turns fetching test run data, listing files, and reading PR metadata — all of which are deterministic operations.
Make these changes:
1. Add `pre-steps:` to fetch test run artifacts and PR metadata into `/tmp/gh-aw/agent/` before the agent starts.
2. Pass the pre-fetched data paths in the agent prompt so the agent only performs analysis and judgment turns.
3. This should reduce average turns from 7.05 to 3–4, saving ~150,000 tokens/run × 20 runs = ~3,000,000 tokens/week.
Evidence: 10 runs flagged `partially_reducible` across 7 days, consistent across all run attempts.
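As a sketch of step 1, the pre-fetch could be a couple of `gh` CLI calls. The artifact name `test-results`, the `RUN_ID` and `PR_NUMBER` variables, and the `pre-steps:` key are all illustrative, not taken from the actual workflow:

```yaml
pre-steps:
  - name: Pre-fetch test artifacts and PR metadata
    run: |
      mkdir -p /tmp/gh-aw/agent
      # Download the test-results artifact from the triggering run (names hypothetical)
      gh run download "$RUN_ID" --name test-results --dir /tmp/gh-aw/agent/test-results
      # Snapshot PR metadata so the agent never fetches it turn-by-turn
      gh pr view "$PR_NUMBER" --json title,body,files > /tmp/gh-aw/agent/pr.json
```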
🔀 Prompt 3 — Consolidation (Issue Monster + Auto-Triage Issues, top overlap pair)
Consolidate `issue-monster.md` and `auto-triage-issues.md` — both operate in the `issue_response`/`triage` domain with overlapping behavior fingerprints (exploratory execution, selective_write actuation, standalone dispatch).
Overlap signals:
- Same task domain family (issue handling)
- Both have `partially_reducible` and `overkill_for_agentic` assessments
- Auto-Triage Issues has 4 `overkill_for_agentic` flags — the strongest overkill signal in the repository
Action:
1. Keep `issue-monster.md` as the base (broader capability set).
2. Absorb the triage-routing logic from `auto-triage-issues.md` into a conditional section in `issue-monster.md`.
3. Retire `auto-triage-issues.md` and point its trigger to `issue-monster.md`.
Expected benefit: Eliminates one full workflow's token footprint (~400k tokens/run) and removes the top `overkill_for_agentic` offender.
✂️ Prompt 4 — Right-sizing (Auto-Triage Issues, overkill workflow)
Right-size `.github/workflows/auto-triage-issues.md`. This workflow is the only one in the repository with repeated `overkill_for_agentic` assessments (4 runs flagged).
Why it's overkill:
- Domain: `triage` — the expected outcome is a label assignment or routing decision
- Tool breadth: broad (GitHub tools + MCP)
- Cost profile: high turns relative to task complexity
- `agentic_fraction` suggests most turns are retrieving context that could be passed deterministically
Recommended right-sizing:
1. Replace the AI agent with a deterministic GitHub Actions step using `actions/github-script` that reads issue labels and body keywords, then applies routing rules.
2. Reserve the AI agent only for ambiguous issues (no matching keywords) — add an `if:` condition to skip the agent when keyword matching succeeds.
3. Expected reduction: eliminate AI agent usage for ~70% of invocations, saving the full ~400k tokens/run for those cases.
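Steps 1 and 2 could be wired up as follows. The keyword→label map here is invented for illustration; the real routing rules would be lifted from the existing triage prompt:

```yaml
jobs:
  triage:
    runs-on: ubuntu-latest
    permissions:
      issues: write
    steps:
      - name: Deterministic keyword routing
        id: route
        uses: actions/github-script@v7
        with:
          script: |
            // Hypothetical rules — replace with the repository's actual triage policy
            const body = (context.payload.issue.body || '').toLowerCase();
            const rules = { 'crash': 'bug', 'feature request': 'enhancement', 'docs': 'documentation' };
            for (const [kw, label] of Object.entries(rules)) {
              if (body.includes(kw)) {
                await github.rest.issues.addLabels({
                  ...context.repo,
                  issue_number: context.payload.issue.number,
                  labels: [label],
                });
                core.setOutput('matched', 'true');
                return;
              }
            }
            core.setOutput('matched', 'false');
      # The AI agent runs only for the ambiguous ~30% of issues
      - name: Agentic triage fallback
        if: steps.route.outputs.matched == 'false'
        run: echo "hand off to the AI triage agent here"
```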
🚀 Prompt 5 — Portfolio Maintenance (review-quadrant workflows)
Audit the 36 workflows currently in the `review` quadrant (low value proxy, high avg token cost) of the repository portfolio map.
These workflows share a common pattern: they ran only once in the last 7 days and have no repeat-use history that establishes value. Many are in the `repo_maintenance` domain (Glossary Maintainer, Layout Specification Maintainer, Daily Community Attribution Updater, Slide Deck Maintainer).
Action plan for the repository owner:
1. For each `repo_maintenance` workflow that ran once: check if the output (PR, discussion, comment) had any human follow-up. Workflows with no follow-up signal low value.
2. For workflows where the output was auto-merged or auto-closed without review, convert to deterministic automation (shell script + cron) to eliminate AI inference cost.
3. Priority candidates for conversion: any workflow with `actuation_style: selective_write`, `agentic_fraction < 0.3`, and only 1 run in the window — these are doing mostly scripted work with unnecessary AI overhead.
This audit could identify 10–15 workflows for retirement or downgrade, potentially saving 5–8M tokens/week.
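For step 2, a converted workflow is ordinary scheduled Actions YAML. This sketch uses a hypothetical `scripts/update-glossary.sh` to stand in for whatever deterministic update the AI agent was performing:

```yaml
name: glossary-maintainer-deterministic   # illustrative replacement name
on:
  schedule:
    - cron: "0 6 * * *"   # daily, matching the old agentic schedule
permissions:
  contents: write
  pull-requests: write
jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Regenerate glossary (hypothetical script)
        run: ./scripts/update-glossary.sh
      - name: Open PR if anything changed
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          if ! git diff --quiet; then
            git config user.name "github-actions[bot]"
            git config user.email "github-actions[bot]@users.noreply.github.com"
            git checkout -b bot/glossary-update
            git commit -am "chore: update glossary"
            git push -u origin bot/glossary-update
            gh pr create --fill
          fi
```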
Full Per-Workflow Baseline Breakdown (7 days)
| Workflow | Runs | Avg Tokens/Run | Total Tokens | Avg Turns | Errors | Warnings | Flags |
|---|---:|---:|---:|---:|---:|---:|---|
| Test Quality Sentinel | 20 | 312,279 | 6,245,581 | 7.1 | 0 | 0 | expensive |
| Contribution Check | 6 | 1,035,602 | 6,213,609 | 32.0 | 0 | 0 | expensive |
| Daily Syntax Error Quality Check | 1 | 3,399,393 | 3,399,393 | 56.0 | 0 | 0 | expensive |
| Agentic Optimization Kit | 1 | 3,358,323 | 3,358,323 | 52.0 | 0 | 0 | expensive |
| Daily Community Attribution Updater | 1 | 3,239,894 | 3,239,894 | 54.0 | 0 | 0 | expensive |
| Package Specification Extractor | 1 | 3,003,796 | 3,003,796 | 51.0 | 0 | 0 | expensive |
| Documentation Noob Tester | 1 | 2,851,307 | 2,851,307 | 54.0 | 0 | 0 | expensive |
| Daily Firewall Logs Collector and Reporter | 1 | 2,486,927 | 2,486,927 | 35.0 | 0 | 0 | expensive |
| Agentic Observability Kit | 1 | 1,986,286 | 1,986,286 | 33.0 | 0 | 0 | expensive |
| The Daily Repository Chronicle | 1 | 1,942,276 | 1,942,276 | 35.0 | 0 | 0 | expensive |
| Architecture Diagram Generator | 1 | 1,912,218 | 1,912,218 | 32.0 | 0 | 0 | expensive |
| Copilot PR Conversation NLP Analysis | 1 | 1,885,397 | 1,885,397 | 31.0 | 0 | 0 | expensive |
| Weekly Safe Outputs Specification Review | 1 | 1,881,993 | 1,881,993 | 36.0 | 0 | 0 | expensive |
| Code Simplifier | 1 | 1,713,521 | 1,713,521 | 27.0 | 0 | 0 | expensive |
| Dead Code Removal Agent | 1 | 1,610,847 | 1,610,847 | 27.0 | 0 | 0 | expensive |
| Draft PR Cleanup | 1 | 1,016,517 | 1,016,517 | 9.0 | 0 | 0 | expensive |
| Super Linter Report | 1 | 919,046 | 919,046 | 14.0 | 0 | 0 | expensive |
| Organization Health Report | 1 | 848,419 | 848,419 | 17.0 | 0 | 0 | expensive |
| Workflow Skill Extractor | 3 | 272,869 | 818,607 | 9.0 | 0 | 0 | expensive |
| Refactoring Cadence | 1 | 757,823 | 757,823 | 16.0 | 0 | 0 | expensive |
| PR Triage Agent | 4 | 172,434 | 689,736 | 6.5 | 0 | 0 | expensive |
| Issue Monster | 3 | 223,244 | 669,731 | 11.7 | 0 | 0 | expensive |
| Auto-Triage Issues | 4 | 153,178 | 612,710 | 8.0 | 0 | 0 | expensive |
| Issue Triage Agent | 3 | 189,773 | 569,318 | 7.7 | 0 | 0 | expensive |
| Architecture Guardian | 1 | 548,340 | 548,340 | 8.0 | 0 | 0 | expensive |
| Daily Observability Report | 3 | 163,459 | 490,378 | 7.0 | 0 | 0 | expensive |
| Daily Hippo Learn | 1 | 460,267 | 460,267 | 8.0 | 0 | 0 | expensive |
| Daily Regulatory Report Generator | 1 | 447,200 | 447,200 | 9.0 | 0 | 0 | expensive |
| Daily MCP Tool Concurrency Analysis | 1 | 419,124 | 419,124 | 8.0 | 0 | 0 | expensive |
| Smoke CI | 14 | 27,282 | 381,954 | 1.5 | 0 | 0 | normal |
| Copilot CLI Deep Research Agent | 1 | 373,978 | 373,978 | 6.0 | 0 | 0 | expensive |
| Daily Workflow Updater | 1 | 355,617 | 355,617 | 7.0 | 0 | 0 | expensive |
| Slide Deck Maintainer | 1 | 340,069 | 340,069 | 5.0 | 0 | 0 | expensive |
| Delight | 1 | 338,097 | 338,097 | 5.0 | 0 | 0 | expensive |
| Daily DIFC Integrity-Filtered Events Analyzer | 1 | 328,929 | 328,929 | 5.0 | 0 | 0 | expensive |
| Copilot Token Usage Optimizer | 1 | 321,777 | 321,777 | 5.0 | 0 | 0 | expensive |
| Layout Specification Maintainer | 1 | 288,742 | 288,742 | 7.0 | 0 | 0 | expensive |
| Daily Copilot PR Merged Report | 1 | 285,750 | 285,750 | 4.0 | 0 | 0 | expensive |
| Daily Malicious Code Scan Agent | 1 | 282,803 | 282,803 | 4.0 | 0 | 0 | expensive |
| Visual Regression Checker | 1 | 275,380 | 275,380 | 6.0 | 0 | 0 | expensive |
| MCP Inspector Agent | 1 | 252,619 | 252,619 | 5.0 | 0 | 0 | expensive |
| Glossary Maintainer | 1 | 244,760 | 244,760 | 5.0 | 0 | 0 | expensive |
| Weekly Issue Summary | 1 | 237,668 | 237,668 | 3.0 | 0 | 0 | expensive |
| Agent Persona Explorer | 1 | 221,756 | 221,756 | 3.0 | 0 | 0 | expensive |
| Daily Compiler Quality Check | 3 | 73,175 | 219,524 | 2.7 | 0 | 0 | normal |
| Copilot Opt | 1 | 193,543 | 193,543 | 3.0 | 0 | 0 | expensive |
| Sub-Issue Closer | 1 | 187,498 | 187,498 | 3.0 | 0 | 0 | expensive |
| Daily Observability Diff | 3 | 57,985 | 173,955 | 2.7 | 0 | 0 | normal |
| Metrics Collector - Infrastructure Agent | 1 | 155,888 | 155,888 | 2.0 | 0 | 0 | expensive |
| Daily DIFC Encrypted Channels Analyzer | 1 | 138,083 | 138,083 | 2.0 | 0 | 0 | expensive |
Episode Detail
All 141 episodes are standalone with high confidence. No multi-run episodes or DAG lineage detected — each workflow run is independent with no shared lineage markers.
Top episodes by composite risk score (weighted: risky×1.0, poor_control×1.2, mcp_fail×1.2, blocked×1.0, new_mcp_fail×1.4, blocked_increase×1.4, escalation×2.0):
| Episode (Primary Workflow) | Risk Score | Tokens | Blocked Reqs | Resource Heavy | Poor Control | Escalation |
|---|---:|---:|---:|:-:|:-:|:-:|
| Agentic Optimization Kit | 64.0 | 3,358,323 | 50 | ✅ | ✅ | ❌ |
| Agentic Observability Kit | 51.2 | 1,986,286 | 50 | ✅ | ✅ | ❌ |
| Copilot PR Conversation NLP Analysis | 50.0 | 1,885,397 | 50 | ✅ | ✅ | ❌ |
| Daily Firewall Logs Collector and Reporter | 47.0 | 2,486,927 | 50 | ✅ | — | ❌ |
| Documentation Noob Tester | 47.0 | 2,851,307 | 50 | ✅ | — | ❌ |
The blocked_request_count of 50 appearing in multiple high-token workflows suggests a shared network firewall limit is being hit consistently. This is expected behavior (firewall enforcement), not a regression, but it contributes to episode risk scores.
Zero episodes are escalation_eligible. Zero risky run classifications. Zero MCP failures.
Stale workflows (1 run in window, 0 follow-up signals): 66 workflows ran only once. Without a longer window, stale/new cannot be distinguished — revisit in 4 weeks with a 30-day dataset.
| Dimension | Finding | Recommendation |
|---|---|---|
| Tool usage | Only `agenticworkflows.logs` and `safeoutputs.*` tools used. The tool set is minimal and appropriate, but `logs` is called twice (once per data slice), consuming ~1.9M input tokens. | Keep all tools; consolidate the `logs` calls into a pre-step |
| Token efficiency | 1,963,317 input tokens, 1,883,658 cache reads (49% cache efficiency). Low cache hit rate relative to total context; two large `logs` payloads dominate input token cost. | Pre-download data to eliminate runtime MCP payload overhead |
| Prompt structure | `poor_agentic_control` (medium) plus 33 turns suggests the 26KB prompt generates many back-and-forth data-fetch cycles; `agentic_fraction=0.06` confirms only 2 turns are truly agentic judgment. | Break the prompt into 3 explicit phases with expected outputs; move the 94% of turns spent on data collection into pre-steps |
Assessment details:
- `resource_heavy_for_domain` (high): 1,986,286 tokens for the `issue_response` domain is extremely heavy; a typical issue response runs 50–200k
- `poor_agentic_control` (medium): 33 turns with only a 0.06 agentic fraction indicates many small exploratory turns
- `partially_reducible` (medium): 94% of turns are data-gathering that is replaceable with deterministic steps
- `model_downgrade_available` (low): a smaller model can handle the formatting/charting subtasks
References:
§24986097479 — Agentic Observability Kit run analyzed
Runs analyzed: 1 over 7 days | Avg tokens/run: 1,986,286 | Avg cost/run: $0.00 | Avg turns/run: 33 | Action minutes: 29 | Cache efficiency: 49%
Recommended changes:
- Run `agenticworkflows logs --count 400 --start_date -30d > /tmp/gh-aw/agent/logs30d.json` as a pre-step and pass the file path to the agent
- Reduce the `logs` count from `count: 400` to `count: 200`; the run shows 2 `logs` calls — merge them into one pre-step invocation
- The `model_downgrade_available` assessment confirms eligibility; add model routing or split heavy analysis into a sub-step
- Mitigate `poor_agentic_control` by defining explicit section checkpoints in the prompt
- Add an `mcp-scripts` pre-step to pre-load baseline data deterministically
- `agentic_fraction: 0.06` confirms 94% of the work is non-agentic; deterministic pre-steps can replace most data-collection turns
Portfolio Opportunities
Overkill workflows (repeated `overkill_for_agentic` assessments): Auto-Triage Issues (4 runs flagged)
Partially reducible workflows (consistent across runs): Test Quality Sentinel (10 of 20 runs flagged)
Overlap pairs (same domain + behavior cluster):
- `triage` domain: Auto-Triage Issues ↔ Issue Triage Agent ↔ PR Triage Agent (3-way overlap; all standalone, exploratory, selective_write)
- `issue_response` domain: Issue Monster ↔ Sub-Issue Closer ↔ MCP Inspector Agent ↔ Agentic Observability Kit
- `repo_maintenance` domain: Glossary Maintainer ↔ Layout Specification Maintainer ↔ Slide Deck Maintainer (similar output shape and schedule)
Optimization Analysis Detail: Agentic Observability Kit
Run analyzed: `24986097479` | 2026-04-27 | 1,986,286 tokens | 33 turns | 29 action minutes
Behavior fingerprint: `execution_style=exploratory`, `tool_breadth=narrow`, `actuation_style=selective_write`, `resource_profile=heavy`, `dispatch_mode=standalone`, `agentic_fraction=0.06`
Tool usage (from episode tool_calls): `logs` (2 calls) → `create_discussion` (1) → `upload_asset` (4)