[agentic-optimization-kit] Weekly Agentic Optimization Report — 2026-04-21 to 2026-04-27 #28742
> Note: This discussion has been marked as outdated by Agentic Optimization Kit. A newer discussion is available at Discussion #28810.
📊 Executive Summary
📈 Visual Diagnostics
1. Token Usage by Workflow
Decision: Contribution Check leads at 5.4M tokens (6 runs), followed by Test Quality Sentinel (3.8M, 16 runs) and Daily Syntax Error Quality Check (3.4M, 1 run) — all flagged expensive-per-run (>100k avg); the distribution is broad across 61 workflows with no single dominant consumer.
2. Historical Token Trend
Decision: Token consumption is trending downward week-over-week (76.6M on Apr 22 → 67.7M on Apr 27), suggesting recent optimizations on Test Quality Sentinel, Smoke CI, and Architecture Guardian are taking effect.
3. Episode Risk–Cost Frontier
Decision: All 54 analyzed episodes scored risk=0 with zero estimated cost — the 30-day window shows a stable, low-risk portfolio with no Pareto-frontier outliers.
Why it matters: Zero risk scores confirm that recent investments in reliability and tool-scoping are holding; the cost axis is suppressed by Copilot billing, but the risk axis is the primary safety signal.
4. Workflow Stability Matrix
Decision: Instability is concentrated in the Smoke CI and Visual Regression Checker workflows, which show incident rates of 100% and 140% respectively (errors equal to or exceeding runs); all other workflows show clean stability profiles.
Why it matters: Smoke CI's error concentration (22 errors / 22 runs) is a known issue from trigger misconfiguration; Visual Regression Checker has infrastructure dependencies that cause intermittent failures. Neither indicates agentic control degradation.
5. Repository Portfolio Map
Decision: High-value/low-cost quadrant (KEEP) holds Test Quality Sentinel and Issue Monster; high-value/high-cost (OPTIMIZE) holds Contribution Check and daily analysis workflows; most single-run workflows cluster in SIMPLIFY/REVIEW.
Why it matters: The dominant tradeoff is between high-frequency low-cost workflows (high ROI) vs. single-run deep-analysis workflows (high cost, uncertain recurrence value) — the OPTIMIZE quadrant is the primary cost lever.
🚨 Escalation Targets
No workflows crossed escalation thresholds in the 14-day analysis window. Zero episodes were marked `escalation_eligible`. No repeated risky classifications, new MCP failures, or blocked-request increases were detected.
🎯 Optimization Target: Daily Syntax Error Quality Check
Why selected: Highest total tokens (3,399,393) among workflows not optimized in the last 14 days; not a self-targeting workflow.
Runs analyzed: 1 over 7 days | Avg tokens/run: 3,399,393 | Avg cost/run: $0.00 (Copilot) | Avg turns/run: 56
Key signals from assessment:
- `resource_heavy_for_domain` — high severity (56 turns for a 2-workflow test; expected ≤ 35)
- `partially_reducible` — `agentic_fraction=0.50`, meaning ~28 turns are pure data-gathering suitable for deterministic pre-steps

Recommended changes:
- Add a `steps:` pre-step running `find .github/workflows -name '*.md' ! -name 'daily-*' ! -name '*-test.md'` and writing the results to `/tmp/gh-aw/agent/candidates.txt`; the agent reads the file directly
- `max-turns: 35` cap to prevent turn runaway
- `timeout-minutes: 15` plus the prompt instruction "Complete in ≤ 35 turns"
- Move the `gh extension install/upgrade gh-aw` block to a workflow `steps:` pre-step so it doesn't count against agentic turns

💡 5 Actionable Prompts
🔧 Prompt 1 — Optimization (Daily Syntax Error Quality Check)
🛡️ Prompt 2 — Stability Fix (Smoke CI)
🔀 Prompt 3 — Consolidation (Daily Analysis Workflows)
✂️ Prompt 4 — Right-sizing (Package Specification Extractor)
🚀 Prompt 5 — Portfolio Maintenance (REVIEW-quadrant workflows)
Full Per-Workflow Baseline Breakdown (7 days)
Episode Detail (30-day)
All 54 analyzed episodes returned `episode_risk_score = 0` and `escalation_eligible = false`. This reflects the 30-day window being dominated by standalone episodes (1 run each) with clean execution profiles — no MCP failures, no blocked requests, no risky classifications, and no control degradation signals.

The 30-day dataset returned 54 episodes from 54 runs (all standalone). The 7-day baseline contains 119 runs, indicating episode stitching operates on a longer lookback than the download window captured. No multi-run episodes with compound risk signals were detected in the available data.
Portfolio Opportunities
Consolidation candidates (same domain + schedule + behavioral fingerprint):
Stale workflows (1 run in 7 days, no errors, low turn count — candidates for weekly cadence):
Overkill candidates (not selected as optimization target this week):
Optimization Analysis Detail: Daily Syntax Error Quality Check
Workflow: `daily-syntax-error-quality.md`
Run analyzed: 24991120216
Tokens: 3,399,393 | Turns: 56 | Duration: 9m1s | Action minutes: 10
Area 1: Tool Usage
Only 1 run available (below the ≥5-run threshold for tool-removal recommendations). Tools configured: `bash` (allowlisted commands: `find`, `compile`, `cat`, `head`, `cp`, `mkdir`), `mount-as-clis: true`, `safe-outputs: create-issue`. All configured tools are used by design. No tool removal is recommended at the current sample size.
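As a rough sketch, the tool surface described above might look like this in the workflow's frontmatter. The field layout is an assumption based on the report's own terms, not a verified gh-aw schema:

```yaml
# Illustrative frontmatter sketch of the configured tool surface.
# Field names follow the report; verify the exact layout against gh-aw docs.
tools:
  bash: ["find", "compile", "cat", "head", "cp", "mkdir"]  # allowlisted commands
mount-as-clis: true
safe-outputs:
  create-issue:   # the workflow's only write path
```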
Area 2: Token Efficiency
56 turns for a 2-workflow compiler test is 40-60% above expected for this task shape (estimate: 30-35 turns).
`agentic_fraction=0.50` means ~28 turns are read-gather operations. Cache hit rate: not available from a single run. The workflow prompt is ~400 lines, with ~150 lines devoted to scoring rubrics that are read but not iteratively consulted.

Key waste drivers:
Opportunity: Move Phase 1 to a deterministic pre-step (save ~8-10 turns).
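Concretely, the recommended changes could be sketched as a frontmatter edit along these lines. The `steps:` pre-step placement and field names follow the report's recommendations; the exact gh-aw schema is an assumption:

```yaml
# Illustrative frontmatter changes; verify field names against gh-aw docs.
max-turns: 35        # hard cap to prevent turn runaway
timeout-minutes: 15  # reduced from 20; safety net behind the turn cap
steps:
  # Deterministic pre-step: gather candidate workflows up front so the
  # agent reads one file instead of spending ~8-10 turns on discovery.
  - name: Collect candidate workflows
    run: |
      mkdir -p /tmp/gh-aw/agent
      find .github/workflows -name '*.md' ! -name 'daily-*' ! -name '*-test.md' \
        > /tmp/gh-aw/agent/candidates.txt
```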
Area 3: Reliability
0 errors, 0 warnings. The `resource_heavy_for_domain` assessment is HIGH severity — this is a turn-count signal, not a functional failure. The `partially_reducible` assessment is LOW severity. The workflow has not failed in the observed run.

Risk: Without a `max-turns` cap, turn runaway is possible if the agent encounters unexpected compiler output or retries. The current `timeout-minutes: 20` provides a safety net but allows up to ~80 turns at ~15 s/turn.

Area 4: Prompt Efficiency
The prompt has several over-specified sections:
Prompt reduction target: ~200 lines → saves approximately 50-80k tokens in prompt context per run.
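The turn figures quoted in Areas 2 and 3 can be checked with quick arithmetic (a standalone sketch; the ~15 s/turn rate is the report's own rough estimate):

```python
turns = 56                # observed turns in the analyzed run
agentic_fraction = 0.50   # share of turns doing genuine agentic work

# Turns spent on pure data-gathering, movable to a deterministic pre-step:
reducible_turns = round(turns * (1 - agentic_fraction))
print(reducible_turns)    # 28

# Without a max-turns cap, only the 20-minute timeout bounds the run:
seconds_per_turn = 15     # report's rough per-turn estimate
timeout_minutes = 20
turn_ceiling = timeout_minutes * 60 // seconds_per_turn
print(turn_ceiling)       # 80
```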
References: