[observability] Agentic Observability Report — 2026-03-24 to 2026-04-23 #28011
Closed
Replies: 1 comment
This discussion has been marked as outdated by Agentic Observability Kit. A newer discussion is available at Discussion #28682.
Date range analyzed: 2026-03-24 → 2026-04-23 (30 days)
Workflow run: #24822586404
Executive Summary
46 runs analyzed across 21 distinct workflows. All 46 episodes are standalone (no multi-run DAGs detected; `edges[]` is empty). Episode confidence is uniformly high. Total spend is ~$6.04 in estimated cost (where available) and 231 action-minutes.

The primary concerns are concentrated in the Smoke test family and a handful of scheduled heavy workflows:
- `resource_heavy_for_domain` (high severity), `poor_agentic_control` (medium severity), and `write_heavy` actuation — an unusual posture for a smoke test.
- A research-domain workflow with `resource_heavy_for_domain` (high severity) and a partially reducible pre-step opportunity.
- `turns_increase`/`turns_decrease` drift — no escalation threshold crossed.

No episode has `escalation_eligible == true`. No escalation issue is created.

Key Metrics
- `cls_label == risky`
- `resource_heavy_for_domain` (any severity)
- `poor_agentic_control` (medium+)
- `partially_reducible`
- `overkill_for_agentic`
- `model_downgrade_available`

Highest Risk Episodes
- `resource_heavy` + `poor_agentic_control` (med) + `write_heavy` actuation
- `resource_heavy` + `poor_agentic_control` (med) + `write_heavy` actuation
- `resource_heavy`, `selective_write`
- `resource_heavy` (high), `partially_reducible`

Smoke Claude / Smoke Copilot are the most concerning: these are smoke tests that should be lean and narrow, but both showed `write_heavy` actuation (safe outputs: PR comments, reviews, code pushes, issue creation) and high resource consumption. The `poor_agentic_control` assessment indicates the control-loop quality is below expectations for a smoke-test domain.

Note: No workflow cleared the escalation thresholds (2+ risky-labeled runs, 2+ new MCP failures, 2+ blocked-request increases, or 2+ high-severity resource/control assessments for the same workflow within 14 days). Smoke Claude has 2 `resource_heavy_for_domain` high assessments and 2 `poor_agentic_control` medium assessments — borderline. These are cross-engine comparison smoke tests, so some write behavior may be intentional, but the posture warrants owner review.

Episode Regressions
Test Quality Sentinel shows execution drift: 4–14 turns across 8 runs (avg 6.9), with several `turns_increase` and `turns_decrease` classifications against cohort baselines. This indicates changing task shape or prompt instability rather than a consistent failure mode. Severity is low to medium.

Design Decision Gate shows similar turn variability (3 `turns_increase` + 3 `turns_decrease` vs. cohort) but is stable in cost and control quality. Drift here is benign.

Smoke CI had one `turns_increase` run (24819282263) against a stable cohort — minor.

Visual Diagnostics
1. Episode Risk-Cost Frontier
Decision: Smoke Claude, Failure Investigator, and Smoke Copilot occupy the Pareto frontier — expensive and risky relative to the repository.
Why it matters:
`estimated_cost` is sparse (zeroed for most workflows), so action-minutes serves as the cost proxy. Smoke Claude and Smoke Copilot combine moderate cost with the highest risk scores, driven by write-heavy actuation and poor-control assessments. The Failure Investigator is cost-heavy but lower-risk, making it an optimization candidate rather than an immediate control concern.

2. Workflow Stability Matrix
Decision: Smoke Claude and Smoke Copilot are the least stable workflows; the broad majority of the portfolio is clean across all six stability dimensions.
Why it matters: Resource-heavy rate and poor-control rate are concentrated in the Smoke engine family, not spread across unrelated workflows. The repository does not have broad instability — it has two point sources that need attention. All scheduled and PR-triggered operational workflows (CI, Quality Sentinel, Design Gate) read as stable or low-noise.
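The matrix itself is just per-workflow rates over a handful of boolean dimensions. A minimal sketch of that aggregation in Python, assuming a simplified episode record (the `Episode` fields and the four dimensions shown here are illustrative stand-ins, not the kit's actual schema):

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical episode record; field names are illustrative, not the
# Agentic Observability Kit's real schema.
@dataclass
class Episode:
    workflow: str
    risky: bool = False
    resource_heavy: bool = False
    poor_control: bool = False
    failed: bool = False

DIMENSIONS = ("risky", "resource_heavy", "poor_control", "failed")

def stability_matrix(episodes):
    """Per-workflow rate for each stability dimension (0.0 = clean)."""
    counts = defaultdict(lambda: {"runs": 0, **{d: 0 for d in DIMENSIONS}})
    for ep in episodes:
        row = counts[ep.workflow]
        row["runs"] += 1
        for d in DIMENSIONS:
            row[d] += bool(getattr(ep, d))
    # Report rates rather than raw counts so workflows with different
    # run volumes remain comparable.
    return {wf: {d: row[d] / row["runs"] for d in DIMENSIONS} | {"runs": row["runs"]}
            for wf, row in counts.items()}
```

Under this shape, "two point sources" falls out directly: only the workflows whose rates are nonzero across several dimensions need attention.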
3. Repository Portfolio Map
Decision: Design Decision Gate, Test Quality Sentinel, and Smoke CI belong in optimize (high value, high run count); Smoke Claude and Smoke Copilot belong in review (high cost, low value proxy given instability).
Why it matters: The portfolio is healthy in its core PR-automation cluster. The
`review` quadrant contains workflows that are either resource-heavy smoke tests (Smoke Claude/Copilot), infrequently run heavy schedulers (Failure Investigator, Daily CLI Tester, Code Simplifier, Daily CLI Performance Agent), or one-time non-recurring runs. The `simplify` quadrant (Auto-Triage, PR Triage Agent, Contribution Check) contains workflows that may benefit from model downgrades or deterministic pre-filtering.

4. Workflow Overlap Matrix
Decision: The Smoke engine family (Claude, Copilot, Codex, OpenCode, Crush, Gemini) forms the strongest overlap cluster; Design Decision Gate and Test Quality Sentinel show moderate co-occurrence but serve distinct purposes.
Why it matters: Overlap within the Smoke family is intentional and structural — they are cross-engine regression tests. The overlap between Design Gate and Test Quality Sentinel is weak (different domains, different engines) and does not suggest consolidation. No unintentional duplications are detected in this window.
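One common way to quantify this kind of overlap cluster is Jaccard similarity over per-workflow feature sets (for example, tools invoked or paths touched). The feature-set input is an assumption for illustration, not necessarily how the kit computes its matrix:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two feature sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def overlap_matrix(features: dict) -> dict:
    """Pairwise overlap between workflows, keyed by sorted name pairs."""
    names = sorted(features)
    return {(x, y): jaccard(features[x], features[y])
            for i, x in enumerate(names) for y in names[i + 1:]}
```

A score near 1.0 within the Smoke family and near 0.0 between Design Gate and Test Quality Sentinel would reproduce the report's conclusion: structural overlap in one cluster, no consolidation signal elsewhere.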
Portfolio Opportunities
Portfolio details by quadrant
Review (high cost, low value proxy):
The `partially_reducible` signal suggests structured log pre-filtering (deterministic) could reduce token overhead before the agentic layer.

Simplify (low cost, lower value proxy):
- `overkill_for_agentic` (low severity). A deterministic label-matching rule or a smaller model could handle this task at lower overhead.
- `partially_reducible` + `model_downgrade_available`. Good candidate for a lighter model; triage tasks rarely need full reasoning depth.
- `resource_heavy_for_domain` (high severity) despite being a scheduled check. If it fires infrequently, this is acceptable; if frequent, a focused deterministic check would outperform it.

Optimize (high value, high run count):
Recommended Actions
Smoke Claude / Smoke Copilot — Review the write-capable tool list. Both workflows exhibit `write_heavy` actuation and `poor_agentic_control` signals. If write actions are intentional test coverage, add explicit scope guards. If not, remove write-capable tools from the smoke test configuration. (Priority: medium — 2 runs each, patterns consistent)

Smoke Crush — Investigate the 100% failure rate. 2/2 runs failed. Check whether the Crush engine endpoint is misconfigured or whether a schema change broke the smoke test. (Priority: high — reliability hotspot)

Daily CLI Tools Exploratory Tester — Audit the network allowlist. A 49% blocked-request rate suggests the workflow is routinely attempting out-of-scope domains. Either narrow the tool's exploration directives or expand the allowlist deliberately. (Priority: medium)
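A blocked-request rate like the 49% figure above can be recomputed deterministically from egress logs before deciding which way to resolve it. A sketch, assuming a hypothetical `(domain, allowed)` log shape rather than any real firewall format:

```python
def blocked_request_rate(requests):
    """Fraction of egress requests denied by the allowlist.

    `requests` is an iterable of (domain, allowed) pairs; this shape is
    hypothetical and would really be parsed from proxy/firewall logs.
    Returns (rate, per-domain blocked counts sorted descending).
    """
    total = blocked = 0
    blocked_domains = {}
    for domain, allowed in requests:
        total += 1
        if not allowed:
            blocked += 1
            blocked_domains[domain] = blocked_domains.get(domain, 0) + 1
    rate = blocked / total if total else 0.0
    # The top offenders tell you whether to narrow the exploration
    # directives (many scattered domains) or widen the allowlist
    # (a few legitimate domains dominating the blocks).
    top = sorted(blocked_domains.items(), key=lambda kv: -kv[1])
    return rate, top
```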
[aw] Failure Investigator (6h) — Add a deterministic log-download and pre-filter step before the agentic layer to reduce token load. The `partially_reducible` signal at a $1.89/run cost makes this the highest-value optimization target in the repository. (Priority: low — single run in 30 days)

Test Quality Sentinel — Investigate prompt or task-shape changes causing 4–14 turn variance. A stable prompt with consistent input would yield more reliable cohort baselines. (Priority: low — no control failures)
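The `turns_increase`/`turns_decrease` labels used throughout this report describe a run's turn count relative to its cohort baseline. A minimal sketch of one plausible classification rule; the z-score form and the 1.5 threshold are assumptions for illustration, not the kit's documented rule:

```python
import statistics

def classify_turn_drift(turns, cohort, z_threshold=1.5):
    """Label a run's turn count against its cohort baseline.

    Returns 'turns_increase', 'turns_decrease', or 'stable'.
    `cohort` is the list of turn counts from prior comparable runs.
    """
    if len(cohort) < 2:
        return "stable"  # not enough history to establish a baseline
    mean = statistics.mean(cohort)
    stdev = statistics.stdev(cohort)
    if stdev == 0:
        # Perfectly flat cohort: any deviation is drift.
        if turns == mean:
            return "stable"
        return "turns_increase" if turns > mean else "turns_decrease"
    z = (turns - mean) / stdev
    if z > z_threshold:
        return "turns_increase"
    if z < -z_threshold:
        return "turns_decrease"
    return "stable"
```

Under a rule like this, the Sentinel's 4–14 turn spread would produce exactly the mixed increase/decrease labels seen here: a noisy baseline flags both tails without indicating a consistent failure mode.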
Per-workflow detail
Observability note: `edges[]` is empty across all 46 runs — no cross-run lineage was established in this 30-day window. All episodes are treated as standalone by the deterministic model, which is consistent with the repository's primarily PR-event and schedule-triggered architecture. No multi-run DAG chains were missed.

Cost note: `estimated_cost` is zero for the majority of workflows (Copilot engine costs are not surfaced). All efficiency comparisons use `action_minutes` as the primary cost proxy. Claude-engine workflows ($0.21–$1.89/run) are the only ones with non-zero estimated cost.

References: