[observability] Agentic Observability Report — 2026-04-21 #27636
This discussion has been marked as outdated by Agentic Observability Kit. A newer discussion is available at Discussion #28011.
Date range analyzed: 2026-03-22 → 2026-04-21 (30-day window requested; all 49 sampled runs occurred on 2026-04-21 — a single-day batch, likely triggered by a CI push event and workflow suite execution. Trend comparisons are therefore within-day only.)
Executive Summary
49 runs across 15 workflows were analyzed. No episodes were flagged `escalation_eligible` by the deterministic model, but two workflows, Smoke Claude and Smoke Copilot, crossed the manual escalation thresholds via repeated high-severity `resource_heavy_for_domain` and medium-severity `poor_agentic_control` assessments across all 3 of their runs in the 14-day window.
The portfolio is dominated by smoke/CI validation workflows that are cheap, stable, and justified. The operational concerns concentrate in three areas: (1) Smoke Claude and Smoke Copilot are consistently expensive and poorly controlled for their stated purpose; (2) The Daily Repository Chronicle holds the highest token volume in the dataset with 40 blocked requests; and (3) Slide Deck Maintainer shows resource-heavy, exploratory behavior with zero write actions, which makes it a strong candidate for model downgrade or deterministic simplification.
The repository has no risky-labeled runs and no MCP failures. The overall safety posture is clean.
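The manual escalation check described above can be sketched roughly as follows. This is a minimal illustration, not the kit's actual logic: the `Run` type, the `flagged_workflows` helper, and the `min_runs=3` cutoff are assumptions introduced here, and the rule "every run in the window carries both assessments" is only one reading of the summary.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Run:
    """Hypothetical record for one workflow run in the analysis window."""
    workflow: str
    # Maps assessment name to severity, e.g. {"resource_heavy_for_domain": "HIGH"}
    assessments: dict = field(default_factory=dict)


def flagged_workflows(runs, min_runs=3):
    """Return workflows where every run has HIGH resource_heavy_for_domain
    and MEDIUM poor_agentic_control (assumed manual-escalation rule)."""
    by_workflow = defaultdict(list)
    for run in runs:
        by_workflow[run.workflow].append(run)

    flagged = []
    for name, group in by_workflow.items():
        if len(group) < min_runs:
            continue  # too few runs to call the pattern "repeated"
        if all(
            r.assessments.get("resource_heavy_for_domain") == "HIGH"
            and r.assessments.get("poor_agentic_control") == "MEDIUM"
            for r in group
        ):
            flagged.append(name)
    return sorted(flagged)


# Illustrative input: one workflow with the repeated pattern, one clean workflow.
example_runs = [
    Run("Smoke Copilot", {"resource_heavy_for_domain": "HIGH",
                          "poor_agentic_control": "MEDIUM"})
    for _ in range(3)
] + [Run("example-stable-workflow") for _ in range(3)]

print(flagged_workflows(example_runs))
```

Requiring the pattern on every run, rather than on any single run, keeps a one-off expensive run from triggering escalation.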
Key Metrics
Highest Risk Episodes
No episodes are `escalation_eligible` per the deterministic model. The three highest-composite-risk episodes are:
The Daily Repository Chronicle stands alone at the top by blocked-request count (40), though it has zero risky or poor-control nodes. The blocked requests are consistent with expected firewall behavior for a read-heavy general automation workflow.
Episode Regressions
All 49 episodes are `kind: standalone` (no DAG chains or delegated workers detected). No edge lineage was present in the data.
Behavioral regressions observed within workflow cohorts:
- `changed` (all 3 runs): `resource_heavy` HIGH + `partially_reducible` MEDIUM across all runs; one run adds `poor_agentic_control` MEDIUM
- `changed` (2/3 runs), `stable` (1): `resource_heavy` HIGH + `poor_agentic_control` MEDIUM across all 3 runs
- `changed` (all 3 runs)
- `changed` (1/3 runs)
The Smoke Claude and Smoke Copilot regressions are the primary operational finding. Both workflows are persistently expensive and weakly controlled relative to their smoke-test purpose.
Visual Diagnostics
1. Episode Risk-Cost Frontier
Decision: Smoke Copilot, Smoke Claude, and The Daily Repository Chronicle dominate the frontier — expensive and/or high-risk, requiring optimization before the next batch.
Why it matters: Smoke Claude and Smoke Copilot are structurally write-heavy and exploratory for a validation workflow; their cost is driven by inference depth, not task complexity. The Daily Repository Chronicle's risk score is entirely blocked-request driven — firewall behavior, not agent misbehavior.
2. Workflow Stability Matrix
Decision: Smoke Copilot and Slide Deck Maintainer are the most unstable workflows; Smoke CI and most engine-specific smokes are flat-stable.
Why it matters: The instability is concentrated, not broad — 2–3 workflows account for almost all the resource-heavy and poor-control signal. The rest of the portfolio is well-behaved.
3. Repository Portfolio Map
Decision: Smoke CI, Smoke Codex, Smoke Gemini, Smoke Crush, Smoke OpenCode belong in `keep`; Design Decision Gate and Test Quality Sentinel in `optimize`; Smoke Copilot, Smoke Claude, Slide Deck Maintainer, and The Daily Repository Chronicle in `review`.
Why it matters: The `review` quadrant (high cost, low value proxy) contains the two escalation-threshold workflows plus two additional heavy consumers. None of them are justified by their current value signal at this cost level.
4. Workflow Overlap Matrix
Decision: The smoke-test family (Smoke CI, Smoke Claude, Smoke Codex, Smoke Copilot, etc.) has high intra-cluster overlap, sharing trigger, task domain, and behavioral fingerprint, but this is likely intentional per-engine coverage rather than redundancy.
Why it matters: The overlap is real but structurally motivated. Consolidation would reduce per-engine visibility. The more actionable question is whether Smoke Claude and Smoke Copilot need to be so heavyweight within their engine smoke slot.
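The `keep` / `optimize` / `review` split used in the portfolio map can be pictured as a simple bucketing over a cost score and a value proxy. The sketch below is illustrative only: the `portfolio_bucket` function, the normalized 0 to 1 scores, and all three cutoffs are assumptions made here, not values taken from the report.

```python
def portfolio_bucket(cost, value, high_cost=0.66, low_cost=0.33, weak_value=0.5):
    """Assign a workflow to a portfolio bucket.

    cost and value are assumed to be normalized to [0, 1];
    the thresholds are hypothetical defaults.
    """
    if cost >= high_cost and value < weak_value:
        return "review"    # high cost, weak current value signal
    if cost < low_cost and value >= weak_value:
        return "keep"      # cheap and justified
    return "optimize"      # everything in between: worth tuning


print(portfolio_bucket(cost=0.9, value=0.2))
print(portfolio_bucket(cost=0.1, value=0.8))
print(portfolio_bucket(cost=0.5, value=0.7))
```

A real classifier would presumably also weigh stability and repeat use, which the report mentions for the `keep` bucket; they are left out here to keep the sketch short.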
Portfolio Opportunities
Repository Portfolio Detail
`keep` (cheap, stable, high repeat use):
- … `general_automation`): ideal candidate for deterministic replacement or minimal-model configuration. High repeat use justifies retention.
`optimize` (moderate cost, good value):
- Design Decision Gate: `changed` vs baseline. Cost is justified for a code-review/architecture gate. Consider cohort-match review to understand why it always diffs from baseline.
- Test Quality Sentinel: `read_only`, stable. Good value but could benefit from a model downgrade check.
`review` (high cost, weak current value signal):
- Smoke Copilot: `write_heavy`, `exploratory`, `triage` domain, all runs flagged `resource_heavy` HIGH + `poor_agentic_control` MEDIUM. Smoke tests should not be write-heavy or exploratory. Strongest simplification candidate.
- Smoke Claude: `write_heavy`, `exploratory`, 3 runs. Same structural problem as Smoke Copilot: heavyweight behavior for a smoke validation task.
- The Daily Repository Chronicle: `resource_heavy` HIGH. Read-only general automation with heavy token footprint. The 40 blocked requests suggest it probes many domains; partially reducible (50% data-gathering turns).
- Slide Deck Maintainer: `exploratory`, `read_only`, `resource_heavy` HIGH + `poor_agentic_control` MEDIUM + `model_downgrade_available` LOW. Single run; heavy for repo maintenance. Strong model-downgrade candidate.
Stale/overlap candidates:
Recommended Actions
- `model_downgrade`: `engine.model: claude-haiku-4-5` or `gpt-4.1-mini`. 30 turns, read-only, no writes; does not need a frontier model.
- `model_downgrade` recommendation. Low-severity, single run, but the signal is clear for the `issue_response` + `read_only` + `moderate` profile.
- `changed` vs baseline. No safety concern, but persistent drift from cohort match suggests prompt or task-shape instability.
Deterministic replacement candidates (consider removing agentic layer entirely):
- `directed` + `read_only` + `lean` + `general_automation`. A deterministic shell script or composite action would be faster and cheaper.
Per-Workflow Run Detail
Smoke Claude (3 runs, all 2026-04-21)
- `resource_heavy_for_domain` HIGH (turns 29–39, tool_types 24–26, write_actions 20–24)
- `partially_reducible` MEDIUM (62–69% data-gathering turns)
- `poor_agentic_control` MEDIUM (friction=0, exploratory, write_heavy)
- `changed` vs baseline on all 3 runs
Smoke Copilot (3 runs, all 2026-04-21)
- `resource_heavy_for_domain` HIGH (turns 15–20, write_actions 28–30, duration 10–14m)
- `poor_agentic_control` MEDIUM (friction=0, exploratory, write_heavy)
- `changed` (2 runs) / `stable` (1 run)
The Daily Repository Chronicle (1 run)
- `resource_heavy_for_domain` HIGH (turns=32, duration=10m56s, write_actions=0)
- `partially_reducible` LOW (50% data-gathering, agentic_fraction=0.50)
Slide Deck Maintainer (1 run)
- `resource_heavy_for_domain` HIGH (turns=30, duration=8m42s, write_actions=0)
- `poor_agentic_control` MEDIUM | `partially_reducible` LOW | `model_downgrade_available` LOW
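The data-gathering percentages and the `agentic_fraction` figure quoted in the run detail can be related with a short calculation. The sketch below assumes that `agentic_fraction` is the complement of the data-gathering share, which matches the 50% / 0.50 pair reported for The Daily Repository Chronicle but is not confirmed by the report. The per-turn labels and both function names are hypothetical.

```python
def data_gathering_share(turn_kinds):
    """Fraction of turns labeled "data_gathering" (label name is an assumption)."""
    if not turn_kinds:
        raise ValueError("at least one turn is required")
    gathering = sum(1 for kind in turn_kinds if kind == "data_gathering")
    return gathering / len(turn_kinds)


def agentic_fraction(turn_kinds):
    """Assumed definition: share of turns that are not pure data gathering."""
    return 1.0 - data_gathering_share(turn_kinds)


# 32 turns, half of them data gathering, as reported for the Chronicle run.
chronicle_turns = ["data_gathering"] * 16 + ["agentic"] * 16

print(data_gathering_share(chronicle_turns))  # 0.5
print(agentic_fraction(chronicle_turns))      # 0.5
```

Under this reading, a run with 62–69% data-gathering turns would have an agentic fraction of roughly 0.31 to 0.38, which is why such runs look like candidates for partial deterministic replacement.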