Overview
Workflow health assessment for 2026-03-20. 175 workflows monitored. Score: 66/100 (↑10 from 56 yesterday).
Major recoveries: Issue Monster fully resolved ✅, PR Triage recovering ✅, Daily Workflow Updater recovered ✅. GH_AW_GITHUB_TOKEN issue largely resolved.
Ongoing: Issue Triage Agent (10+ consecutive failures, not GH_AW_GITHUB_TOKEN related), Smoke Gemini (6+ consecutive failures). 14 stale lock files (down from 15).
Critical Issues 🚨
P0 → RESOLVED: Issue Monster / PR Triage Agent (GH_AW_GITHUB_TOKEN)
- Issue Monster: ✅ FULLY RECOVERED, 5/5 runs successful today (runs #3098–3102, Mar 20 05:21–07:17 UTC)
- PR Triage Agent: ✅ RECOVERING, run #265 succeeded (Mar 20 06:14 UTC). 4 prior runs failed; 1 success today signals recovery.
- Root cause resolved: `GH_AW_GITHUB_TOKEN` secret now properly configured.
P0 (escalated from P1): Issue Triage Agent, Persistent Failure (Day 14+)
- Status: 10/10 schedule runs failed. Last run: #135 (Mar 19 14:25 UTC). No run yet today.
- Duration: Failing since at least March 6 (run #88). Pre-dates GH_AW_GITHUB_TOKEN issue.
- Pattern: Consistent schedule failures only — different root cause than Issue Monster
- Last success: Unknown — possibly never recovered or failed during initial deployment
- Impact: Automatic issue triage non-functional
- Priority: P0 (escalated from P1 — now 14+ days continuous failure, separate root cause)
P1: Smoke Gemini — Consistent Failure (Day 5)
- Status: 6/6 consecutive schedule failures (Mar 15 12:34 – Mar 20 00:52 UTC)
- Only recent success: Mar 17 00:51 UTC (run #373), an isolated success in a sea of failures
- Last runs: #476 failure (Mar 20 00:52), #468 failure (Mar 19 12:35), #458 failure (Mar 19 00:55)
- Impact: Gemini engine smoke tests not validating; regressions may go undetected
- Action: Investigate Gemini API endpoint, model name, or key expiry
- Priority: P1
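The streak counts quoted in the two entries above can be reproduced from a workflow's run history. A minimal sketch, assuming each run has been reduced to a conclusion string such as `"success"` or `"failure"` and that the list is ordered oldest first; the function name and the sample data are hypothetical and are not taken from the actual run logs:

```python
def trailing_failure_streak(conclusions):
    """Count consecutive failures at the end of a run history.

    `conclusions` is assumed to be ordered oldest first, one string per run.
    """
    streak = 0
    for conclusion in reversed(conclusions):
        if conclusion != "failure":
            break  # any non-failure run ends the current streak
        streak += 1
    return streak


# Hypothetical history: one isolated success in the middle resets the count.
history = ["failure", "success", "failure", "failure"]
print(trailing_failure_streak(history))  # → 2
```

Under this definition an isolated success resets the streak, which is worth keeping in mind when a consecutive-failure count is reported alongside a recent success.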
Recoveries Since Yesterday ✅
Issue Monster — FULLY RECOVERED (was P0 Day 5)
- All 5 recent runs: success (runs #3098–3102)
- Resolution: GH_AW_GITHUB_TOKEN secret configured
- Classification: Permanently moved to Healthy
Daily Workflow Updater — RECOVERED (was P1 Day 11)
- Run #132 succeeded on 2026-03-19T09:28 UTC
- 11 consecutive failures resolved
- Status: Monitoring for continued stability
PR Triage Agent — RECOVERING (was P0 Day 4)
- Run #265 succeeded (Mar 20 06:14 UTC)
- Still monitoring; 4 prior failures
Warnings ⚠️
P2: Stale Lock Files (14 files — down from 15)
View Stale Lock File List
These .md files have been modified after their corresponding .lock.yml was last generated:
| Workflow | Status |
|---|---|
| blog-auditor | Stale |
| breaking-change-checker | Stale (NEW) |
| copilot-cli-deep-research | Stale (NEW) |
| daily-multi-device-docs-tester | Stale (NEW) |
| daily-regulatory | Stale (NEW) |
| dependabot-go-checker | Stale |
| discussion-task-miner | Stale (NEW) |
| example-workflow-analyzer | Stale (NEW) |
| jsweep | Stale (NEW) |
| prompt-clustering-analysis | Stale (NEW) |
| release | Stale (NEW) |
| security-alert-burndown.campaign.g | Stale (NEW) |
| update-astro | Stale (NEW) |
| workflow-skill-extractor | Stale (NEW) |
Fix: Run make recompile in the repository root.
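The staleness condition described above (a `.md` source modified after its `.lock.yml` was generated) could be approximated as follows. This is a sketch only: it assumes all workflows live in one directory, that `name.md` compiles to `name.lock.yml`, and that file modification times are meaningful. A fresh git checkout does not preserve modification times, so the real check may well compare commit timestamps or content hashes instead. `find_stale_locks` is a hypothetical helper, not part of the repository's tooling.

```python
from pathlib import Path


def find_stale_locks(workflow_dir):
    """Return workflow names whose lock file is missing or older than its source.

    Assumption: `<name>.md` compiles to `<name>.lock.yml` in the same directory.
    """
    stale = []
    for source in sorted(Path(workflow_dir).glob("*.md")):
        lock = source.with_name(source.stem + ".lock.yml")
        if not lock.exists():
            stale.append(source.stem)  # never compiled: reported as stale too
        elif source.stat().st_mtime > lock.stat().st_mtime:
            stale.append(source.stem)  # source edited after the lock was generated
    return stale
```

A call such as `find_stale_locks(".github/workflows")` (the path is an assumption) would return the names to recompile; an empty list means every lock file is at least as new as its source.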
Healthy Workflows ✅
Core infrastructure operating normally (spot-checked):
- Issue Monster ✅ (5/5 success today — fully recovered)
- PR Triage Agent ✅ (latest run success — recovering)
- Daily Workflow Updater ✅ (recovered after 11-day failure streak)
- Bot Detection ✅ (consistent streak maintained)
- Safe Output Health Monitor ✅ (3/3 recent runs success: Mar 18–20)
- Smoke Copilot / Claude / Codex ✅ (assumed healthy based on prior data)
- Metrics Collector ✅ (healthy)
Metrics Summary
| Category | Count | Change |
|---|---|---|
| P0 Critical | 1 | ↓2 (Issue Monster + PR Triage resolved) |
| P1 High | 1 | → same (Smoke Gemini) |
| P2 Warning | 1 (stale locks) | ↓1 (14 vs 15) |
| Stale lock files | 14 | ↓1 from 15 |
| Score | 66/100 | ↑10 from 56 |
Systemic Patterns
- GH_AW_GITHUB_TOKEN resolution: Major improvement — 3 P0 workflows affected last run, now 1 remains failing with a different root cause (Issue Triage Agent failing independently since March 6)
- Issue Triage Agent isolation: Its failures pre-date the GH_AW_GITHUB_TOKEN crisis, suggesting a separate structural/configuration issue with the workflow itself
- Smoke Gemini degradation: 6-day run of schedule failures. Single success on Mar 17 00:51 may indicate flakiness or API throttling pattern
- Stale lock file churn: 9 new stale files appeared. Batch of `.md` workflow edits committed without running `make recompile`
Recommendations
High Priority
- Investigate Issue Triage Agent root cause (P0) — 14+ days of consistent schedule failures, separate from GH_AW_GITHUB_TOKEN. Check activation logic, trigger events, and agent configuration
- Investigate Smoke Gemini failures (P1) — Check Gemini API key expiry, model endpoint URL, or quota limits. Last success was Mar 17 00:51, consider retry diagnostics
- Run `make recompile` (P2): fixes 14 stale lock files immediately
Medium Priority
- Monitor PR Triage Agent and Daily Workflow Updater for continued stability
- Weekly Issue Summary: Check most recent schedule runs to verify it also recovered with GH_AW_GITHUB_TOKEN fix
Actions Taken This Run
- Created this dashboard issue for 2026-03-20
- Escalated Issue Triage Agent from P1 → P0 (independent 14-day failure, separate root cause)
- Moved Issue Monster to Resolved ✅
- Updated shared memory with current health state
Run: §23333198222
Timestamp: 2026-03-20T07:28Z
Next check: 2026-03-21 ~07:30Z
Previous dashboard: #21757
Note
🔒 Integrity filtering filtered 1 item
Integrity filtering activated and filtered the following item during workflow execution.
This happens when a tool call accesses a resource that does not meet the required integrity or secrecy level of the workflow.
- issue:#0 (search_issues: Resource 'issue:#0' has lower integrity than agent requires. Agent would need to drop integrity tags [unapproved:all approved:all] to trust this resource.)
Generated by Workflow Health Manager - Meta-Orchestrator · ◷
- expires on Mar 21, 2026, 7:34 AM UTC