
feat(ci): add CI Failure Doctor workflow#340

Merged
Mossaka merged 5 commits into main from
copilot/add-ci-failure-doctor-workflow
Jan 18, 2026

Contributor

Copilot AI commented Jan 18, 2026

Creates an automated CI Failure Doctor workflow that investigates CI/CD failures, analyzes logs, identifies root causes, and creates investigation issues.

Changes Made

  • New workflow: .github/workflows/ci-doctor.md - an agentic workflow that:
    • Triggers on workflow_run completion for all 23 workflows in the repository (GitHub Actions doesn't support wildcards)
    • Only activates when conclusion == 'failure'
    • Uses GitHub Actions tools to fetch and analyze job logs
    • Searches cache-memory for similar historical failures
    • Creates investigation issues with root cause analysis and recommended actions
    • Labels issues with bug, ci
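The failure-only activation described above can be sketched as a small predicate over the `workflow_run` event payload. This is a minimal illustration, not the workflow's actual gate: the function name `should_investigate` is hypothetical, and the payload is reduced to the two fields the check needs.

```python
from typing import Any, Mapping


def should_investigate(event: Mapping[str, Any]) -> bool:
    """Return True only for a completed workflow run that concluded in failure."""
    run = event.get("workflow_run") or {}
    return event.get("action") == "completed" and run.get("conclusion") == "failure"


if __name__ == "__main__":
    # Trimmed-down sample payloads (hypothetical values).
    failed = {"action": "completed", "workflow_run": {"conclusion": "failure"}}
    passed = {"action": "completed", "workflow_run": {"conclusion": "success"}}
    print(should_investigate(failed))
    print(should_investigate(passed))
```

Runs that are cancelled, skipped, or successful fall through to `False`, so only genuine failures would start an investigation.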

Design Decisions

  • All workflows monitored: Since GitHub Actions doesn't support wildcards for workflow_run, all workflow names are explicitly listed with a maintenance comment
  • Concise instructions: The agent prompt is intentionally minimal (~40 lines) to give the agent maximum freedom to explore and discover insights rather than following a rigid investigation structure
  • Key patterns provided: Common AWF-specific failure patterns (Docker network conflicts, container cleanup issues, iptables problems, Squid proxy failures) are listed as hints without prescriptive investigation steps
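To make the "hints, not steps" idea concrete, the sketch below tags log lines with the four failure families named above. The regular expressions are illustrative guesses at what such log lines might look like; they are not taken from the workflow, and `PATTERNS` and `classify` are hypothetical names.

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical hint patterns, one per failure family; tune against real logs.
PATTERNS = {
    "docker-network-conflict": re.compile(r"pool overlaps|subnet pool overlap", re.I),
    "container-cleanup": re.compile(r"container name .* is already in use", re.I),
    "iptables": re.compile(r"iptables.*(permission denied|failed)", re.I),
    "squid-proxy": re.compile(r"squid.*(unhealthy|failed|timed out)", re.I),
}


def classify(lines: Iterable[str]) -> Counter:
    """Count how many log lines match each failure family."""
    counts: Counter = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Error response from daemon: Pool overlaps with other one on this address space",
        "iptables: Permission denied (you must be root)",
        "squid-proxy container is unhealthy",
        "all tests passed",
    ]
    print(classify(sample))
```

A tally like this only narrows where to look; the root-cause judgment is still left to the agent.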

Configuration

  • Timeout: 10 minutes
  • Tools: GitHub API (with actions toolset), cache-memory
  • Safe outputs: Issue creation with "🏥 CI Failure" prefix, comments (max 1)
Original prompt

This section details the original issue you should resolve

<issue_title>[plan] Implement CI Failure Doctor workflow</issue_title>
<issue_description>## Objective

Create an automated workflow that investigates CI/CD failures, analyzes logs, identifies root causes, and creates detailed investigation reports with remediation steps.

Context

This repository has complex Docker/networking tests that frequently fail with opaque errors like "subnet pool overlap" and container cleanup race conditions. Manual log analysis wastes developer time and accumulated failure patterns aren't leveraged.

Approach

  1. Create workflow file: .github/workflows/ci-doctor.md
  2. Configure triggers:
    • workflow_run on completion of: test-integration, test-coverage, test-action
    • Only trigger when conclusion == 'failure'
  3. Implement investigation protocol:
    • Fetch workflow run details and job logs via GitHub API
    • Analyze for Docker network issues (subnet pool exhaustion, overlaps)
    • Check for container cleanup race conditions
    • Detect iptables rule conflicts
    • Identify Squid proxy startup failures
    • Search for similar past failures using cache-memory
  4. Create investigation report:
    • Detailed root cause analysis
    • Specific remediation steps
    • Link to similar past failures
    • Label with bug, ci, needs-investigation
  5. Store failure patterns: Update cache-memory with new patterns
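Steps 3 and 5 above (look up similar past failures, then store the new pattern) can be sketched as a fingerprint lookup. The issue does not describe how cache-memory stores data, so the JSON file used here is a hypothetical stand-in, and `fingerprint` and `recall_and_record` are illustrative names.

```python
import hashlib
import json
import re
import tempfile
from pathlib import Path


def fingerprint(error_line: str) -> str:
    """Hash an error line after masking volatile tokens (hex ids, numbers)."""
    normalized = re.sub(r"\b[0-9a-f]{7,}\b", "<id>", error_line.lower())
    normalized = re.sub(r"\d+", "<n>", normalized)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def recall_and_record(store: Path, error_line: str, run_id: int) -> list[int]:
    """Return run ids seen earlier with the same fingerprint, then record this run."""
    history = json.loads(store.read_text()) if store.exists() else {}
    key = fingerprint(error_line)
    earlier = list(history.get(key, []))
    history.setdefault(key, []).append(run_id)
    store.write_text(json.dumps(history, indent=2))
    return earlier


if __name__ == "__main__":
    # Hypothetical store location; a real run would use the memory directory.
    store = Path(tempfile.mkdtemp()) / "failure-patterns.json"
    print(recall_and_record(store, "Container 3f2a9c1b7d exited with code 137", 101))
    print(recall_and_record(store, "Container a81b22c90e exited with code 1", 102))
```

Masking container ids and exit codes before hashing lets two runs with the same underlying failure share a fingerprint, which is what makes the "link to similar past failures" step possible.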

Files to Create

Domain-Specific Focus Areas

  • Docker network pool exhaustion (172.30.0.0/24 subnet conflicts)
  • Container cleanup race conditions (timeout kills leave orphaned resources)
  • iptables rule conflicts (NET_ADMIN capability issues)
  • Squid proxy healthcheck failures
  • GitHub Actions runner Docker version incompatibilities
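For the first focus area, a subnet conflict can be checked directly with Python's standard `ipaddress` module. This is a sketch under stated assumptions: the list of existing subnets is made up for illustration (in practice it might be collected from `docker network inspect`), and `find_conflicts` is a hypothetical helper name.

```python
import ipaddress


def find_conflicts(candidate: str, existing: list[str]) -> list[str]:
    """Return the existing subnets whose address range overlaps the candidate."""
    wanted = ipaddress.ip_network(candidate)
    return [s for s in existing if wanted.overlaps(ipaddress.ip_network(s))]


if __name__ == "__main__":
    # Hypothetical subnets already allocated on the runner.
    in_use = ["172.17.0.0/16", "172.30.0.0/16", "10.0.0.0/8"]
    print(find_conflicts("172.30.0.0/24", in_use))
```

An empty result means the candidate range is free; any entry returned is a network that would need to be removed or avoided before the test network can be created.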

Acceptance Criteria

  • Workflow triggers automatically on CI failures for specified test workflows
  • Creates detailed investigation issues with root cause analysis
  • Identifies Docker/networking specific failure patterns
  • Searches cache-memory for similar historical failures
  • Provides actionable remediation steps
  • Completes within 10 minutes timeout

Success Metrics

AI generated by Plan Command for discussion #328

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 2 commits January 18, 2026 01:43
Co-authored-by: Mossaka <5447827+Mossaka@users.noreply.github.com>
Co-authored-by: Mossaka <5447827+Mossaka@users.noreply.github.com>
Copilot AI requested a review from Mossaka January 18, 2026 01:45

on:
  workflow_run:
    workflows: ["TypeScript Type Check", "Test Coverage", "Test Setup Action", "Examples Test"]
Collaborator

@copilot can you just run this for every workflow in this repo? Is there a "*" syntax in agentic workflows?

Contributor Author

GitHub Actions workflow_run doesn't support wildcards - each workflow must be listed explicitly. Updated to list all 23 workflows in the repo in 1c080d6.
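Because the list has to be maintained by hand, one way to keep it in sync is to regenerate it from the workflow files. The sketch below is an assumption-laden helper, not part of this PR: it reads the top-level `name:` key of each `.yml`/`.yaml` file with a regular expression (no YAML parser, so inline comments after the name are not handled) and skips files without one.

```python
import json
import re
from pathlib import Path

# Matches an unindented "name:" key, i.e. the workflow name rather than a job or step name.
NAME_RE = re.compile(r"^name:\s*(.+?)\s*$", re.MULTILINE)


def workflow_names(workflows_dir: Path) -> list[str]:
    """Collect top-level workflow names from every YAML file in the directory."""
    names = []
    for path in sorted(workflows_dir.glob("*.y*ml")):
        match = NAME_RE.search(path.read_text(encoding="utf-8"))
        if match:
            names.append(match.group(1).strip("\"'"))
    return names


def as_flow_list(names: list[str]) -> str:
    """Render names as a double-quoted flow sequence suitable for `workflows:`."""
    return json.dumps(names)


if __name__ == "__main__":
    print(as_flow_list(workflow_names(Path(".github/workflows"))))
```

The printed list can be pasted into the `workflows:` key whenever a workflow is added or renamed.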

- **Head SHA**: ${{ github.event.workflow_run.head_sha }}
- **Head Branch**: ${{ github.event.workflow_run.head_branch }}

## Investigation Protocol
Collaborator

@copilot modify the instructions to make them much more concise. The instructions should be general rather than specific, to give the agent the maximum degree of freedom to explore by itself and discover insights, instead of instructing it to follow a rigid structure.

Contributor Author

Simplified the instructions from 268 lines to ~40 lines. Now provides just context, a 5-step mission, key patterns to watch for, and expected output format. Agent has full freedom to explore and discover insights. See 1c080d6.

Copilot AI and others added 2 commits January 18, 2026 21:28
Co-authored-by: Mossaka <5447827+Mossaka@users.noreply.github.com>
Co-authored-by: Mossaka <5447827+Mossaka@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Create automated CI failure doctor workflow" to "feat(ci): add CI Failure Doctor workflow" Jan 18, 2026
Copilot AI requested a review from Mossaka January 18, 2026 21:30
Mossaka marked this pull request as ready for review January 18, 2026 22:42
Mossaka merged commit 64b4e40 into main Jan 18, 2026
28 checks passed
Mossaka deleted the copilot/add-ci-failure-doctor-workflow branch January 18, 2026 22:51

Development

Successfully merging this pull request may close these issues.

[plan] Implement CI Failure Doctor workflow

2 participants