diff --git a/.claude/commands/ci-report.md b/.claude/commands/ci-report.md new file mode 100644 index 0000000000..dcfa40a78f --- /dev/null +++ b/.claude/commands/ci-report.md @@ -0,0 +1,216 @@ +--- +name: ci-report +description: Generate a CI oncall handoff report analyzing GitHub Actions workflow runs on master and release branches. Shows failures, flaky tests, and action items. +user_invocable: true +--- + +# CI Oncall Report + +Generate a concise CI health report for the Collector oncall handler. This report is designed for oncall handoff — lead with action items, keep it tight. + +## Arguments + +The user may provide a natural language time range as the argument (e.g. "today", "last 3 days", "this week", "since Monday"). Default to "last 24 hours" if no argument is given. Cap at 7 days maximum — if the user requests more, tell them and use 7 days. + +Convert the time range to an ISO 8601 date (YYYY-MM-DD) for the `--created` filter. Use `>=` (not `>`) so that "today" includes today's runs. + +## Process + +Follow these steps in order. Use the Bash tool for all `gh` commands. + +### Step 1: Detect the repository + +```bash +gh repo view --json nameWithOwner -q .nameWithOwner +``` + +If this fails, fall back to parsing the git remote: +```bash +git remote get-url origin | sed 's|.*github.com[:/]||;s|\.git$||' +``` + +Store the result as `REPO` for subsequent commands. + +### Step 2: Fetch all workflow runs + +```bash +gh run list --repo REPO --created ">=YYYY-MM-DD" --limit 500 --json headBranch,status,conclusion,workflowName,databaseId,createdAt,updatedAt,url,event +``` + +If exactly 500 results are returned, warn the user that results may be truncated and suggest narrowing the time window. + +Filter the results to only branches matching `^(master|release-\d+\.\d+)$` — this excludes feature branches and sub-branches like `release-3.24/foo`. + +### Step 3: Group and summarize + +Count **workflow runs** (not individual jobs within runs). Each entry from `gh run list` is one workflow run. For each branch, count: +- Passed (conclusion == "success") +- Failed (conclusion == "failure") +- Cancelled (reported separately, excluded from pass rate) + +Do NOT count skipped jobs or individual job statuses in the summary table — this table is about whole workflow runs only. Individual job failures belong in the Failure Details section. + +Calculate pass rate as: passed / (passed + failed) * 100. + +### Step 4: Fetch failure details + +For each failed run: + +1. Get the failed jobs: +```bash +gh run view RUN_ID --repo REPO --json jobs +``` + +2. Get the failed log output and search for the **real root cause**: +```bash +gh run view RUN_ID --repo REPO --log-failed 2>&1 | grep -e "FAIL:" -e "fatal:.*FAILED" -e "TASK \[" -e "Configure VSI" -e "Run integration tests" -e "##\[error\]" | grep -v "RETRYING" | head -20 +``` + +**IMPORTANT: Do NOT use `tail` on the log output.** The end of the log is typically git cleanup, artifact upload, and `Unarchive logs` steps — these are post-test housekeeping, not the root cause. The `Unarchive logs` step failing with `tar: container-logs/*.tar.gz: Cannot open` is a **symptom** (tests didn't produce logs), never the root cause. + +Instead, search the full log for the actual failure by looking for: +- `--- FAIL: TestName` — Go test failures (the tests ran and a specific test failed) +- `fatal: [hostname]: FAILED!` — Ansible task failures (VM provisioning, image pulls, etc.) +- `TASK [task-name]` lines immediately before `fatal:` lines — identifies which ansible step failed +- `##[error]` — GitHub Actions step errors +- Build/compilation errors + +3. If you need more context around a specific error, use: +```bash +gh run view RUN_ID --repo REPO --log --job JOB_ID 2>&1 | grep -B5 -A10 "FAIL:\|fatal:.*FAILED" | head -50 +``` + +4. Classify the root cause: +- **Test failure**: A `--- FAIL: TestName` line means the tests ran and failed. Report the test name and the assertion/error message. These are real regressions or flaky tests. +- **VM provisioning failure**: An ansible `fatal:` on a `create-vm` or `Configure VSI` task means the test environment couldn't be set up. This is infrastructure, not a test problem. +- **Image pull failure**: An ansible `fatal:` on `Pull non-QA images` or `Pull QA images` could be a non-fatal warning if the tests still ran afterwards. Check whether `Run integration tests` appears later in the log — if it does, the pull failure was not the root cause. +- **Build failure**: A compilation error in the build step. Report the file and error. + +5. Summarize the root cause in one line, naming the specific test or ansible task that failed. + +If log fetching fails for a specific run, note it and continue with other runs. + +### Step 5: Detect flakiness + +Compare runs of the same workflow on the same branch. A job is flaky if it fails in some runs but passes in others within the time window **and the failure has the same root cause each time**. Track the failure frequency (e.g. "failed 2/5 runs"). + +A job that fails in multiple runs with **different** root causes (e.g. one run hits a repo mirror issue, another hits a timeout) is NOT flaky — those are separate infrastructure problems. Only flag as flaky when the same failure pattern repeats intermittently. + +### Step 6: Check previous reports for trends + +Look for existing report files in `docs/oncall/`: +```bash +ls -1 docs/oncall/*-ci-report*.md 2>/dev/null | sort -r | head -5 +``` + +If previous reports exist, read them and extract: +- **Pass rate per branch** from their Branch Health Summary tables +- **Action items** from their Action Items sections + +Use this to build two trend views: +1. **Pass rate trends** — how each branch's pass rate has changed across reports +2. **Action item tracking** — which items from previous reports are now resolved vs still failing + +If no previous reports exist, skip the trends section in the output. + +### Step 7: Generate the report + +Write the report following this exact structure. Be concise throughout — the report should be readable in under 2 minutes. + +**Linking**: Every claim in the report must be independently verifiable. Use the `url` field from the `gh run list` output to link to specific workflow runs. The GitHub Actions filter URL for a branch is `https://github.com/REPO/actions?query=branch%3aBRANCH_NAME`. Include these links so a human reader can click through and verify any data point. + +#### Section 1: Action Items + +This is the most important section. Put it first. List things needing attention, most urgent first. Each item should include: +- What needs attention and why +- Link to the relevant run(s) +- Classification: regression, flaky, infrastructure, or needs investigation + +Example format: +``` +- **Regression**: integration-tests failing on master since Mar 24 — NetworkConnection test timeout. [Run #1234](url) +- **Flaky**: k8s-integration-tests on release-3.24 — fails 2/5 runs, ProcessSignal assertion. [Run #1230](url) +- **Investigate**: Konflux build failures on release-3.23 — image pull error. [Run #1228](url) +``` + +If nothing needs attention: "All clear — no action items." + +#### Section 2: Branch Health Summary + +One line per branch. Count whole workflow runs only (not individual jobs). Cancelled runs shown separately, excluded from pass rate. Do NOT add a "Skipped" column. Link each branch name to its GitHub Actions filter page. + +``` +| Branch | Runs | Passed | Failed | Cancelled | Pass Rate | +|--------------|------|--------|--------|-----------|-----------| +| [master](https://github.com/REPO/actions?query=branch%3Amaster) | 12 | 11 | 1 | 0 | 92% | +| [release-3.24](https://github.com/REPO/actions?query=branch%3Arelease-3.24) | 8 | 6 | 2 | 0 | 75% | +``` + +#### Section 3: Flaky Jobs + +Only include this section if flakiness was detected. Link to an example failing run for each entry. + +``` +| Job | Branch | Fail Rate | Pattern | Example | +|-------------------|--------------|-----------|---------------------|---------| +| NetworkConnection | master | 2/10 | Timeout after 120s | [Run #1234](url) | +``` + +#### Section 4: Failure Details + +Group by root cause where possible. Each entry: +- Branch, workflow, run link +- Failed job name +- One-line root cause + +Only include log excerpts when the cause is non-obvious. Keep this section short. + +#### Section 5: Trends + +Only include this section if previous reports were found in `docs/oncall/`. + +**Pass rate trends** — show how each branch's health has changed. Use the dates from previous report filenames as column headers. + +``` +| Branch | Mar 23 | Mar 24 | Mar 25 (today) | +|--------------|--------|--------|----------------| +| master | 100% | 85% | 92% | +| release-3.24 | 90% | 75% | 75% | +``` + +**Action item tracking** — compare today's action items against previous reports. For each previous action item, note whether it's resolved, still present, or new. + +``` +- **Resolved**: NetworkConnection timeout on master (first seen Mar 23, resolved today) +- **Ongoing**: Konflux build failures on release-3.23 (first seen Mar 24, still failing) +- **New**: integration-tests regression on master (first seen today) +``` + +Keep this concise — only mention items that changed status or have persisted for multiple reports. + +#### Section 6: Stats + +Reference information at the bottom: +- Date range analyzed +- Total runs across all branches +- Overall pass rate +- Report generated timestamp + +### Step 8: Save the report + +1. Create the output directory if needed: +```bash +mkdir -p docs/oncall +``` + +2. Save to `docs/oncall/YYYY-MM-DD-ci-report.md` using today's date. + +3. If that file already exists, try `-2`, `-3`, etc. until a unique filename is found. + +4. Display the full report content in the terminal as well. + +## Error Handling + +- If `gh` is not authenticated, tell the user to run `! gh auth login` (the `!` prefix runs it in the current session). +- If no runs are found in the time window, report that clearly — don't generate an empty report. +- If individual log fetches fail, note the failure and continue with other runs.