Skip to content

Fix Claude build analysis gate: correctly skip only when all failures are non-essential#22760

Merged
mokagio merged 9 commits intotrunkfrom
iangmaia/update-claude-build-analysis
Apr 9, 2026
Merged

Fix Claude build analysis gate: correctly skip only when all failures are non-essential#22760
mokagio merged 9 commits intotrunkfrom
iangmaia/update-claude-build-analysis

Conversation

@iangmaia
Copy link
Copy Markdown
Contributor

@iangmaia iangmaia commented Apr 2, 2026

Description

Improves the Claude Build Analysis Buildkite integration:

  1. Model upgrade: Sonnet 4.5 → Sonnet 4.6
  2. Custom fork: Switches to iangmaia/claude-summarize fork with build_log_mode: "failed" and max_log_lines: 1500 for focused log analysis
  3. Smart gating: Skips Claude analysis entirely when only non-essential jobs (e.g. Danger) fail — no annotation, no noise
  4. Co-failure correctness: When non-essential jobs fail alongside real failures, Claude still runs. Uses the Buildkite API to count total failures and compare against non-essential failures. Fails safe (runs Claude if API is unreachable).
  5. Prompt improvements: Tells Claude to ignore Danger, broken jobs, and the analysis job itself. Uses YAML literal scalar (|) to preserve list formatting.
  6. Script hardening: Portable shebang, proper Bash array for step keys, timed_out jobs included in failure count.

Test Steps

  1. Push a change that breaks the build → Claude analysis should run and annotate real failures
  2. Push a change where only Danger fails → Claude analysis should be skipped entirely (no annotation)
  3. If both Danger and a real test fail → Claude should still run

  • I have considered if this change warrants user-facing release notes and have added them to RELEASE-NOTES.txt if necessary.

@iangmaia iangmaia self-assigned this Apr 2, 2026
@wpmobilebot
Copy link
Copy Markdown
Contributor

wpmobilebot commented Apr 2, 2026

App Icon📲 You can test the changes from this Pull Request in Jetpack Android by scanning the QR code below to install the corresponding build.

App NameJetpack Android
Build TypeDebug
Versionpr22760-ab82dac
Build Number1488
Application IDcom.jetpack.android.prealpha
Commitab82dac
Installation URL5r3fh0k3ostko
Automatticians: You can use our internal self-serve MC tool to give yourself access to those builds if needed.

@wpmobilebot
Copy link
Copy Markdown
Contributor

wpmobilebot commented Apr 2, 2026

App Icon📲 You can test the changes from this Pull Request in WordPress Android by scanning the QR code below to install the corresponding build.

App NameWordPress Android
Build TypeDebug
Versionpr22760-ab82dac
Build Number1488
Application IDorg.wordpress.android.prealpha
Commitab82dac
Installation URL09mofgpufsmcg
Automatticians: You can use our internal self-serve MC tool to give yourself access to those builds if needed.

@iangmaia iangmaia added this to the 26.8 milestone Apr 2, 2026
@iangmaia iangmaia requested review from mokagio and twstokes April 2, 2026 18:58
@iangmaia iangmaia enabled auto-merge (squash) April 2, 2026 19:01
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud bot commented Apr 2, 2026

Comment on lines +8 to +18
# Non-essential steps whose failure alone should NOT trigger Claude analysis.
# Add step keys here as needed.
NON_ESSENTIAL_STEPS="danger"

for step_key in ${NON_ESSENTIAL_STEPS}; do
OUTCOME=$(buildkite-agent step get outcome --step "${step_key}" 2>/dev/null || echo "not_run")
if [ "${OUTCOME}" = "failed" ]; then
echo "Only non-essential job '${step_key}' failed, skipping Claude analysis"
exit 0
fi
done
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot can you check this logic for me? My Bash is not strong, but the input in the for loop does not look like an array to me.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — it was a plain string, not an array. With an unquoted string variable, the for loop relies on word splitting (which works for simple keys but is fragile and can trigger glob expansion). Fixed in 0bbf3dd by converting NON_ESSENTIAL_STEPS to a proper Bash array and iterating with "${NON_ESSENTIAL_STEPS[@]}":

NON_ESSENTIAL_STEPS=("danger")

for step_key in "${NON_ESSENTIAL_STEPS[@]}"; do

Adding more step keys is now just a matter of adding elements to the array: ("danger" "other-step").

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot Another question: In the way the check is, I get the idea that if one of the non-essential steps fail, e.g. danger, we'll hit the exit 0 line before we have a chance to check whether any of the essential steps failed. Or am I missing something?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — that was a real logic bug. The old code exited as soon as it saw any non-essential step fail, so if Danger and essential tests both failed, Claude would be skipped incorrectly.

Fixed in 9ceba24. The new approach:

  1. Loop through all non-essential steps and count their failures (no early exit).
  2. If zero non-essential failures → something essential failed → run Claude.
  3. If some non-essential failures → query the Buildkite REST API for the total failed job count in this build.
  4. Only skip Claude when non_essential_failures == total_failures (i.e., all failures are accounted for by known non-essential steps).
  5. If the API call fails → fail-safe: run Claude anyway.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only skip Claude when non_essential_failures == total_failures (i.e., all failures are accounted for by known non-essential steps).

But in that case, if we have two non-essential failures and only one failed (e.g. danger and swiftlint are non-essential. danger is green, but swiftlint failed) we would not hit the return condition, would we?

Copy link
Copy Markdown
Contributor Author

@iangmaia iangmaia Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mokagio I believe I've fixed the issues you mentioned / found with Co-pilot in this PR and in the others. I was reluctant to use the build status call (it was also suggested to me by Claude) but it's probably the best way in terms of correctness and reliability of the pipeline.

I've tweaked the solution a little bit and updated on 3e7fd18, but the main changes are done in other repos.

Thanks for looking into it! 🙇

build_log_mode: "failed"
trigger: "on-failure"
max_log_lines: 1500
custom_prompt: >
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL about folded scalar > and literal scalar |.

> (folded scalar) joins lines into a single line — newlines become spaces. | (literal scalar) preserves newlines as-is.

  For a prompt sent to an LLM, > is fine since the model doesn't care about line breaks. | would matter if the exact formatting needed to be preserved (e.g., code blocks, numbered lists where you
  want literal newlines).

  In this case either works, but > is the more natural choice for prose.

Opus 4.6 says that > is better for prose, but I disagree. The prompt here has an ordered list and an intentional new line. If models care about all caps and bold, we should also carry over lists and paragraphs.

Suggested change
custom_prompt: >
custom_prompt: |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — the prompt has an ordered list and a blank paragraph separator, so | (literal scalar) is the right choice to preserve that structure. Changed in 0bbf3dd.

auto-merge was automatically disabled April 8, 2026 04:13

Head branch was pushed to by a user without write access

Copilot AI changed the title Update Claude build analysis prompt and model to Sonnet 4.6 Fix Bash script robustness and YAML scalar type in Claude build analysis Apr 8, 2026
Copilot AI requested a review from mokagio April 8, 2026 04:14
Copilot AI changed the title Fix Bash script robustness and YAML scalar type in Claude build analysis Fix Claude build analysis gate: correctly skip only when all failures are non-essential Apr 8, 2026
iangmaia and others added 7 commits April 8, 2026 17:41
Improve the custom prompt to ignore non-real failures (the intentional
exit 1 trigger, Danger PR Check, and broken/skipped jobs) and respond
briefly when no actual failures exist. Migrate model from Sonnet 4.5
to Sonnet 4.6.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use iangmaia/claude-summarize fork which adds build_log_mode, max_log_lines,
and on-failure trigger. This feeds only failed job logs to Claude (capped at
1500 lines) for more focused analysis with less noise.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add upload-claude-analysis.sh that checks non-essential step outcomes
before uploading the Claude analysis pipeline. When only Danger or
other non-critical jobs failed, the pipeline is not uploaded — no
analysis, no annotation, no PR comment.

Also simplify the custom prompt now that Danger is filtered upstream.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When real failures co-occur with a Danger failure, Claude should still
ignore Danger rather than wasting analysis space on it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The jq filter only matched state == "failed", missing jobs with
state == "timed_out". This could cause Claude to be skipped when a
real job times out alongside a non-essential failure like Danger.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@iangmaia iangmaia force-pushed the iangmaia/update-claude-build-analysis branch from 6950c3a to 3e7fd18 Compare April 8, 2026 15:41
@mokagio
Copy link
Copy Markdown
Contributor

mokagio commented Apr 9, 2026

Thanks @iangmaia !

I opened #22779 to test this. I verified the build analysis runs when there are build failures. Unfortunately, the script did not behave as expected when only Danger failed and still run the build failure analysis

image

I had Claude monitor the pipeline and this is what it guessed

⏺ CI Results — Build #25938

  The gating step did not skip Claude analysis as expected. Here's what happened:

  All builds and tests passed. Only Danger failed (missing label/milestone) — as expected.

  Bug found: The upload-claude-analysis.sh gating script failed to detect Danger's failure. The
  buildkite-agent step get outcome --step "danger" call likely can't resolve the step key when
  it's nested inside a group: "Linters" block in the pipeline YAML. It falls back to "not_run",
  so non_essential_failures stays 0, and the script hits the early exit path assuming essential
  steps must have failed.

  The net effect: Claude analysis ran unnecessarily and (correctly) reported the build is healthy
   — but ideally it shouldn't have run at all.

  Want to investigate the grouped step key issue in upload-claude-analysis.sh?

`buildkite-agent step get outcome` returns `hard_failed`, not `failed`.
The wrong string meant non-essential failures were never detected.

---

Generated with the help of Claude Code, https://claude.ai/code

Co-Authored-By: Claude Code Opus 4.6 <noreply@anthropic.com>
@mokagio
Copy link
Copy Markdown
Contributor

mokagio commented Apr 9, 2026

1ddc7f4 fixes the issue reported above.

I run a few more scenarios in the #22779 PR to see what happened and it was all good.

Ready to ship.

Query total failures first via the API.
When zero, exit early instead of falling through
to the non-essential check which assumed failures.

---

Generated with the help of Claude Code, https://claude.ai/code

Co-Authored-By: Claude Code Opus 4.6 <noreply@anthropic.com>
@mokagio mokagio enabled auto-merge (squash) April 9, 2026 07:36
@mokagio mokagio merged commit 167a2d9 into trunk Apr 9, 2026
21 checks passed
@mokagio mokagio deleted the iangmaia/update-claude-build-analysis branch April 9, 2026 07:37
@mokagio
Copy link
Copy Markdown
Contributor

mokagio commented Apr 9, 2026

@iangmaia sorry, I went a bit rogue here and committed directly on your PR. Copilot had already added commits here so I just rolled with it. Besides, this seemed to me like mostly a matter of trial and error with the script till the criteria were met, more that a whole design overhaul.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants