Conversation
|
|
|
|
|
| # Non-essential steps whose failure alone should NOT trigger Claude analysis. | ||
| # Add step keys here as needed. | ||
| NON_ESSENTIAL_STEPS="danger" | ||
|
|
||
| for step_key in ${NON_ESSENTIAL_STEPS}; do | ||
| OUTCOME=$(buildkite-agent step get outcome --step "${step_key}" 2>/dev/null || echo "not_run") | ||
| if [ "${OUTCOME}" = "failed" ]; then | ||
| echo "Only non-essential job '${step_key}' failed, skipping Claude analysis" | ||
| exit 0 | ||
| fi | ||
| done |
There was a problem hiding this comment.
@copilot can you check this logic for me? My Bash is not strong, but the input in the for loop does not look like an array to me.
There was a problem hiding this comment.
You're right — it was a plain string, not an array. With an unquoted string variable, the for loop relies on word splitting (which works for simple keys but is fragile and can trigger glob expansion). Fixed in 0bbf3dd by converting NON_ESSENTIAL_STEPS to a proper Bash array and iterating with "${NON_ESSENTIAL_STEPS[@]}":
NON_ESSENTIAL_STEPS=("danger")
for step_key in "${NON_ESSENTIAL_STEPS[@]}"; doAdding more step keys is now just a matter of adding elements to the array: ("danger" "other-step").
There was a problem hiding this comment.
@copilot Another question: In the way the check is, I get the idea that if one of the non-essential steps fail, e.g. danger, we'll hit the exit 0 line before we have a chance to check whether any of the essential steps failed. Or am I missing something?
There was a problem hiding this comment.
You're right — that was a real logic bug. The old code exited as soon as it saw any non-essential step fail, so if Danger and essential tests both failed, Claude would be skipped incorrectly.
Fixed in 9ceba24. The new approach:
- Loop through all non-essential steps and count their failures (no early exit).
- If zero non-essential failures → something essential failed → run Claude.
- If some non-essential failures → query the Buildkite REST API for the total failed job count in this build.
- Only skip Claude when
non_essential_failures == total_failures(i.e., all failures are accounted for by known non-essential steps). - If the API call fails → fail-safe: run Claude anyway.
There was a problem hiding this comment.
Only skip Claude when non_essential_failures == total_failures (i.e., all failures are accounted for by known non-essential steps).
But in that case, if we have two non-essential failures and only one failed (e.g. danger and swiftlint are non-essential. danger is green, but swiftlint failed) we would not hit the return condition, would we?
There was a problem hiding this comment.
@mokagio I believe I've fixed the issues you mentioned / found with Co-pilot in this PR and in the others. I was reluctant to use the build status call (it was also suggested to me by Claude) but it's probably the best way in terms of correctness and reliability of the pipeline.
I've tweaked the solution a little bit and updated on 3e7fd18, but the main changes are done in other repos.
Thanks for looking into it! 🙇
.buildkite/claude-analysis.yml
Outdated
| build_log_mode: "failed" | ||
| trigger: "on-failure" | ||
| max_log_lines: 1500 | ||
| custom_prompt: > |
There was a problem hiding this comment.
TIL about folded scalar > and literal scalar |.
> (folded scalar) joins lines into a single line — newlines become spaces. | (literal scalar) preserves newlines as-is.
For a prompt sent to an LLM, > is fine since the model doesn't care about line breaks. | would matter if the exact formatting needed to be preserved (e.g., code blocks, numbered lists where you
want literal newlines).
In this case either works, but > is the more natural choice for prose.
Opus 4.6 says that > is better for prose, but I disagree. The prompt here has an ordered list and an intentional new line. If models care about all caps and bold, we should also carry over lists and paragraphs.
| custom_prompt: > | |
| custom_prompt: | |
There was a problem hiding this comment.
Agreed — the prompt has an ordered list and a blank paragraph separator, so | (literal scalar) is the right choice to preserve that structure. Changed in 0bbf3dd.
Head branch was pushed to by a user without write access
Improve the custom prompt to ignore non-real failures (the intentional exit 1 trigger, Danger PR Check, and broken/skipped jobs) and respond briefly when no actual failures exist. Migrate model from Sonnet 4.5 to Sonnet 4.6. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use iangmaia/claude-summarize fork which adds build_log_mode, max_log_lines, and on-failure trigger. This feeds only failed job logs to Claude (capped at 1500 lines) for more focused analysis with less noise. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add upload-claude-analysis.sh that checks non-essential step outcomes before uploading the Claude analysis pipeline. When only Danger or other non-critical jobs failed, the pipeline is not uploaded — no analysis, no annotation, no PR comment. Also simplify the custom prompt now that Danger is filtered upstream. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When real failures co-occur with a Danger failure, Claude should still ignore Danger rather than wasting analysis space on it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…L scalar Agent-Logs-Url: https://github.com/wordpress-mobile/WordPress-Android/sessions/f2282ed4-20ab-42a2-b37b-6fac3bfa3514 Co-authored-by: mokagio <1218433+mokagio@users.noreply.github.com>
…tial co-failures Agent-Logs-Url: https://github.com/wordpress-mobile/WordPress-Android/sessions/1f6e9e79-7580-4d15-97d2-c85bb5452559 Co-authored-by: mokagio <1218433+mokagio@users.noreply.github.com>
The jq filter only matched state == "failed", missing jobs with state == "timed_out". This could cause Claude to be skipped when a real job times out alongside a non-essential failure like Danger. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6950c3a to
3e7fd18
Compare
|
Thanks @iangmaia ! I opened #22779 to test this. I verified the build analysis runs when there are build failures. Unfortunately, the script did not behave as expected when only Danger failed and still run the build failure analysis
I had Claude monitor the pipeline and this is what it guessed |
`buildkite-agent step get outcome` returns `hard_failed`, not `failed`. The wrong string meant non-essential failures were never detected. --- Generated with the help of Claude Code, https://claude.ai/code Co-Authored-By: Claude Code Opus 4.6 <noreply@anthropic.com>
Query total failures first via the API. When zero, exit early instead of falling through to the non-essential check which assumed failures. --- Generated with the help of Claude Code, https://claude.ai/code Co-Authored-By: Claude Code Opus 4.6 <noreply@anthropic.com>
|
@iangmaia sorry, I went a bit rogue here and committed directly on your PR. Copilot had already added commits here so I just rolled with it. Besides, this seemed to me like mostly a matter of trial and error with the script till the criteria were met, more that a whole design overhaul. |






Description
Improves the Claude Build Analysis Buildkite integration:
iangmaia/claude-summarizefork withbuild_log_mode: "failed"andmax_log_lines: 1500for focused log analysis|) to preserve list formatting.timed_outjobs included in failure count.Test Steps
RELEASE-NOTES.txtif necessary.