[ci] Fixed CI transient error re-run mechanism#650
Conversation
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 29 minutes and 22 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe PR refines the CI failure bot's log classification logic to distinguish between strict test failure markers ( Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Code Review SummaryStatus: No Issues Found | Recommendation: Merge Excellent improvements in this iteration:
The two-tier marker approach is well-tested and ready for production. Files Reviewed (3 files)
Reviewed by kimi-k2.5-0127 · 149,977 tokens |
Separated test errors and transient error marker keywords to allow re-run
c415e07 to
ea267c6
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
tested retry mechanism here
|
Added test to prevent contention between conventional error marking and transient error marking
nemesifier
left a comment
There was a problem hiding this comment.
@stktyagi this PR fails to explain the problem that is causing the issue in the first place.
Whenever you open a PR, please explain the problem carefully.
As I do not know what exact problem was causing the auto-restart to not work, I cannot confidently know if this solves it properly.
| if is_transient: | ||
| for t_marker in TRANSIENT_FAILURE_MARKERS: | ||
| pattern = re.compile(re.escape(t_marker), re.IGNORECASE) | ||
| cleaned_body = pattern.sub("", cleaned_body) |
There was a problem hiding this comment.
Are you stripping out the transient marker right?
My concern is: the transient markers do not contain full errors, just part of it, does it make sense to strip something only partially?
There was a problem hiding this comment.
As it was mostly an issue related to the test failure keywords blocking the transient re-run it does fix most failure cases but I asked AI to create edge cases where it would fail and one particular edge case is
A network crash prints: "Traceback (most recent call last): ...
Because of errors like this, instead of stripping we can try this approach -
- We can categorize failure markers into "Strict" (FAIL:, AssertionError) which always blocks retries, and "Generic" (ERROR:, Traceback) which can result from either real bugs or infrastructure crashes.
- Now, if a transient crash is detected, the bot would leave the logs completely unaltered and safely "forgives" the generic markers so the auto-retry can successfully trigger.
I'll test this once in bot-testing-ground and update the PR.
Understood! As there was no preceding issue for this I must have missed issue explanation properly. I'll update the the description and make sure going forward |
R^Clace regex stripping with strict/generic failure categorization
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/reusable-bot-ci-failure.yml:
- Line 55: The workflow checkout input currently sets "ref:
fix/transient-error-rerun" which will fail after the branch is deleted; change
the checkout step to use "ref: master" (or remove the ref input so it defaults
to the repository's default branch) where the current "ref:
fix/transient-error-rerun" appears to ensure trusted_scripts/analyze loads
correctly when merged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: bdbfdec8-815b-4250-8c44-ed74a311da7e
📒 Files selected for processing (3)
.github/actions/bot-ci-failure/analyze_failure.py.github/actions/bot-ci-failure/test_analyze_failure.py.github/workflows/reusable-bot-ci-failure.yml
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
- GitHub Check: Python==3.12 | django~=4.2.0
- GitHub Check: Python==3.12 | django~=5.0.0
- GitHub Check: Python==3.12 | django~=5.1.0
- GitHub Check: Python==3.10 | django~=5.1.0
- GitHub Check: Python==3.11 | django~=4.2.0
- GitHub Check: Python==3.10 | django~=5.0.0
- GitHub Check: Python==3.13 | django~=5.1.0
- GitHub Check: Python==3.13 | django~=5.2.0
- GitHub Check: Python==3.10 | django~=4.2.0
- GitHub Check: Python==3.12 | django~=5.2.0
- GitHub Check: Python==3.10 | django~=5.2.0
- GitHub Check: Python==3.11 | django~=5.0.0
- GitHub Check: Python==3.11 | django~=5.2.0
- GitHub Check: Python==3.11 | django~=5.1.0
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (actions)
- GitHub Check: Analyze (python)
- GitHub Check: Kilo Code Review
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2026-03-24T16:40:15.783Z
Learnt from: stktyagi
Repo: openwisp/openwisp-utils PR: 631
File: .github/workflows/bot-ci-failure.yml:71-74
Timestamp: 2026-03-24T16:40:15.783Z
Learning: In openwisp/openwisp-utils reusable-bot-ci-failure.yml, the auto-retry step uses a custom GitHub App token (generated from APP_ID and PRIVATE_KEY secrets) passed to the gh CLI via the GH_TOKEN environment variable. This token is NOT subject to the workflow's `permissions` block (which only restricts GITHUB_TOKEN). Therefore, having `actions: read` in the reusable workflow's permissions block does NOT break the auto-retry `gh api -X POST` call — it authenticates with the App's own permissions. Do not flag this as a permissions issue.
Applied to files:
.github/workflows/reusable-bot-ci-failure.yml
📚 Learning: 2026-03-05T20:55:29.968Z
Learnt from: nemesifier
Repo: openwisp/openwisp-utils PR: 0
File: :0-0
Timestamp: 2026-03-05T20:55:29.968Z
Learning: For the openwisp/openwisp-utils repository (and the broader openwisp org), the maintainers have explicitly decided to use mutable version tags (e.g. `v2`, `v6`) for GitHub Actions and to use `ref: master` for internal script checkouts. Do NOT flag unpinned action versions or `master`/version-tag usage as security issues in any openwisp-utils workflow file. The maintainers are aware of the trade-offs and are intentionally fine with this approach.
Applied to files:
.github/workflows/reusable-bot-ci-failure.yml
📚 Learning: 2026-03-05T09:38:10.320Z
Learnt from: pushpitkamboj
Repo: openwisp/openwisp-utils PR: 584
File: .github/workflows/reusable-bot-changelog.yml:49-49
Timestamp: 2026-03-05T09:38:10.320Z
Learning: In openwisp-utils, PR title prefixes are strictly limited to `[feature]`, `[fix]`, and `[change]` (exact bracketed tags, no scoping/sub-types). The regex `^\[(feature|fix|change)\]` in `.github/workflows/reusable-bot-changelog.yml` is intentional and correct — scoped variants like `[feature/bots]` are not valid and should not be matched.
Applied to files:
.github/workflows/reusable-bot-ci-failure.yml
📚 Learning: 2026-03-14T20:44:14.568Z
Learnt from: CR
Repo: openwisp/openwisp-utils PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-03-14T20:44:14.568Z
Learning: Cryptic or non-obvious code (regex, complex bash commands, or hard-to-read code) must include a concise comment explaining why it is needed and why the complexity is acceptable
Applied to files:
.github/actions/bot-ci-failure/analyze_failure.py
🔇 Additional comments (3)
.github/actions/bot-ci-failure/test_analyze_failure.py (1)
291-299: LGTM — directly exercises the regression this PR fixes.Good coverage for the case where a transient infrastructure log contains the
ERROR:keyword that previously blocked auto-retry. The assertions correctly validate bothtransient_only=Trueandtests_failed=False..github/actions/bot-ci-failure/analyze_failure.py (2)
9-24: Clean separation of strict vs. generic markers.The split cleanly addresses the root cause:
ERROR:alone should no longer force a test-failure classification when a transient marker is present. Keeping the combinedTEST_FAILURE_MARKERStuple preserves_extract_failed_testsbehavior for block isolation. Comments are helpful.
159-184: Categorization logic is correct; the three-way branching and prompt rule are sound.The implementation correctly separates strict ("FAIL:", "AssertionError") from generic ("ERROR:", "Traceback") markers. The transient counter correctly increments only when
is_transient and not job_has_test_failure, preventing jobs with strict failures from being miscounted as transient-only.The prompt rule at lines 393–395 is appropriately defensive for the scenario where a single job body contains both transient infrastructure noise and genuine test failures (e.g., both a network error AND an AssertionError). In this case,
tests_failed=Truebut transient noise is still surfaced to the LLM, making the rule necessary to guide the contributor appropriately.No other code references the combined
TEST_FAILURE_MARKERSwith assumptions about its composition, confirming the split into separate markers is clean.
Added 4 tests to cover each failure case and changed ref to master.
|
@nemesifier |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|


Checklist
Description of Changes
The auto-retry failed to trigger on transient infrastructure crashes (like network drops) because their log outputs contained the word "ERROR:". Since "ERROR:" was globally listed as a test failure marker, the script falsely flagged these transient issues as real code bugs and aborted the retry. (e.g.,
"ERROR: Could not install packages due to an OSError").To fix this:
Failure markers are categorized into Strict (e.g., FAIL:, which always block retries) and Generic (e.g., "ERROR:", which can be bugs or infrastructure crashes). Now, if the script detects a known transient crash, it safely "forgives" any generic markers and successfully triggers the auto-retry.