[ci] Fixed CI transient error re-run mechanism by stktyagi · Pull Request #650 · openwisp/openwisp-utils

stktyagi · 2026-04-19T18:38:08Z

Checklist

I have read the OpenWISP Contributing Guidelines.
I have manually tested the changes proposed in this pull request.
I have written new test cases for new code and/or updated existing tests for changes to existing code.
I have updated the documentation.

Description of Changes

The auto-retry failed to trigger on transient infrastructure crashes (like network drops) because their log output contained the word "ERROR:" (e.g., "ERROR: Could not install packages due to an OSError"). Since "ERROR:" was globally listed as a test failure marker, the script falsely flagged these transient issues as real code bugs and aborted the retry.
To fix this:
Failure markers are now categorized into Strict (e.g., FAIL:, which always block retries) and Generic (e.g., "ERROR:", which can come from either real bugs or infrastructure crashes). If the script detects a known transient crash, it "forgives" any generic markers and triggers the auto-retry.
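
The two-tier idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in analyze_failure.py: the function names `classify_log` and `should_retry` are hypothetical, and the single transient marker is an assumption based on the example log line quoted in this description.

```python
# Marker names taken from the PR discussion; the transient marker and
# both function names are hypothetical and used for illustration only.
STRICT_MARKERS = ("FAIL:", "AssertionError")
GENERIC_MARKERS = ("ERROR:", "Traceback")
TRANSIENT_MARKERS = ("Could not install packages due to an OSError",)


def classify_log(log_text):
    """Return (tests_failed, is_transient) for a single job log."""
    has_strict = any(m in log_text for m in STRICT_MARKERS)
    has_generic = any(m in log_text for m in GENERIC_MARKERS)
    lowered = log_text.lower()
    is_transient = any(m.lower() in lowered for m in TRANSIENT_MARKERS)
    # Strict markers always count as a real failure.
    # Generic markers count only when no transient crash was detected.
    tests_failed = has_strict or (has_generic and not is_transient)
    return tests_failed, is_transient


def should_retry(log_text):
    """Retry only when the log is transient and shows no real failure."""
    tests_failed, is_transient = classify_log(log_text)
    return is_transient and not tests_failed
```

Under these assumptions, a log containing only "ERROR: Could not install packages due to an OSError" would be retried, while the same log with an added "FAIL:" line would not.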

coderabbitai · 2026-04-19T18:38:27Z

Warning

Rate limit exceeded

@stktyagi has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 29 minutes and 22 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 29 minutes and 22 seconds.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 130791e2-547b-45d1-89f5-4c1af8fe9b77

📥 Commits

Reviewing files that changed from the base of the PR and between 7d5f12b and c8256c1.

📒 Files selected for processing (1)

.github/actions/bot-ci-failure/test_analyze_failure.py

📝 Walkthrough

Walkthrough

The PR refines the CI failure bot's log classification logic to distinguish between strict test failure markers (FAIL:, FAILED(, AssertionError), generic error markers (ERROR:, Traceback), and transient infrastructure markers. The process_error_logs function now applies conditional logic to determine whether a job represents real test failures or transient issues, with strict markers always forcing a test-failure classification and transient markers suppressing generic error interpretation. A corresponding unit test verifies transient marker detection. The reusable workflow is updated to checkout from a different branch. The LLM system prompt is enhanced with instructions to prioritize real test failures over infrastructure issues when both exist.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

[ci] Added Selenium connection failure to transient markers #648: Modifies analyze_failure.py's failure-classification logic to distinguish transient vs. real test failures, directly parallel to this PR's control flow changes.
[chores:fix] Fixed ci failure bot's ref path #606: Changes the checkout ref in the same reusable workflow file (reusable-bot-ci-failure.yml).
[ci] CI Failure Bot: run only on pull requests + OSError transient marker #627: Updates transient failure detection in analyze_failure.py and modifies the same CI failure bot.

Suggested labels

bug, github_actions, helper-bots

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error)

Check name	Status	Explanation	Resolution
Bug Fixes	❌ Error	Workflow file references temporary feature branch fix/transient-error-rerun instead of master, will break after branch deletion.	Update .github/workflows/reusable-bot-ci-failure.yml line 55 to change ref: fix/transient-error-rerun to ref: master before merge.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title follows the required format [type] with descriptive text and directly addresses the main change: fixing the CI transient error re-run mechanism.
Description check	✅ Passed	The description covers the key sections including completed checklist items, clear explanation of the issue and fix approach, but lacks a reference to a linked issue number.

coveralls · 2026-04-19T18:43:34Z

coverage: 97.529%. remained the same — fix/transient-error-rerun into master

kilo-code-bot · 2026-04-19T18:49:27Z

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Excellent improvements in this iteration:

Comprehensive test coverage - 4 new tests covering all scenarios:
- Pure transient errors allow retry
- Transient crashes with generic tracebacks don't falsely trigger failures
- Strict failures block retry even with transient errors present
- Generic tracebacks without transient errors are flagged as real bugs
Workflow ready for merge - ref restored to master

The two-tier marker approach is well-tested and ready for production.

Files Reviewed (3 files)

.github/actions/bot-ci-failure/analyze_failure.py - Two-tier marker logic
.github/workflows/reusable-bot-ci-failure.yml - Restored to master ref
.github/actions/bot-ci-failure/test_analyze_failure.py - Comprehensive test coverage

Reviewed by kimi-k2.5-0127 · 149,977 tokens

Separated test errors and transient error marker keywords to allow re-run

stktyagi · 2026-04-19T19:02:45Z

@coderabbitai review

coderabbitai · 2026-04-19T19:02:51Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

stktyagi · 2026-04-19T19:06:32Z

tested retry mechanism here

Added test to prevent contention between conventional error marking and transient error marking

nemesifier

@stktyagi this PR fails to explain the problem that is causing the issue in the first place.
Whenever you open a PR, please explain the problem carefully.

As I do not know what exact problem was causing the auto-restart to not work, I cannot confidently know if this solves it properly.

nemesifier · 2026-04-20T17:58:44Z

+        if is_transient:
+            for t_marker in TRANSIENT_FAILURE_MARKERS:
+                pattern = re.compile(re.escape(t_marker), re.IGNORECASE)
+                cleaned_body = pattern.sub("", cleaned_body)


You are stripping out the transient marker here, right?

My concern is: the transient markers do not contain full errors, just parts of them. Does it make sense to strip something only partially?

Since the issue was mostly the test failure keywords blocking the transient re-run, stripping does fix most failure cases. However, I asked an AI to generate edge cases where it would fail, and one particular edge case is:
A network crash prints: "Traceback (most recent call last): ...
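
To make the edge case concrete, here is a small sketch of why stripping alone can fall short. The marker values and the helper names `strip_transient_markers` and `leftover_generic_markers` are assumptions for illustration; only the loop body mirrors the diff quoted in the review comment.

```python
import re

# Hypothetical values for illustration only; the real tuples may differ.
TRANSIENT_FAILURE_MARKERS = ("Could not install packages due to an OSError",)
GENERIC_MARKERS = ("ERROR:", "Traceback")


def strip_transient_markers(body):
    """Remove each transient marker substring, as in the quoted diff."""
    for t_marker in TRANSIENT_FAILURE_MARKERS:
        pattern = re.compile(re.escape(t_marker), re.IGNORECASE)
        body = pattern.sub("", body)
    return body


def leftover_generic_markers(body):
    """List generic markers still present after stripping."""
    stripped = strip_transient_markers(body)
    return [m for m in GENERIC_MARKERS if m in stripped]


log = (
    "Traceback (most recent call last):\n"
    "ERROR: Could not install packages due to an OSError\n"
)
# The transient text is removed, but the generic markers around it remain
# and would still be read as a test failure.
print(leftover_generic_markers(log))  # ['ERROR:', 'Traceback']
```

Because the marker is only part of the error line, the surrounding "ERROR:" prefix and the traceback header survive the substitution.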

Because of errors like this, instead of stripping we can try this approach:

We can categorize failure markers into "Strict" (FAIL:, AssertionError), which always block retries, and "Generic" (ERROR:, Traceback), which can result from either real bugs or infrastructure crashes.

Now, if a transient crash is detected, the bot would leave the logs completely unaltered and safely "forgive" the generic markers so the auto-retry can trigger.

I'll test this once in bot-testing-ground and update the PR.

stktyagi · 2026-04-21T01:33:33Z

@stktyagi this PR fails to explain the problem that is causing the issue in the first place. Whenever you open a PR, please explain the problem carefully.

As I do not know what exact problem was causing the auto-restart to not work, I cannot confidently know if this solves it properly.

Understood! As there was no preceding issue for this, I failed to explain the problem properly. I'll update the description and make sure to explain the problem carefully going forward.

Replace regex stripping with strict/generic failure categorization

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/reusable-bot-ci-failure.yml:
- Line 55: The workflow checkout input currently sets "ref:
fix/transient-error-rerun" which will fail after the branch is deleted; change
the checkout step to use "ref: master" (or remove the ref input so it defaults
to the repository's default branch) where the current "ref:
fix/transient-error-rerun" appears to ensure trusted_scripts/analyze loads
correctly when merged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: bdbfdec8-815b-4250-8c44-ed74a311da7e

📥 Commits

Reviewing files that changed from the base of the PR and between 6a06036 and 7d5f12b.

📒 Files selected for processing (3)

.github/actions/bot-ci-failure/analyze_failure.py
.github/actions/bot-ci-failure/test_analyze_failure.py
.github/workflows/reusable-bot-ci-failure.yml

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)

GitHub Check: Python==3.12 | django~=4.2.0
GitHub Check: Python==3.12 | django~=5.0.0
GitHub Check: Python==3.12 | django~=5.1.0
GitHub Check: Python==3.10 | django~=5.1.0
GitHub Check: Python==3.11 | django~=4.2.0
GitHub Check: Python==3.10 | django~=5.0.0
GitHub Check: Python==3.13 | django~=5.1.0
GitHub Check: Python==3.13 | django~=5.2.0
GitHub Check: Python==3.10 | django~=4.2.0
GitHub Check: Python==3.12 | django~=5.2.0
GitHub Check: Python==3.10 | django~=5.2.0
GitHub Check: Python==3.11 | django~=5.0.0
GitHub Check: Python==3.11 | django~=5.2.0
GitHub Check: Python==3.11 | django~=5.1.0
GitHub Check: Analyze (javascript-typescript)
GitHub Check: Analyze (actions)
GitHub Check: Analyze (python)
GitHub Check: Kilo Code Review

🧰 Additional context used

🧠 Learnings (4)

📚 Learning: 2026-03-24T16:40:15.783Z

Learnt from: stktyagi
Repo: openwisp/openwisp-utils PR: 631
File: .github/workflows/bot-ci-failure.yml:71-74
Timestamp: 2026-03-24T16:40:15.783Z
Learning: In openwisp/openwisp-utils reusable-bot-ci-failure.yml, the auto-retry step uses a custom GitHub App token (generated from APP_ID and PRIVATE_KEY secrets) passed to the gh CLI via the GH_TOKEN environment variable. This token is NOT subject to the workflow's `permissions` block (which only restricts GITHUB_TOKEN). Therefore, having `actions: read` in the reusable workflow's permissions block does NOT break the auto-retry `gh api -X POST` call — it authenticates with the App's own permissions. Do not flag this as a permissions issue.

Applied to files:

.github/workflows/reusable-bot-ci-failure.yml

📚 Learning: 2026-03-05T20:55:29.968Z

Learnt from: nemesifier
Repo: openwisp/openwisp-utils PR: 0
File: :0-0
Timestamp: 2026-03-05T20:55:29.968Z
Learning: For the openwisp/openwisp-utils repository (and the broader openwisp org), the maintainers have explicitly decided to use mutable version tags (e.g. `v2`, `v6`) for GitHub Actions and to use `ref: master` for internal script checkouts. Do NOT flag unpinned action versions or `master`/version-tag usage as security issues in any openwisp-utils workflow file. The maintainers are aware of the trade-offs and are intentionally fine with this approach.

Applied to files:

.github/workflows/reusable-bot-ci-failure.yml

📚 Learning: 2026-03-05T09:38:10.320Z

Learnt from: pushpitkamboj
Repo: openwisp/openwisp-utils PR: 584
File: .github/workflows/reusable-bot-changelog.yml:49-49
Timestamp: 2026-03-05T09:38:10.320Z
Learning: In openwisp-utils, PR title prefixes are strictly limited to `[feature]`, `[fix]`, and `[change]` (exact bracketed tags, no scoping/sub-types). The regex `^\[(feature|fix|change)\]` in `.github/workflows/reusable-bot-changelog.yml` is intentional and correct — scoped variants like `[feature/bots]` are not valid and should not be matched.

Applied to files:

.github/workflows/reusable-bot-ci-failure.yml

📚 Learning: 2026-03-14T20:44:14.568Z

Learnt from: CR
Repo: openwisp/openwisp-utils PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-03-14T20:44:14.568Z
Learning: Cryptic or non-obvious code (regex, complex bash commands, or hard-to-read code) must include a concise comment explaining why it is needed and why the complexity is acceptable

Applied to files:

.github/actions/bot-ci-failure/analyze_failure.py

🔇 Additional comments (3)

.github/actions/bot-ci-failure/test_analyze_failure.py (1)

291-299: LGTM — directly exercises the regression this PR fixes.

Good coverage for the case where a transient infrastructure log contains the ERROR: keyword that previously blocked auto-retry. The assertions correctly validate both transient_only=True and tests_failed=False.

.github/actions/bot-ci-failure/analyze_failure.py (2)

9-24: Clean separation of strict vs. generic markers.

The split cleanly addresses the root cause: ERROR: alone should no longer force a test-failure classification when a transient marker is present. Keeping the combined TEST_FAILURE_MARKERS tuple preserves _extract_failed_tests behavior for block isolation. Comments are helpful.

159-184: Categorization logic is correct; the three-way branching and prompt rule are sound.

The implementation correctly separates strict ("FAIL:", "AssertionError") from generic ("ERROR:", "Traceback") markers. The transient counter correctly increments only when is_transient and not job_has_test_failure, preventing jobs with strict failures from being miscounted as transient-only.

The prompt rule at lines 393–395 is appropriately defensive for the scenario where a single job body contains both transient infrastructure noise and genuine test failures (e.g., both a network error AND an AssertionError). In this case, tests_failed=True but transient noise is still surfaced to the LLM, making the rule necessary to guide the contributor appropriately.

No other code references the combined TEST_FAILURE_MARKERS with assumptions about its composition, confirming the split into separate markers is clean.
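
The three-way branching and the transient counter described in this comment could look roughly like the sketch below. It is an illustration under assumptions: `summarize_jobs`, the transient marker, and the exact definition of `transient_only` (every job is transient and none has a test failure) are hypothetical and may not match process_error_logs.

```python
STRICT_MARKERS = ("FAIL:", "AssertionError")
GENERIC_MARKERS = ("ERROR:", "Traceback")
# Hypothetical transient marker, for illustration only.
TRANSIENT_MARKERS = ("Could not install packages due to an OSError",)


def summarize_jobs(job_logs):
    """Aggregate per-job classification into run-level flags."""
    tests_failed = False
    transient_count = 0
    for body in job_logs:
        is_transient = any(m in body for m in TRANSIENT_MARKERS)
        has_strict = any(m in body for m in STRICT_MARKERS)
        has_generic = any(m in body for m in GENERIC_MARKERS)
        if has_strict:
            # Strict markers always mean a real test failure.
            job_has_test_failure = True
        elif has_generic and not is_transient:
            # Generic markers count only without a transient crash.
            job_has_test_failure = True
        else:
            job_has_test_failure = False
        if job_has_test_failure:
            tests_failed = True
        # Count a job as transient only when it has no real failure.
        if is_transient and not job_has_test_failure:
            transient_count += 1
    # Assumption: transient_only means every job was purely transient.
    transient_only = bool(job_logs) and transient_count == len(job_logs)
    return {"tests_failed": tests_failed, "transient_only": transient_only}
```

With this shape, a job whose log holds both a network error and an AssertionError sets `tests_failed` without being counted as transient, which matches the behavior the comment describes.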

Added 4 tests to cover each failure case and changed ref to master.

stktyagi · 2026-04-21T05:19:48Z

@nemesifier
I've manually tested for multiple failure cases here in a new PR for readability.

stktyagi · 2026-04-21T05:22:34Z

@coderabbitai review

coderabbitai · 2026-04-21T05:22:42Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

nemesifier

Thanks @stktyagi 🙏

[fix] Fix re-run mechanism for transient errors

ea267c6

Separated test errors and transient error marker keywords to allow re-run

stktyagi force-pushed the fix/transient-error-rerun branch from c415e07 to ea267c6 April 19, 2026 18:57

stktyagi changed the title ~~Fix/transient error rerun~~ [ci] Fixed CI transient error re-run mechanism Apr 19, 2026

[fix] Added tests for transient error marking

698bf35

Added test to prevent contention between conventional error marking and transient error marking

nemesifier reviewed Apr 20, 2026

nemesifier assigned stktyagi and nemesifier Apr 20, 2026

[ci] Refactored transient re-run logic case

7d5f12b

Replace regex stripping with strict/generic failure categorization

coderabbitai Bot added the bug, github_actions, and helper-bots labels Apr 21, 2026

coderabbitai Bot requested changes Apr 21, 2026

Comment thread .github/workflows/reusable-bot-ci-failure.yml Outdated

[ci] Added tests and change ref to master

c8256c1

Added 4 tests to cover each failure case and changed ref to master.

nemesifier approved these changes Apr 21, 2026

nemesifier merged commit 00bdd53 into master Apr 21, 2026
37 checks passed

nemesifier deleted the fix/transient-error-rerun branch April 21, 2026 15:12

coderabbitai Bot mentioned this pull request Apr 24, 2026

[fix] Removed 'FAILED (' from strict markers to unblock auto-retry #655

Open

4 tasks
