[ci] Fixed CI transient error re-run mechanism #650

Merged
nemesifier merged 4 commits into master from fix/transient-error-rerun on Apr 21, 2026

@stktyagi
Member

@stktyagi stktyagi commented Apr 19, 2026

Checklist

  • I have read the OpenWISP Contributing Guidelines.
  • I have manually tested the changes proposed in this pull request.
  • I have written new test cases for new code and/or updated existing tests for changes to existing code.
  • I have updated the documentation.

Description of Changes

The auto-retry failed to trigger on transient infrastructure crashes (such as network drops) because their log output contained the word "ERROR:", for example "ERROR: Could not install packages due to an OSError". Since "ERROR:" was globally listed as a test failure marker, the script falsely flagged these transient issues as real code bugs and aborted the retry.
To fix this:
Failure markers are now categorized into Strict (e.g., "FAIL:", which always blocks retries) and Generic (e.g., "ERROR:", which can come from either real bugs or infrastructure crashes). If the script detects a known transient crash, it "forgives" any generic markers and triggers the auto-retry.
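
The strict/generic split described here can be sketched roughly as follows. This is a minimal illustration and not the code merged in analyze_failure.py: the function name `classify_log`, the tuple names, and the single transient marker (borrowed from the example log line in the description) are assumptions made for the example.

```python
# Marker strings come from the PR discussion; the tuple names are hypothetical.
STRICT_MARKERS = ("FAIL:", "FAILED(", "AssertionError")
GENERIC_MARKERS = ("ERROR:", "Traceback")
# Assumed transient marker, borrowed from the example log line in the description.
TRANSIENT_MARKERS = ("Could not install packages due to an OSError",)


def classify_log(log):
    """Classify one job log as a real test failure, transient-only, or neither."""
    has_strict = any(marker in log for marker in STRICT_MARKERS)
    has_generic = any(marker in log for marker in GENERIC_MARKERS)
    is_transient = any(
        marker.lower() in log.lower() for marker in TRANSIENT_MARKERS
    )
    # Strict markers always count as test failures.
    # Generic markers count only when no transient marker is present.
    tests_failed = has_strict or (has_generic and not is_transient)
    return {
        "tests_failed": tests_failed,
        "transient_only": is_transient and not tests_failed,
    }
```

Under these assumptions, a log containing only "ERROR: Could not install packages due to an OSError" is classified as transient-only, while the same log with an added "FAIL:" line is classified as a test failure and would not be retried.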

@coderabbitai

coderabbitai Bot commented Apr 19, 2026

Warning

Rate limit exceeded

@stktyagi has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 29 minutes and 22 seconds before requesting another review.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 130791e2-547b-45d1-89f5-4c1af8fe9b77

📥 Commits

Reviewing files that changed from the base of the PR and between 7d5f12b and c8256c1.

📒 Files selected for processing (1)
  • .github/actions/bot-ci-failure/test_analyze_failure.py
📝 Walkthrough

The PR refines the CI failure bot's log classification logic to distinguish between strict test failure markers (FAIL:, FAILED(, AssertionError), generic error markers (ERROR:, Traceback), and transient infrastructure markers. The process_error_logs function now applies conditional logic to determine whether a job represents real test failures or transient issues, with strict markers always forcing a test-failure classification and transient markers suppressing generic error interpretation. A corresponding unit test verifies transient marker detection. The reusable workflow is updated to checkout from a different branch. The LLM system prompt is enhanced with instructions to prioritize real test failures over infrastructure issues when both exist.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

bug, github_actions, helper-bots


Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error)

Check name | Status | Explanation | Resolution
Bug Fixes | ❌ Error | Workflow file references temporary feature branch fix/transient-error-rerun instead of master, will break after branch deletion. | Update .github/workflows/reusable-bot-ci-failure.yml line 55 to change ref: fix/transient-error-rerun to ref: master before merge.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Title check | ✅ Passed | The title follows the required format [type] with descriptive text and directly addresses the main change: fixing the CI transient error re-run mechanism.
Description check | ✅ Passed | The description covers the key sections including completed checklist items, clear explanation of the issue and fix approach, but lacks a reference to a linked issue number.

@coveralls

coveralls commented Apr 19, 2026

Coverage Status

coverage: 97.529% (remained the same) for fix/transient-error-rerun into master

@kilo-code-bot

kilo-code-bot Bot commented Apr 19, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Excellent improvements in this iteration:

  1. Comprehensive test coverage - 4 new tests covering all scenarios:

    • Pure transient errors allow retry
    • Transient crashes with generic tracebacks don't falsely trigger failures
    • Strict failures block retry even with transient errors present
    • Generic tracebacks without transient errors are flagged as real bugs
  2. Workflow ready for merge - ref restored to master

The two-tier marker approach is well-tested and ready for production.

Files Reviewed (3 files)
  • .github/actions/bot-ci-failure/analyze_failure.py - Two-tier marker logic
  • .github/workflows/reusable-bot-ci-failure.yml - Restored to master ref
  • .github/actions/bot-ci-failure/test_analyze_failure.py - Comprehensive test coverage

Reviewed by kimi-k2.5-0127 · 149,977 tokens

Separated test errors and transient error marker keywords to allow re-run
@stktyagi force-pushed the fix/transient-error-rerun branch from c415e07 to ea267c6 on April 19, 2026 18:57
@stktyagi
Member Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 19, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@stktyagi
Member Author

tested retry mechanism here

image
image

@stktyagi changed the title from "Fix/transient error rerun" to "[ci] Fixed CI transient error re-run mechanism" on Apr 19, 2026
Added test to prevent contention between conventional error marking and transient error marking
Member

@nemesifier nemesifier left a comment

@stktyagi this PR fails to explain the problem that is causing the issue in the first place.
Whenever you open a PR, please explain the problem carefully.

As I do not know what exact problem was causing the auto-restart to not work, I cannot confidently know if this solves it properly.

Comment thread .github/actions/bot-ci-failure/analyze_failure.py
if is_transient:
    for t_marker in TRANSIENT_FAILURE_MARKERS:
        pattern = re.compile(re.escape(t_marker), re.IGNORECASE)
        cleaned_body = pattern.sub("", cleaned_body)
Member

You are stripping out the transient markers, right?

My concern is: the transient markers do not contain the full error messages, only part of them. Does it make sense to strip something only partially?
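
To illustrate the concern, here is a minimal sketch. It assumes the transient marker is only the phrase that follows the "ERROR:" prefix; the actual contents of TRANSIENT_FAILURE_MARKERS are not shown in this thread, and the sample log line and helper name are made up for the example. Stripping the marker removes part of the line but leaves the generic "ERROR:" prefix behind, so a later keyword check would still match.

```python
import re

# Assumed marker: a partial phrase, not the full error line.
TRANSIENT_FAILURE_MARKERS = ("Could not install packages due to an OSError",)
GENERIC_MARKERS = ("ERROR:", "Traceback")


def strip_transient_markers(body):
    """Remove every transient marker phrase from the log body, ignoring case."""
    cleaned = body
    for marker in TRANSIENT_FAILURE_MARKERS:
        cleaned = re.sub(re.escape(marker), "", cleaned, flags=re.IGNORECASE)
    return cleaned


# Hypothetical log line for illustration only.
log_line = (
    "ERROR: Could not install packages due to an OSError: "
    "[Errno 101] Network is unreachable"
)
cleaned = strip_transient_markers(log_line)
# cleaned == "ERROR: : [Errno 101] Network is unreachable"

# The generic prefix survives the stripping and would still be detected.
leftover = [marker for marker in GENERIC_MARKERS if marker in cleaned]
# leftover == ["ERROR:"]
```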

Member Author

@stktyagi stktyagi Apr 21, 2026

Since the issue was mostly caused by the test failure keywords blocking the transient re-run, this does fix most failure cases. However, I asked an AI to generate edge cases where it would fail, and one particular edge case is:
A network crash prints: "Traceback (most recent call last): ...

Because of errors like this, instead of stripping we can try this approach:

  1. We can categorize failure markers into "Strict" (FAIL:, AssertionError), which always block retries, and "Generic" (ERROR:, Traceback), which can result from either real bugs or infrastructure crashes.
  2. If a transient crash is detected, the bot would leave the logs completely unaltered and "forgive" the generic markers so that the auto-retry can trigger.

I'll test this once in bot-testing-ground and update the PR.

@stktyagi
Member Author

@stktyagi this PR fails to explain the problem that is causing the issue in the first place. Whenever you open a PR, please explain the problem carefully.

As I do not know what exact problem was causing the auto-restart to not work, I cannot confidently know if this solves it properly.

Understood! Since there was no preceding issue for this, I did not explain the problem properly. I'll update the description and make sure to do so going forward.

Replace regex stripping with strict/generic failure categorization
@coderabbitai Bot added the bug, github_actions, and helper-bots labels on Apr 21, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/reusable-bot-ci-failure.yml:
- Line 55: The workflow checkout input currently sets "ref:
fix/transient-error-rerun" which will fail after the branch is deleted; change
the checkout step to use "ref: master" (or remove the ref input so it defaults
to the repository's default branch) where the current "ref:
fix/transient-error-rerun" appears to ensure trusted_scripts/analyze loads
correctly when merged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: bdbfdec8-815b-4250-8c44-ed74a311da7e

📥 Commits

Reviewing files that changed from the base of the PR and between 6a06036 and 7d5f12b.

📒 Files selected for processing (3)
  • .github/actions/bot-ci-failure/analyze_failure.py
  • .github/actions/bot-ci-failure/test_analyze_failure.py
  • .github/workflows/reusable-bot-ci-failure.yml
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: Python==3.12 | django~=4.2.0
  • GitHub Check: Python==3.12 | django~=5.0.0
  • GitHub Check: Python==3.12 | django~=5.1.0
  • GitHub Check: Python==3.10 | django~=5.1.0
  • GitHub Check: Python==3.11 | django~=4.2.0
  • GitHub Check: Python==3.10 | django~=5.0.0
  • GitHub Check: Python==3.13 | django~=5.1.0
  • GitHub Check: Python==3.13 | django~=5.2.0
  • GitHub Check: Python==3.10 | django~=4.2.0
  • GitHub Check: Python==3.12 | django~=5.2.0
  • GitHub Check: Python==3.10 | django~=5.2.0
  • GitHub Check: Python==3.11 | django~=5.0.0
  • GitHub Check: Python==3.11 | django~=5.2.0
  • GitHub Check: Python==3.11 | django~=5.1.0
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (actions)
  • GitHub Check: Analyze (python)
  • GitHub Check: Kilo Code Review
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2026-03-24T16:40:15.783Z
Learnt from: stktyagi
Repo: openwisp/openwisp-utils PR: 631
File: .github/workflows/bot-ci-failure.yml:71-74
Timestamp: 2026-03-24T16:40:15.783Z
Learning: In openwisp/openwisp-utils reusable-bot-ci-failure.yml, the auto-retry step uses a custom GitHub App token (generated from APP_ID and PRIVATE_KEY secrets) passed to the gh CLI via the GH_TOKEN environment variable. This token is NOT subject to the workflow's `permissions` block (which only restricts GITHUB_TOKEN). Therefore, having `actions: read` in the reusable workflow's permissions block does NOT break the auto-retry `gh api -X POST` call — it authenticates with the App's own permissions. Do not flag this as a permissions issue.

Applied to files:

  • .github/workflows/reusable-bot-ci-failure.yml
📚 Learning: 2026-03-05T20:55:29.968Z
Learnt from: nemesifier
Repo: openwisp/openwisp-utils PR: 0
File: :0-0
Timestamp: 2026-03-05T20:55:29.968Z
Learning: For the openwisp/openwisp-utils repository (and the broader openwisp org), the maintainers have explicitly decided to use mutable version tags (e.g. `v2`, `v6`) for GitHub Actions and to use `ref: master` for internal script checkouts. Do NOT flag unpinned action versions or `master`/version-tag usage as security issues in any openwisp-utils workflow file. The maintainers are aware of the trade-offs and are intentionally fine with this approach.

Applied to files:

  • .github/workflows/reusable-bot-ci-failure.yml
📚 Learning: 2026-03-05T09:38:10.320Z
Learnt from: pushpitkamboj
Repo: openwisp/openwisp-utils PR: 584
File: .github/workflows/reusable-bot-changelog.yml:49-49
Timestamp: 2026-03-05T09:38:10.320Z
Learning: In openwisp-utils, PR title prefixes are strictly limited to `[feature]`, `[fix]`, and `[change]` (exact bracketed tags, no scoping/sub-types). The regex `^\[(feature|fix|change)\]` in `.github/workflows/reusable-bot-changelog.yml` is intentional and correct — scoped variants like `[feature/bots]` are not valid and should not be matched.

Applied to files:

  • .github/workflows/reusable-bot-ci-failure.yml
📚 Learning: 2026-03-14T20:44:14.568Z
Learnt from: CR
Repo: openwisp/openwisp-utils PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-03-14T20:44:14.568Z
Learning: Cryptic or non-obvious code (regex, complex bash commands, or hard-to-read code) must include a concise comment explaining why it is needed and why the complexity is acceptable

Applied to files:

  • .github/actions/bot-ci-failure/analyze_failure.py
🔇 Additional comments (3)
.github/actions/bot-ci-failure/test_analyze_failure.py (1)

291-299: LGTM — directly exercises the regression this PR fixes.

Good coverage for the case where a transient infrastructure log contains the ERROR: keyword that previously blocked auto-retry. The assertions correctly validate both transient_only=True and tests_failed=False.

.github/actions/bot-ci-failure/analyze_failure.py (2)

9-24: Clean separation of strict vs. generic markers.

The split cleanly addresses the root cause: ERROR: alone should no longer force a test-failure classification when a transient marker is present. Keeping the combined TEST_FAILURE_MARKERS tuple preserves _extract_failed_tests behavior for block isolation. Comments are helpful.


159-184: Categorization logic is correct; the three-way branching and prompt rule are sound.

The implementation correctly separates strict ("FAIL:", "AssertionError") from generic ("ERROR:", "Traceback") markers. The transient counter correctly increments only when is_transient and not job_has_test_failure, preventing jobs with strict failures from being miscounted as transient-only.

The prompt rule at lines 393–395 is appropriately defensive for the scenario where a single job body contains both transient infrastructure noise and genuine test failures (e.g., both a network error AND an AssertionError). In this case, tests_failed=True but transient noise is still surfaced to the LLM, making the rule necessary to guide the contributor appropriately.

No other code references the combined TEST_FAILURE_MARKERS with assumptions about its composition, confirming the split into separate markers is clean.
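
The counting rule described in this review comment can be sketched as below. This is an illustration under stated assumptions, not the merged implementation: the `JobResult` dataclass, the `summarize_jobs` function, and the `should_retry` rule are hypothetical, and the per-job flags are taken as given rather than derived from log text.

```python
from dataclasses import dataclass


@dataclass
class JobResult:
    """Hypothetical per-job flags; the real script derives these from logs."""

    is_transient: bool
    has_test_failure: bool


def summarize_jobs(jobs):
    """Aggregate per-job flags into a run-level summary."""
    transient_count = 0
    tests_failed = False
    for job in jobs:
        if job.has_test_failure:
            tests_failed = True
        # Count a job as transient only when it has no real test failure.
        if job.is_transient and not job.has_test_failure:
            transient_count += 1
    # Assumption: retry only if something transient happened and nothing really failed.
    should_retry = transient_count > 0 and not tests_failed
    return {
        "transient_count": transient_count,
        "tests_failed": tests_failed,
        "should_retry": should_retry,
    }
```

With these assumptions, a job that shows both a network error and an AssertionError sets `tests_failed` without incrementing the transient counter, so the run is not treated as transient-only.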

Comment thread .github/workflows/reusable-bot-ci-failure.yml Outdated
Added 4 tests to cover each failure case and changed ref to master.
@stktyagi
Member Author

@nemesifier
I've manually tested for multiple failure cases here in a new PR for readability.

@stktyagi
Member Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 21, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Member

@nemesifier nemesifier left a comment

Thanks @stktyagi 🙏

@nemesifier nemesifier merged commit 00bdd53 into master Apr 21, 2026
37 checks passed
@nemesifier nemesifier deleted the fix/transient-error-rerun branch April 21, 2026 15:12
Labels

bug, github_actions, helper-bots

3 participants