[fix] Removed 'FAILED (' from strict markers to unblock auto-retry #655
Conversation
Removed the 'FAILED' keyword from the strict failure markers, which was blocking the re-run mechanism
📝 Walkthrough
Two changes in the CI failure analysis logic and its tests: 'FAILED (' is removed from the strict failure markers, and WebDriverException is added to the transient failure markers. Sequence diagram(s) omitted.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Code Review Summary
Status: No Issues Found | Recommendation: Merge
Overview: the change removes 'FAILED (' from the strict failure markers so that unittest's summary line no longer blocks the auto-retry mechanism.
Files Reviewed (2 files)
Reviewed by kimi-k2.5-0127 · 134,461 tokens
Ensures that the standard 'FAILED (errors=x)' summary appended by the unittest framework does not falsely override transient crash detection and block the auto-retry mechanism.
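For context, here is a minimal sketch of the marker-based classification this commit adjusts. The marker lists and the `classify` helper below are illustrative stand-ins, not the actual code in `analyze_failure.py` (which also returns the processed log text):

```python
# Illustrative marker lists; the real ones live in analyze_failure.py.
STRICT_TEST_FAILURE_MARKERS = ["FAIL: ", "ERROR: "]  # "FAILED (" removed by this fix
TRANSIENT_ERROR_MARKERS = ["selenium.common.exceptions.WebDriverException"]


def classify(log: str) -> tuple[bool, bool]:
    """Return (tests_failed, transient_only) for one failing job's log."""
    has_strict_failure = any(m in log for m in STRICT_TEST_FAILURE_MARKERS)
    is_transient = any(m in log for m in TRANSIENT_ERROR_MARKERS)
    if is_transient and not has_strict_failure:
        return False, True  # forgive the crash; auto-retry is allowed
    return True, False  # treat as a real failure; auto-retry is blocked


# Before the fix, "FAILED (" was in the strict list, so unittest's terminal
# "FAILED (errors=1)" summary flagged even purely transient crashes as real
# failures and blocked the retry.
```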
Test Failures in CI
Hello @stktyagi, the CI pipeline failed due to a test failure in …
Failure: …
Fix: …
Added WebDriverException to the transient failure error markers
Test Failures in CI
Hello @stktyagi, there are 3 test failures in the CI logs: …
@coderabbitai full review
✅ Actions performed
Full review triggered.
@coderabbitai resolve
✅ Actions performed
Comments resolved and changes approved.
nemesifier left a comment:
@stktyagi I am thinking.. we are trying to figure out which of these 3 cases we're in:
- tests failed due to flakiness but also include real test failures
- tests failed only due to flakiness
- tests failed only due to real test failures
I think the problem is mainly to discern between cases 1 and 2. Implementing this with text patterns and making it work across different programming languages (Python, Lua, Node.js, etc.) will be super tricky.
Maybe we can make it work for Python, but I am skeptical this will also work for the Node.js repos.
Are you sure this is the right approach?
Why don't we give all the context to the LLM and ask it to tell us if we are under case 1 or 2? The LLM should be good at this. We could do this in a separate API request, which has the sole goal of understanding if we are falling under case 1 or case 2, based on the result of the response we'll either restart the build or not, what do you think of this?
My main concern with that approach would be handling hallucinations: we need a deterministic answer to whether a CI failure is due to an actual test failure or to flakiness. Also, while Node.js and Python format tracebacks differently, the infrastructure crashes we want to forgive are mostly consistent across all CIs, and even when we encounter new ones, including them iteratively until we reach maximum coverage seems fine. It feels like a trade-off: the non-deterministic approach would be language-versatile but might hallucinate, whereas the current approach is deterministic but requires us to keep monitoring for new transient errors to include. It depends on which seems better to us.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/actions/bot-ci-failure/test_analyze_failure.py:
- Around lines 341-354: Add a negative regression test to ensure a unittest
summary "FAILED (errors=1)" does not cancel a real error classification when
there is a non-transient "ERROR:" traceback; create a new test (e.g.,
test_unittest_failed_with_real_error_blocks_retry) that builds a log string
containing a real "ERROR:"/traceback (no transient WebDriverException markers)
plus the "FAILED (errors=1)" summary, call process_error_logs and assert
tests_failed is True and transient_only is False; reference the existing
test_transient_ignores_unittest_failed_summary and the flags
has_strict_failure/is_transient in analyze_failure.py as the behavior you are
locking in.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 36263c46-7352-42ba-8d54-5612a5a4b122
📒 Files selected for processing (2)
- .github/actions/bot-ci-failure/analyze_failure.py
- .github/actions/bot-ci-failure/test_analyze_failure.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Python==3.13 | django~=5.1.0
- GitHub Check: Python==3.11 | django~=5.0.0
- GitHub Check: Python==3.10 | django~=5.2.0
- GitHub Check: Python==3.13 | django~=5.2.0
- GitHub Check: Python==3.12 | django~=5.0.0
- GitHub Check: Python==3.10 | django~=4.2.0
- GitHub Check: Python==3.11 | django~=5.2.0
- GitHub Check: Python==3.10 | django~=5.1.0
- GitHub Check: Python==3.12 | django~=5.2.0
- GitHub Check: Python==3.12 | django~=4.2.0
- GitHub Check: Python==3.11 | django~=5.1.0
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: openwisp/openwisp-utils PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-03-14T20:44:14.568Z
Learning: Bug Fixes: Ensure the test is deterministic and not flaky - flag tests that depend on timing, sleeps, specific timezones, system time, randomness without fixed seed, race conditions, concurrency timing, network access, external services, filesystem state, environment-specific configuration, execution order, shared global state, hardcoded ports, or unawaited async operations
Learnt from: pushpitkamboj
Repo: openwisp/openwisp-utils PR: 584
File: .github/workflows/reusable-bot-changelog.yml:49-49
Timestamp: 2026-03-05T09:38:10.320Z
Learning: In openwisp-utils, PR title prefixes are strictly limited to `[feature]`, `[fix]`, and `[change]` (exact bracketed tags, no scoping/sub-types). The regex `^\[(feature|fix|change)\]` in `.github/workflows/reusable-bot-changelog.yml` is intentional and correct — scoped variants like `[feature/bots]` are not valid and should not be matched.
Learnt from: CR
Repo: openwisp/openwisp-utils PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-03-14T20:44:14.568Z
Learning: Features: Add tests for new features and ensure coverage does not decrease significantly; prefer Selenium browser tests for UI-impacting features
Learnt from: CR
Repo: openwisp/openwisp-utils PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-03-14T20:44:14.568Z
Learning: Bug Fixes: If the bug affects the user interface, include a Selenium browser test; if missing, raise a warning
🔇 Additional comments (2)
.github/actions/bot-ci-failure/analyze_failure.py (2)
10-13: Removal of `"FAILED ("` from strict markers looks safe.
Genuine unittest failures still emit per-test `FAIL: <test>` or `ERROR: <test>` lines (caught by `STRICT_TEST_FAILURE_MARKERS`/`GENERIC_TEST_FAILURE_MARKERS`), and assertion bugs still surface via `AssertionError`. The terminal `FAILED (errors=N, failures=M)` line is purely a summary, so dropping it from the strict set unblocks the transient-retry path without weakening detection of real failures. The new regression test in `test_analyze_failure.py` pins this behavior.
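As a quick illustration of that point, here is a made-up unittest-style log (not taken from this CI run): a genuine failure still trips the per-test strict markers even with `FAILED (` removed.

```python
# Hypothetical genuine-failure log: the per-test "ERROR: <test>" line still
# matches the strict markers, while the terminal "FAILED (errors=1)" summary,
# which every failed unittest run prints, no longer does.
genuine_failure_log = (
    "ERROR: test_device_registration (tests.TestDevices)\n"
    "Traceback (most recent call last):\n"
    "AssertionError: expected 200, got 500\n"
    "----------------------------------------------------------------------\n"
    "Ran 367 tests in 311.148s\n\n"
    "FAILED (errors=1)\n"
)
assert "ERROR: " in genuine_failure_log  # strict marker still fires
```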
37-37: Adding base-class `WebDriverException` as a transient marker, accepting the false-positive trade-off.
`selenium.common.exceptions.WebDriverException` is the base class of most Selenium exceptions (`NoSuchElementException`, `ElementClickInterceptedException`, `TimeoutException`, …). In practice that's fine for substring matching, because Python tracebacks print the concrete subclass FQN, so this marker will only fire when the runtime raises `WebDriverException` directly (typical for browser/marionette crashes, `about:neterror`, session teardown).
However, it does mean any future code path or test helper that raises a bare `WebDriverException`, including legitimate test logic bugs, will now be classified as transient and will forgive co-located `ERROR:`/`Traceback` markers in the same job, potentially masking a real failure and triggering an auto-retry loop. This iterative pattern-matching trade-off has been accepted.
Minor note: `selenium.common.exceptions.InvalidSessionIdException` on line 36 is a `WebDriverException` subclass, but explicit subclass entries remain useful because tracebacks print the concrete subclass name.
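To make the substring-matching point concrete, here is a self-contained demo with stand-in exception classes; the real Selenium hierarchy behaves the same way, and its tracebacks additionally show the full `selenium.common.exceptions.<Subclass>` path:

```python
import traceback


# Stand-ins for the Selenium hierarchy; only the naming behaviour matters here.
class WebDriverException(Exception):
    pass


class NoSuchElementException(WebDriverException):
    pass


try:
    raise NoSuchElementException("element not found")
except WebDriverException:
    tb = traceback.format_exc()

# The traceback names the concrete subclass that was raised, not the base
# class it was caught as, so a "WebDriverException" marker stays silent here.
assert "NoSuchElementException" in tb
assert "WebDriverException" not in tb
```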
```python
def test_transient_ignores_unittest_failed_summary(self):
    """Ensure unittest's 'FAILED (errors=1)' summary does not override transient crashes."""
    content = (
        "===== JOB 5 =====\n"
        "Traceback (most recent call last):\n"
        "selenium.common.exceptions.WebDriverException: Message: Reached error page: about:neterror\n"
        "----------------------------------------------------------------------\n"
        "Ran 367 tests in 311.148s\n\n"
        "FAILED (errors=1)\n"
    )
    text, tests_failed, transient_only = process_error_logs(content)
    self.assertFalse(tests_failed)
    self.assertTrue(transient_only)
```
🧹 Nitpick | 🔵 Trivial
LGTM — regression test correctly pins the fix.
The scenario exercises exactly the buggy classification path: a transient `WebDriverException`, plus a generic `Traceback`, plus the unittest `FAILED (errors=1)` summary. With the marker changes in `analyze_failure.py`, `has_strict_failure` is False and `is_transient` is True, so the transient branch forgives the generic traceback and the unittest summary, yielding `tests_failed=False`, `transient_only=True` as asserted.
One small follow-up worth considering (optional): also add a negative test where `FAILED (errors=1)` appears alongside a real `ERROR:` traceback but no transient marker, asserting `tests_failed=True`. `test_pure_generic_bug_blocks_retry` covers the spirit of this, but a variant that explicitly contains the `FAILED (errors=N)` summary would lock in the contract that the summary line alone never flips the classification either way.
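For reference, a sketch of that suggested negative test; the test name and log content are hypothetical, following the shape of the existing regression test above:

```python
def test_unittest_failed_with_real_error_blocks_retry(self):
    """'FAILED (errors=1)' plus a real ERROR: traceback must still block retry."""
    content = (
        "===== JOB 5 =====\n"
        "ERROR: test_real_bug (tests.TestSomething)\n"
        "Traceback (most recent call last):\n"
        "ValueError: something genuinely broke\n"
        "----------------------------------------------------------------------\n"
        "Ran 367 tests in 311.148s\n\n"
        "FAILED (errors=1)\n"
    )
    text, tests_failed, transient_only = process_error_logs(content)
    self.assertTrue(tests_failed)
    self.assertFalse(transient_only)
```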
Checklist
Description of Changes
The presence of the 'FAILED' keyword in the strict failure markers blocked the re-run mechanism, because that string occurs in every kind of failing CI log, transient or not.