
fix: improve worker retry UX and suppress raw tracebacks (#908) #916

Open

PythonFZ wants to merge 8 commits into main from fix/908-worker-retry-ux

Conversation

@PythonFZ
Member

PythonFZ commented Apr 10, 2026

Summary

  • Add _OutageState dataclass to centralize outage tracking across _heartbeat_loop and _claim_loop (a minimal sketch follows this summary)
  • Retry logging now shows elapsed/max countdown: Server unreachable — retrying (30s/120s elapsed)
  • Final shutdown uses logger.error() instead of logger.exception() — no raw tracebacks for expected connection errors
  • Exponential backoff in _claim_loop during sustained outage (2s → 4s → 8s → 10s cap)
  • Rate-limited log output deduplicates messages across both loops

Closes #908
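
For orientation, here is a minimal sketch of the shape this implies. The class name, the four methods, the injectable clock, and the 120s/60s values come from this PR and its review thread; the exact field names, signatures, and locking details are assumptions, not the merged code:

```python
import threading
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class _OutageState:
    """Shared outage tracking for the heartbeat and claim loops (sketch)."""

    max_unreachable_seconds: float = 120.0
    clock: Callable[[], float] = time.monotonic  # injectable for deterministic tests
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)
    _outage_start: Optional[float] = None
    _last_log_time: float = float("-inf")

    def record_failure(self) -> float:
        """Mark a failed server contact; return seconds since the outage began."""
        with self._lock:
            now = self.clock()
            if self._outage_start is None:
                self._outage_start = now
                self._last_log_time = float("-inf")  # new outage: reset log throttle
            return now - self._outage_start

    def record_success(self) -> None:
        """Server responded — reset outage state."""
        with self._lock:
            self._outage_start = None

    def should_shutdown(self) -> bool:
        """True once the outage has lasted longer than the shutdown threshold."""
        with self._lock:
            if self._outage_start is None:
                return False
            return self.clock() - self._outage_start > self.max_unreachable_seconds

    def should_log(self, min_interval: float = 60.0) -> bool:
        """Rate-limit warnings so the two loops do not emit duplicate messages."""
        with self._lock:
            now = self.clock()
            if now - self._last_log_time >= min_interval:
                self._last_log_time = now
                return True
            return False
```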

Test plan

  • 9 unit tests for _OutageState (injectable clock, no server/threads needed; one is sketched after this list)
  • Integration test: heartbeat loop logs countdown format and exits cleanly
  • Integration test: claim loop shuts down with no raw tracebacks
  • Outage recovery test: state resets after server comes back
  • 318 existing tests pass (no regressions)
  • Pre-commit (ruff, ruff-format, codespell) all pass
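
One of those clock-injection tests might look roughly like this. A sketch against the _OutageState shape above, not the actual test file:

```python
def test_should_shutdown_after_threshold() -> None:
    now = [0.0]  # fake monotonic clock we can advance by hand
    state = _OutageState(max_unreachable_seconds=120.0, clock=lambda: now[0])

    state.record_failure()                # outage begins at t=0
    assert not state.should_shutdown()    # threshold not yet reached

    now[0] = 121.0                        # advance past the 120s threshold
    state.record_failure()
    assert state.should_shutdown()

    state.record_success()                # recovery clears the outage
    assert not state.should_shutdown()
```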

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Improved worker resilience with coordinated shutdown logic during server outages
    • Exponential backoff for retries when server is unreachable (capped at 10 seconds)
    • Cleaner outage logging without raw tracebacks and countdown-style warning messages
  • Documentation

    • Added design specification for worker shutdown and retry user experience
  • Tests

    • Comprehensive test coverage for outage tracking, recovery, and logging behavior

PythonFZ and others added 8 commits April 10, 2026 08:55

All commits co-authored by Claude Opus 4.6 (1M context) and Claude Sonnet 4.6. Notable commit messages:

  • Implements _OutageState dataclass in client.py for centralized outage tracking with injectable clock for deterministic testing. All 9 unit tests pass.
  • Replace old-style warning/exception logging in _claim_loop with _OutageState.record_failure()/.should_shutdown()/.should_log() and add exponential backoff (capped at 10s) matching the heartbeat loop.
@coderabbitai
coderabbitai Bot commented Apr 10, 2026

📝 Walkthrough

Introduces a shared _OutageState dataclass to centralize server-unreachability tracking across heartbeat and claim loops, replacing per-loop state tracking. Implements coordinated failure handling with shutdown thresholds, rate-limited logging with elapsed/max-time formatting, and exponential backoff during sustained outages. Includes comprehensive unit and integration tests.

Changes

Cohort / File(s): Summary

Core Outage State Implementation (src/zndraw_joblib/client.py):
Added the _OutageState dataclass for centralized failure/success tracking, elapsed-duration computation, shutdown thresholds, and rate-limited logging. Refactored _heartbeat_loop and _claim_loop to use a shared _outage instance via record_failure(), record_success(), should_shutdown(), and should_log(). Introduced exponential backoff in _claim_loop during outages, capped at 10s (see the sketch below). Removed the prior _last_server_contact, _contact_lock, and helper methods.

Test Suite (tests/zndraw_joblib/test_outage_state.py):
New test module with unit tests for the _OutageState lifecycle (elapsed time, shutdown threshold, rate-limited logging, no outage-start reset on repeated failures), plus integration tests verifying that _heartbeat_loop and _claim_loop emit the expected WARNING/ERROR logs without raw tracebacks during simulated connection failures and shut down cleanly.

Design & Planning Documentation (docs/superpowers/specs/2026-04-10-worker-retry-ux-design.md, docs/superpowers/plans/2026-04-10-worker-retry-ux.md):
Specification and planning documents defining the worker shutdown/retry UX design, state transitions, unified exception handling, the expected log format (e.g., "Server unreachable — retrying (Xs/Ys elapsed)"), and a comprehensive test plan.
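
The capped backoff in _claim_loop reduces to a one-liner. A sketch, assuming the delay simply doubles from 2s (the function name is hypothetical):

```python
def backoff_seconds(consecutive_failures: int, cap: float = 10.0) -> float:
    """Doubling backoff: 2s, 4s, 8s, then held at the cap."""
    return min(2.0 * 2 ** (consecutive_failures - 1), cap)

# 2s -> 4s -> 8s -> 10s cap, as described in the PR summary.
assert [backoff_seconds(n) for n in range(1, 6)] == [2.0, 4.0, 8.0, 10.0, 10.0]
```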

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant HB as _heartbeat_loop
    participant CL as _claim_loop
    participant OS as _OutageState
    participant Log as Logger

    rect rgba(200, 150, 255, 0.5)
    Note over HB,CL: Connection Established
    HB->>OS: record_success()
    OS-->>HB: ✓
    CL->>OS: record_success()
    OS-->>CL: ✓
    end

    rect rgba(255, 100, 100, 0.5)
    Note over HB,CL: Server Goes Down
    HB->>OS: record_failure()
    OS-->>HB: outage started
    CL->>OS: record_failure()
    OS-->>CL: outage shared
    end

    rect rgba(255, 180, 100, 0.5)
    Note over HB,CL: Ongoing Outage (rate-limited logging)
    loop Every ~30s (HB) / ~2s backoff (CL)
        HB->>OS: should_log(min_interval=60)
        alt First call or interval elapsed
            OS-->>HB: True
            HB->>Log: warning("Server unreachable — retrying (30s/120s elapsed)")
        else Recent log already sent
            OS-->>HB: False
            Note over HB: Skip redundant log
        end
        CL->>OS: should_log(min_interval=60)
        OS-->>CL: True/False (rate-limited)
    end
    end

    rect rgba(100, 150, 255, 0.5)
    Note over HB,CL: Threshold Exceeded
    HB->>OS: should_shutdown()
    OS-->>HB: True (elapsed > 120s)
    HB->>Log: error("Server unreachable for >120s, shutting down")
    HB->>HB: _stop.set()
    CL->>OS: should_shutdown()
    OS-->>CL: True
    CL->>Log: error("Server unreachable for >120s, shutting down")
    CL->>CL: _stop.set()
    end
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

The changes introduce a new foundational class with moderate logic density (state transitions, clock-based timing, thread-safe locking), refactor two related loop handlers following a consistent pattern, and include a comprehensive test suite validating both unit-level and integration-level behavior. While the pattern is somewhat repetitive across the two loops, the new _OutageState logic and its interactions with should_shutdown() / should_log() require careful reasoning about timing semantics and thread safety.

Poem

🐰 Hops through the outage with pride so tall,
No more raw tracebacks to haunt us all!
With _OutageState keeping time fair and true,
Backoff and rate-limits make logging less blue—
Coordination, countdown, and clean exits shine,
Worker resilience: now that's divine!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 48.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title clearly summarizes the main changes: improving worker retry UX and suppressing raw tracebacks, the PR's primary objectives.
  • Linked Issues Check ✅ Passed: All acceptance criteria from issue #908 are met: retry logging with an elapsed/max countdown, logger.error() instead of logger.exception(), exponential backoff, no raw tracebacks, and comprehensive unit/integration tests.
  • Out of Scope Changes Check ✅ Passed: All changes align with issue #908: the _OutageState implementation, retry UX improvements, backoff logic, and logging refactoring. No unrelated modifications detected.




@coderabbitai coderabbitai Bot left a comment
Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/superpowers/specs/2026-04-10-worker-retry-ux-design.md`:
- Around line 55-60: Add a language identifier to the fenced code block that
shows retry logs in the worker retry UX spec so it won’t trigger MD040; edit the
triple-backtick that opens the block (the block containing the three [WARNING]
lines and the [ERROR] line) and change it to include a language such as text or
log (e.g., ```text) so the markdown linter accepts the snippet.

In `@src/zndraw_joblib/client.py`:
- Around line 139-142: The outage recovery path (record_success) clears
_outage_start but leaves _last_log_time, so a subsequent new outage can inherit
the previous rate-limit; update record_success to also reset _last_log_time
(e.g. set it to None) when clearing an outage, and likewise when you begin a new
outage (the method that sets _outage_start in the 155-162 block, e.g.,
record_failure/_begin_outage) initialize or clear _last_log_time so the
log-throttle is reset and the first retry warning after a recovery is not
suppressed.
- Around line 751-758: The shutdown path in the heartbeat handler calls
self._stop.set() but doesn't wake the blocked claim loop, so _claim_loop() can
remain sleeping on self._task_ready.wait(timeout=wait) and delay teardown; after
calling self._stop.set() in the block where self._outage.should_shutdown() is
true (the logger.error / self._stop.set() branch), also notify the claim loop by
triggering the task-ready event (e.g. call self._task_ready.set() or equivalent)
so _claim_loop() wakes immediately and can exit; locate the shutdown branch
using _outage.should_shutdown(), logger.error, and self._stop.set() to add the
wakeup call.

In `@tests/zndraw_joblib/test_outage_state.py`:
- Around line 120-123: The test starts the worker thread with
threading.Thread(target=manager._heartbeat_loop, daemon=True) and calls
t.join(timeout=5.0), but join with a timeout doesn't fail if the thread is still
running; update the test to explicitly assert the thread exited by adding an
assertion like assert not t.is_alive() after the join to ensure
manager._heartbeat_loop terminated cleanly; apply the same change for the other
occurrence that uses t.join(timeout=5.0) (lines around the second case) so both
tests validate the thread actually stopped.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eb9a0b06-bb67-4195-972f-d5f3ee339fff

📥 Commits

Reviewing files that changed from the base of the PR and between bb2198a and daf1de7.

📒 Files selected for processing (4)
  • docs/superpowers/plans/2026-04-10-worker-retry-ux.md
  • docs/superpowers/specs/2026-04-10-worker-retry-ux-design.md
  • src/zndraw_joblib/client.py
  • tests/zndraw_joblib/test_outage_state.py

Comment on lines +55 to +60
```
[WARNING] Server unreachable — retrying (5s/120s elapsed)
[WARNING] Server unreachable — retrying (15s/120s elapsed)
[WARNING] Server unreachable — retrying (30s/120s elapsed)
[ERROR] Server unreachable for >120s, shutting down. Last error: Connection refused
```

⚠️ Potential issue | 🟡 Minor

Add a language to this fenced block.

This trips MD040 and can fail markdown lint in docs-only changes. text or log would both work here.
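
For instance, the corrected opening fence would look like this (shown inside an outer four-backtick fence so the inner fence renders literally):

````markdown
```text
[WARNING] Server unreachable — retrying (5s/120s elapsed)
```
````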

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 55-55: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


Comment on lines +139 to +142
```python
def record_success(self) -> None:
    """Server responded — reset outage state."""
    with self._lock:
        self._outage_start = None
```

⚠️ Potential issue | 🟡 Minor

Reset the log throttle when the outage clears.

record_success() clears _outage_start, but it leaves _last_log_time intact. If the server briefly recovers and then drops again within the 5s window, the new outage inherits the previous rate limit and its first retry warning is suppressed.

💡 Suggested fix

```diff
     def record_success(self) -> None:
         """Server responded — reset outage state."""
         with self._lock:
             self._outage_start = None
+            self._last_log_time = float("-inf")
```

Also applies to: 155-162
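
If adopted, a small regression test would lock the behavior in. A sketch reusing the fake-clock pattern (names assumed, matching the snippet above):

```python
def test_new_outage_resets_log_throttle() -> None:
    now = [0.0]
    state = _OutageState(clock=lambda: now[0])

    state.record_failure()
    assert state.should_log(min_interval=60.0)    # first warning fires

    now[0] = 10.0
    state.record_success()                        # brief recovery

    now[0] = 15.0                                 # still inside the 60s window
    state.record_failure()                        # new outage begins
    assert state.should_log(min_interval=60.0)    # must not be suppressed
```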


Comment on lines +751 to 758
```python
if self._outage.should_shutdown():
    logger.error(  # noqa: TRY400 — intentionally no traceback
        "Server unreachable for >%ss, shutting down. Last error: %s",
        self._max_unreachable_seconds,
        e,
    )
    self._stop.set()
    return
```

⚠️ Potential issue | 🟠 Major

Wake the claim loop when heartbeat forces shutdown.

This path only sets _stop. During an outage, _claim_loop() is blocked on _task_ready.wait(timeout=wait), so it can stay asleep for the full backoff window before teardown completes. With the new 10s cap, that delay becomes user-visible.

💡 Suggested fix

```diff
                 if self._outage.should_shutdown():
                     logger.error(  # noqa: TRY400 — intentionally no traceback
                         "Server unreachable for >%ss, shutting down. Last error: %s",
                         self._max_unreachable_seconds,
                         e,
                     )
                     self._stop.set()
+                    self._task_ready.set()
                     return
```
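
To see why the extra set() matters, here is the wait/wake pattern in miniature. A self-contained sketch, not the code in client.py:

```python
import threading

stop = threading.Event()
task_ready = threading.Event()

def claim_loop(wait: float = 10.0) -> None:
    while not stop.is_set():
        # During an outage this blocks for up to the full backoff window
        # (10s at the cap) unless task_ready is set from another thread.
        task_ready.wait(timeout=wait)
        task_ready.clear()

t = threading.Thread(target=claim_loop, daemon=True)
t.start()

# Heartbeat-side shutdown: stop alone leaves the loop sleeping until the
# timeout expires; setting task_ready as well wakes it immediately.
stop.set()
task_ready.set()
t.join(timeout=1.0)
assert not t.is_alive()
```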

Comment on lines +120 to +123
```python
t = threading.Thread(target=manager._heartbeat_loop, daemon=True)
t.start()
t.join(timeout=5.0)
```


⚠️ Potential issue | 🟡 Minor

Assert that the worker thread actually exited.

join(timeout=5.0) is non-failing by itself, so these tests still pass if the loop hangs after producing the expected log lines. An explicit assert not t.is_alive() would lock in the clean-exit behavior this PR is targeting.

💡 Suggested fix

```diff
             t = threading.Thread(target=manager._heartbeat_loop, daemon=True)
             t.start()
             t.join(timeout=5.0)
+            assert not t.is_alive()
             t = threading.Thread(target=manager._claim_loop, daemon=True)
             t.start()
             t.join(timeout=5.0)
+            assert not t.is_alive()
```

Also applies to: 153-155



Development

Successfully merging this pull request may close these issues:

Worker shutdown on server-down: improve retry UX and suppress raw tracebacks

1 participant