
fix: improve worker retry UX and suppress raw tracebacks (#908) #916

Open

PythonFZ wants to merge 8 commits into main from fix/908-worker-retry-ux

Conversation

@PythonFZ
Member

PythonFZ commented Apr 10, 2026

Summary

  • Add _OutageState dataclass to centralize outage tracking across _heartbeat_loop and _claim_loop (a minimal sketch follows this summary)
  • Retry logging now shows elapsed/max countdown: Server unreachable — retrying (30s/120s elapsed)
  • Final shutdown uses logger.error() instead of logger.exception() — no raw tracebacks for expected connection errors
  • Exponential backoff in _claim_loop during sustained outage (2s → 4s → 8s → 10s cap)
  • Rate-limited log output deduplicates messages across both loops

Closes #908
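
For orientation, here is a minimal sketch of the shape this implies. The class name, the four methods, the injectable clock, and the 120s/60s values come from this PR and its review thread; the exact field names, signatures, and locking details are assumptions, not the merged code:

```python
import threading
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class _OutageState:
    """Shared outage tracking for the heartbeat and claim loops (sketch)."""

    max_unreachable_seconds: float = 120.0
    clock: Callable[[], float] = time.monotonic  # injectable for deterministic tests
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)
    _outage_start: Optional[float] = None
    _last_log_time: float = float("-inf")

    def record_failure(self) -> float:
        """Mark a failed server contact; return seconds since the outage began."""
        with self._lock:
            now = self.clock()
            if self._outage_start is None:
                self._outage_start = now
                self._last_log_time = float("-inf")  # new outage: reset log throttle
            return now - self._outage_start

    def record_success(self) -> None:
        """Server responded — reset outage state."""
        with self._lock:
            self._outage_start = None

    def should_shutdown(self) -> bool:
        """True once the outage has lasted longer than the shutdown threshold."""
        with self._lock:
            if self._outage_start is None:
                return False
            return self.clock() - self._outage_start > self.max_unreachable_seconds

    def should_log(self, min_interval: float = 60.0) -> bool:
        """Rate-limit warnings so the two loops do not emit duplicate messages."""
        with self._lock:
            now = self.clock()
            if now - self._last_log_time >= min_interval:
                self._last_log_time = now
                return True
            return False
```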

Test plan

  • 9 unit tests for _OutageState (injectable clock, no server/threads needed; one is sketched after this list)
  • Integration test: heartbeat loop logs countdown format and exits cleanly
  • Integration test: claim loop shuts down with no raw tracebacks
  • Outage recovery test: state resets after server comes back
  • 318 existing tests pass (no regressions)
  • Pre-commit (ruff, ruff-format, codespell) all pass
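
One of those clock-injection tests might look roughly like this. A sketch against the _OutageState shape above, not the actual test file:

```python
def test_should_shutdown_after_threshold() -> None:
    now = [0.0]  # fake monotonic clock we can advance by hand
    state = _OutageState(max_unreachable_seconds=120.0, clock=lambda: now[0])

    state.record_failure()                # outage begins at t=0
    assert not state.should_shutdown()    # threshold not yet reached

    now[0] = 121.0                        # advance past the 120s threshold
    state.record_failure()
    assert state.should_shutdown()

    state.record_success()                # recovery clears the outage
    assert not state.should_shutdown()
```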

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Improved worker resilience with coordinated shutdown logic during server outages
    • Exponential backoff for retries when server is unreachable (capped at 10 seconds)
    • Cleaner outage logging without raw tracebacks and countdown-style warning messages
  • Documentation

    • Added design specification for worker shutdown and retry user experience
  • Tests

    • Comprehensive test coverage for outage tracking, recovery, and logging behavior

PythonFZ and others added 8 commits April 10, 2026 08:55

All commits co-authored by Claude Opus 4.6 (1M context) and Claude Sonnet 4.6. Notable commit messages:

  • Implements _OutageState dataclass in client.py for centralized outage tracking with injectable clock for deterministic testing. All 9 unit tests pass.
  • Replace old-style warning/exception logging in _claim_loop with _OutageState.record_failure()/.should_shutdown()/.should_log() and add exponential backoff (capped at 10s) matching the heartbeat loop.
@coderabbitai
coderabbitai Bot commented Apr 10, 2026

📝 Walkthrough

Introduces a shared _OutageState dataclass to centralize server-unreachability tracking across heartbeat and claim loops, replacing per-loop state tracking. Implements coordinated failure handling with shutdown thresholds, rate-limited logging with elapsed/max-time formatting, and exponential backoff during sustained outages. Includes comprehensive unit and integration tests.

Changes

Cohort / File(s): Summary

Core Outage State Implementation (src/zndraw_joblib/client.py):
Added the _OutageState dataclass for centralized failure/success tracking, elapsed-duration computation, shutdown thresholds, and rate-limited logging. Refactored _heartbeat_loop and _claim_loop to use a shared _outage instance via record_failure(), record_success(), should_shutdown(), and should_log(). Introduced exponential backoff in _claim_loop during outages, capped at 10s (see the sketch below). Removed the prior _last_server_contact, _contact_lock, and helper methods.

Test Suite (tests/zndraw_joblib/test_outage_state.py):
New test module with unit tests for the _OutageState lifecycle (elapsed time, shutdown threshold, rate-limited logging, no outage-start reset on repeated failures), plus integration tests verifying that _heartbeat_loop and _claim_loop emit the expected WARNING/ERROR logs without raw tracebacks during simulated connection failures and shut down cleanly.

Design & Planning Documentation (docs/superpowers/specs/2026-04-10-worker-retry-ux-design.md, docs/superpowers/plans/2026-04-10-worker-retry-ux.md):
Specification and planning documents defining the worker shutdown/retry UX design, state transitions, unified exception handling, the expected log format (e.g., "Server unreachable — retrying (Xs/Ys elapsed)"), and a comprehensive test plan.
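
The capped backoff in _claim_loop reduces to a one-liner. A sketch, assuming the delay simply doubles from 2s (the function name is hypothetical):

```python
def backoff_seconds(consecutive_failures: int, cap: float = 10.0) -> float:
    """Doubling backoff: 2s, 4s, 8s, then held at the cap."""
    return min(2.0 * 2 ** (consecutive_failures - 1), cap)

# 2s -> 4s -> 8s -> 10s cap, as described in the PR summary.
assert [backoff_seconds(n) for n in range(1, 6)] == [2.0, 4.0, 8.0, 10.0, 10.0]
```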

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant HB as _heartbeat_loop
    participant CL as _claim_loop
    participant OS as _OutageState
    participant Log as Logger

    rect rgba(200, 150, 255, 0.5)
    Note over HB,CL: Connection Established
    HB->>OS: record_success()
    OS-->>HB: ✓
    CL->>OS: record_success()
    OS-->>CL: ✓
    end

    rect rgba(255, 100, 100, 0.5)
    Note over HB,CL: Server Goes Down
    HB->>OS: record_failure()
    OS-->>HB: outage started
    CL->>OS: record_failure()
    OS-->>CL: outage shared
    end

    rect rgba(255, 180, 100, 0.5)
    Note over HB,CL: Ongoing Outage (rate-limited logging)
    loop Every ~30s (HB) / ~2s backoff (CL)
        HB->>OS: should_log(min_interval=60)
        alt First call or interval elapsed
            OS-->>HB: True
            HB->>Log: warning("Server unreachable — retrying (30s/120s elapsed)")
        else Recent log already sent
            OS-->>HB: False
            Note over HB: Skip redundant log
        end
        CL->>OS: should_log(min_interval=60)
        OS-->>CL: True/False (rate-limited)
    end
    end

    rect rgba(100, 150, 255, 0.5)
    Note over HB,CL: Threshold Exceeded
    HB->>OS: should_shutdown()
    OS-->>HB: True (elapsed > 120s)
    HB->>Log: error("Server unreachable for >120s, shutting down")
    HB->>HB: _stop.set()
    CL->>OS: should_shutdown()
    OS-->>CL: True
    CL->>Log: error("Server unreachable for >120s, shutting down")
    CL->>CL: _stop.set()
    end
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

The changes introduce a new foundational class with moderate logic density (state transitions, clock-based timing, thread-safe locking), refactor two related loop handlers following a consistent pattern, and include a comprehensive test suite validating both unit-level and integration-level behavior. While the pattern is somewhat repetitive across the two loops, the new _OutageState logic and its interactions with should_shutdown() / should_log() require careful reasoning about timing semantics and thread safety.

Poem

🐰 Hops through the outage with pride so tall,
No more raw tracebacks to haunt us all!
With _OutageState keeping time fair and true,
Backoff and rate-limits make logging less blue—
Coordination, countdown, and clean exits shine,
Worker resilience: now that's divine!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 48.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title clearly summarizes the main changes: improving worker retry UX and suppressing raw tracebacks, the PR's primary objectives.
  • Linked Issues Check ✅ Passed: All acceptance criteria from issue #908 are met: retry logging with an elapsed/max countdown, logger.error() instead of logger.exception(), exponential backoff, no raw tracebacks, and comprehensive unit/integration tests.
  • Out of Scope Changes Check ✅ Passed: All changes align with issue #908: the _OutageState implementation, retry UX improvements, backoff logic, and logging refactoring. No unrelated modifications detected.




@coderabbitai coderabbitai Bot left a comment
Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/superpowers/specs/2026-04-10-worker-retry-ux-design.md`:
- Around line 55-60: Add a language identifier to the fenced code block that
shows retry logs in the worker retry UX spec so it won’t trigger MD040; edit the
triple-backtick that opens the block (the block containing the three [WARNING]
lines and the [ERROR] line) and change it to include a language such as text or
log (e.g., ```text) so the markdown linter accepts the snippet.

In `@src/zndraw_joblib/client.py`:
- Around line 139-142: The outage recovery path (record_success) clears
_outage_start but leaves _last_log_time, so a subsequent new outage can inherit
the previous rate-limit; update record_success to also reset _last_log_time
(e.g. set it to None) when clearing an outage, and likewise when you begin a new
outage (the method that sets _outage_start in the 155-162 block, e.g.,
record_failure/_begin_outage) initialize or clear _last_log_time so the
log-throttle is reset and the first retry warning after a recovery is not
suppressed.
- Around line 751-758: The shutdown path in the heartbeat handler calls
self._stop.set() but doesn't wake the blocked claim loop, so _claim_loop() can
remain sleeping on self._task_ready.wait(timeout=wait) and delay teardown; after
calling self._stop.set() in the block where self._outage.should_shutdown() is
true (the logger.error / self._stop.set() branch), also notify the claim loop by
triggering the task-ready event (e.g. call self._task_ready.set() or equivalent)
so _claim_loop() wakes immediately and can exit; locate the shutdown branch
using _outage.should_shutdown(), logger.error, and self._stop.set() to add the
wakeup call.

In `@tests/zndraw_joblib/test_outage_state.py`:
- Around line 120-123: The test starts the worker thread with
threading.Thread(target=manager._heartbeat_loop, daemon=True) and calls
t.join(timeout=5.0), but join with a timeout doesn't fail if the thread is still
running; update the test to explicitly assert the thread exited by adding an
assertion like assert not t.is_alive() after the join to ensure
manager._heartbeat_loop terminated cleanly; apply the same change for the other
occurrence that uses t.join(timeout=5.0) (lines around the second case) so both
tests validate the thread actually stopped.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eb9a0b06-bb67-4195-972f-d5f3ee339fff

📥 Commits

Reviewing files that changed from the base of the PR and between bb2198a and daf1de7.

📒 Files selected for processing (4)
  • docs/superpowers/plans/2026-04-10-worker-retry-ux.md
  • docs/superpowers/specs/2026-04-10-worker-retry-ux-design.md
  • src/zndraw_joblib/client.py
  • tests/zndraw_joblib/test_outage_state.py

Comment on lines +55 to +60
```
[WARNING] Server unreachable — retrying (5s/120s elapsed)
[WARNING] Server unreachable — retrying (15s/120s elapsed)
[WARNING] Server unreachable — retrying (30s/120s elapsed)
[ERROR] Server unreachable for >120s, shutting down. Last error: Connection refused
```

⚠️ Potential issue | 🟡 Minor

Add a language to this fenced block.

This trips MD040 and can fail markdown lint in docs-only changes. text or log would both work here.
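
For instance, the corrected opening fence would look like this (shown inside an outer four-backtick fence so the inner fence renders literally):

````markdown
```text
[WARNING] Server unreachable — retrying (5s/120s elapsed)
```
````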

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 55-55: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


Comment on lines +139 to +142
```python
def record_success(self) -> None:
    """Server responded — reset outage state."""
    with self._lock:
        self._outage_start = None
```

⚠️ Potential issue | 🟡 Minor

Reset the log throttle when the outage clears.

record_success() clears _outage_start, but it leaves _last_log_time intact. If the server briefly recovers and then drops again within the 5s window, the new outage inherits the previous rate limit and its first retry warning is suppressed.

💡 Suggested fix

```diff
     def record_success(self) -> None:
         """Server responded — reset outage state."""
         with self._lock:
             self._outage_start = None
+            self._last_log_time = float("-inf")
```

Also applies to: 155-162
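
If adopted, a small regression test would lock the behavior in. A sketch reusing the fake-clock pattern (names assumed, matching the snippet above):

```python
def test_new_outage_resets_log_throttle() -> None:
    now = [0.0]
    state = _OutageState(clock=lambda: now[0])

    state.record_failure()
    assert state.should_log(min_interval=60.0)    # first warning fires

    now[0] = 10.0
    state.record_success()                        # brief recovery

    now[0] = 15.0                                 # still inside the 60s window
    state.record_failure()                        # new outage begins
    assert state.should_log(min_interval=60.0)    # must not be suppressed
```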


Comment on lines +751 to 758
```python
if self._outage.should_shutdown():
    logger.error(  # noqa: TRY400 — intentionally no traceback
        "Server unreachable for >%ss, shutting down. Last error: %s",
        self._max_unreachable_seconds,
        e,
    )
    self._stop.set()
    return
```

⚠️ Potential issue | 🟠 Major

Wake the claim loop when heartbeat forces shutdown.

This path only sets _stop. During an outage, _claim_loop() is blocked on _task_ready.wait(timeout=wait), so it can stay asleep for the full backoff window before teardown completes. With the new 10s cap, that delay becomes user-visible.

💡 Suggested fix

```diff
                 if self._outage.should_shutdown():
                     logger.error(  # noqa: TRY400 — intentionally no traceback
                         "Server unreachable for >%ss, shutting down. Last error: %s",
                         self._max_unreachable_seconds,
                         e,
                     )
                     self._stop.set()
+                    self._task_ready.set()
                     return
```
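
To see why the extra set() matters, here is the wait/wake pattern in miniature. A self-contained sketch, not the code in client.py:

```python
import threading

stop = threading.Event()
task_ready = threading.Event()

def claim_loop(wait: float = 10.0) -> None:
    while not stop.is_set():
        # During an outage this blocks for up to the full backoff window
        # (10s at the cap) unless task_ready is set from another thread.
        task_ready.wait(timeout=wait)
        task_ready.clear()

t = threading.Thread(target=claim_loop, daemon=True)
t.start()

# Heartbeat-side shutdown: stop alone leaves the loop sleeping until the
# timeout expires; setting task_ready as well wakes it immediately.
stop.set()
task_ready.set()
t.join(timeout=1.0)
assert not t.is_alive()
```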

Comment on lines +120 to +123
```python
t = threading.Thread(target=manager._heartbeat_loop, daemon=True)
t.start()
t.join(timeout=5.0)
```


⚠️ Potential issue | 🟡 Minor

Assert that the worker thread actually exited.

join(timeout=5.0) is non-failing by itself, so these tests still pass if the loop hangs after producing the expected log lines. An explicit assert not t.is_alive() would lock in the clean-exit behavior this PR is targeting.

💡 Suggested fix

```diff
             t = threading.Thread(target=manager._heartbeat_loop, daemon=True)
             t.start()
             t.join(timeout=5.0)
+            assert not t.is_alive()
             t = threading.Thread(target=manager._claim_loop, daemon=True)
             t.start()
             t.join(timeout=5.0)
+            assert not t.is_alive()
```

Also applies to: 153-155



Development

Successfully merging this pull request may close these issues:

Worker shutdown on server-down: improve retry UX and suppress raw tracebacks

1 participant