fix: improve worker retry UX and suppress raw tracebacks (#908)#916
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements _OutageState dataclass in client.py for centralized outage tracking with injectable clock for deterministic testing. All 9 unit tests pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace old-style warning/exception logging in _claim_loop with _OutageState.record_failure()/.should_shutdown()/.should_log() and add exponential backoff (capped at 10s) matching the heartbeat loop. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
📝 Walkthrough: Introduces a shared `_OutageState` that centralizes outage tracking across the heartbeat and claim loops.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant HB as _heartbeat_loop
    participant CL as _claim_loop
    participant OS as _OutageState
    participant Log as Logger
    rect rgba(200, 150, 255, 0.5)
        Note over HB,CL: Connection Established
        HB->>OS: record_success()
        OS-->>HB: ✓
        CL->>OS: record_success()
        OS-->>CL: ✓
    end
    rect rgba(255, 100, 100, 0.5)
        Note over HB,CL: Server Goes Down
        HB->>OS: record_failure()
        OS-->>HB: outage started
        CL->>OS: record_failure()
        OS-->>CL: outage shared
    end
    rect rgba(255, 180, 100, 0.5)
        Note over HB,CL: Ongoing Outage (rate-limited logging)
        loop Every ~30s (HB) / ~2s backoff (CL)
            HB->>OS: should_log(min_interval=60)
            alt First call or interval elapsed
                OS-->>HB: True
                HB->>Log: warning("Server unreachable — retrying (30s/120s elapsed)")
            else Recent log already sent
                OS-->>HB: False
                Note over HB: Skip redundant log
            end
            CL->>OS: should_log(min_interval=60)
            OS-->>CL: True/False (rate-limited)
        end
    end
    rect rgba(100, 150, 255, 0.5)
        Note over HB,CL: Threshold Exceeded
        HB->>OS: should_shutdown()
        OS-->>HB: True (elapsed > 120s)
        HB->>Log: error("Server unreachable for >120s, shutting down")
        HB->>HB: _stop.set()
        CL->>OS: should_shutdown()
        OS-->>CL: True
        CL->>Log: error("Server unreachable for >120s, shutting down")
        CL->>CL: _stop.set()
    end
```
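The claim loop's backoff during an outage (roughly 2s doubling up to a 10s cap) can be sketched as follows. The function and parameter names here are placeholders, not the PR's actual API; only the backoff shape (2s → 4s → 8s → 10s cap) comes from this PR.

```python
import threading

def claim_loop_backoff(
    stop: "threading.Event",
    try_claim,
    base: float = 2.0,
    cap: float = 10.0,
) -> None:
    """Retry `try_claim` with exponential backoff capped at `cap` seconds."""
    delay = base
    while not stop.is_set():
        try:
            try_claim()
            delay = base  # a successful claim resets the backoff
        except ConnectionError:
            # Interruptible sleep: 2s -> 4s -> 8s -> 10s -> 10s ...
            stop.wait(timeout=delay)
            delay = min(delay * 2, cap)
```

Sleeping via `stop.wait()` rather than `time.sleep()` is what lets a shutdown signal interrupt the backoff window, which the review comment on the heartbeat shutdown path below relies on.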
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes. The changes introduce a new foundational class with moderate logic density (state transitions, clock-based timing, thread-safe locking), refactor two related loop handlers following a consistent pattern, and include a comprehensive test suite validating both unit-level and integration-level behavior, though the pattern is somewhat repetitive across the two loops.
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/superpowers/specs/2026-04-10-worker-retry-ux-design.md`:
- Around line 55-60: Add a language identifier to the fenced code block that
shows retry logs in the worker retry UX spec so it won’t trigger MD040; edit the
triple-backtick that opens the block (the block containing the three [WARNING]
lines and the [ERROR] line) and change it to include a language such as text or
log (e.g., ```text) so the markdown linter accepts the snippet.
In `@src/zndraw_joblib/client.py`:
- Around line 139-142: The outage recovery path (record_success) clears
_outage_start but leaves _last_log_time, so a subsequent new outage can inherit
the previous rate-limit; update record_success to also reset _last_log_time
(e.g. set it to None) when clearing an outage, and likewise when you begin a new
outage (the method that sets _outage_start in the 155-162 block, e.g.,
record_failure/_begin_outage) initialize or clear _last_log_time so the
log-throttle is reset and the first retry warning after a recovery is not
suppressed.
- Around line 751-758: The shutdown path in the heartbeat handler calls
self._stop.set() but doesn't wake the blocked claim loop, so _claim_loop() can
remain sleeping on self._task_ready.wait(timeout=wait) and delay teardown; after
calling self._stop.set() in the block where self._outage.should_shutdown() is
true (the logger.error / self._stop.set() branch), also notify the claim loop by
triggering the task-ready event (e.g. call self._task_ready.set() or equivalent)
so _claim_loop() wakes immediately and can exit; locate the shutdown branch
using _outage.should_shutdown(), logger.error, and self._stop.set() to add the
wakeup call.
In `@tests/zndraw_joblib/test_outage_state.py`:
- Around line 120-123: The test starts the worker thread with
threading.Thread(target=manager._heartbeat_loop, daemon=True) and calls
t.join(timeout=5.0), but join with a timeout doesn't fail if the thread is still
running; update the test to explicitly assert the thread exited by adding an
assertion like assert not t.is_alive() after the join to ensure
manager._heartbeat_loop terminated cleanly; apply the same change for the other
occurrence that uses t.join(timeout=5.0) (lines around the second case) so both
tests validate the thread actually stopped.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: eb9a0b06-bb67-4195-972f-d5f3ee339fff
📒 Files selected for processing (4)
- docs/superpowers/plans/2026-04-10-worker-retry-ux.md
- docs/superpowers/specs/2026-04-10-worker-retry-ux-design.md
- src/zndraw_joblib/client.py
- tests/zndraw_joblib/test_outage_state.py
```
[WARNING] Server unreachable — retrying (5s/120s elapsed)
[WARNING] Server unreachable — retrying (15s/120s elapsed)
[WARNING] Server unreachable — retrying (30s/120s elapsed)
[ERROR] Server unreachable for >120s, shutting down. Last error: Connection refused
```
Add a language to this fenced block.
This trips MD040 and can fail markdown lint in docs-only changes. text or log would both work here.
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 55-55: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
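For instance, the flagged block satisfies MD040 once its opening fence carries a language tag (shown here with four-backtick outer fencing so the inner fence renders literally):

````markdown
```text
[WARNING] Server unreachable — retrying (5s/120s elapsed)
```
````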
```python
def record_success(self) -> None:
    """Server responded — reset outage state."""
    with self._lock:
        self._outage_start = None
```
Reset the log throttle when the outage clears.
record_success() clears _outage_start, but it leaves _last_log_time intact. If the server briefly recovers and then drops again within the 5s window, the new outage inherits the previous rate limit and its first retry warning is suppressed.
💡 Suggested fix

```diff
 def record_success(self) -> None:
     """Server responded — reset outage state."""
     with self._lock:
         self._outage_start = None
+        self._last_log_time = float("-inf")
```

Also applies to: 155-162
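The flap scenario the reviewer describes can be pinned down with a tiny model of the throttle. The class and method names here are hypothetical; only the reset line mirrors the suggested fix.

```python
import time

class LogThrottle:
    """Minimal model of the retry-warning throttle (illustrative)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_log_time = float("-inf")

    def record_success(self) -> None:
        # The suggested fix: recovery resets the throttle so the next
        # outage's first warning is never suppressed.
        self._last_log_time = float("-inf")

    def should_log(self, min_interval: float = 5.0) -> bool:
        now = self._clock()
        if now - self._last_log_time >= min_interval:
            self._last_log_time = now
            return True
        return False
```

Without the reset in `record_success()`, a recovery followed by a new outage inside the rate-limit window would silently swallow the first warning.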
```python
if self._outage.should_shutdown():
    logger.error(  # noqa: TRY400 — intentionally no traceback
        "Server unreachable for >%ss, shutting down. Last error: %s",
        self._max_unreachable_seconds,
        e,
    )
    self._stop.set()
    return
```
Wake the claim loop when heartbeat forces shutdown.
This path only sets _stop. During an outage, _claim_loop() is blocked on _task_ready.wait(timeout=wait), so it can stay asleep for the full backoff window before teardown completes. With the new 10s cap, that delay becomes user-visible.
💡 Suggested fix

```diff
 if self._outage.should_shutdown():
     logger.error(  # noqa: TRY400 — intentionally no traceback
         "Server unreachable for >%ss, shutting down. Last error: %s",
         self._max_unreachable_seconds,
         e,
     )
     self._stop.set()
+    self._task_ready.set()
     return
```
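The effect of the suggested wakeup can be demonstrated with a standalone event pair. The names mirror the review comment (`stop`, `task_ready`), but this is a self-contained simulation, not the PR's code.

```python
import threading
import time

stop = threading.Event()
task_ready = threading.Event()
exited_at = []

def claim_loop() -> None:
    # Simulates _claim_loop sleeping through a long backoff window.
    while not stop.is_set():
        task_ready.wait(timeout=10.0)  # without a wakeup, up to 10s per pass
        task_ready.clear()
    exited_at.append(time.monotonic())

worker = threading.Thread(target=claim_loop)
worker.start()
shutdown_started = time.monotonic()
stop.set()
task_ready.set()  # the fix: wake the sleeper so it sees stop immediately
worker.join(timeout=5.0)
```

Setting only `stop` would leave the loop parked inside `wait(timeout=10.0)` for up to the full backoff window; setting `task_ready` as well makes the wait return at once.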
```python
t = threading.Thread(target=manager._heartbeat_loop, daemon=True)
t.start()
t.join(timeout=5.0)
```
Assert that the worker thread actually exited.
join(timeout=5.0) is non-failing by itself, so these tests still pass if the loop hangs after producing the expected log lines. An explicit assert not t.is_alive() would lock in the clean-exit behavior this PR is targeting.
💡 Suggested fix

```diff
 t = threading.Thread(target=manager._heartbeat_loop, daemon=True)
 t.start()
 t.join(timeout=5.0)
+assert not t.is_alive()
```

```diff
 t = threading.Thread(target=manager._claim_loop, daemon=True)
 t.start()
 t.join(timeout=5.0)
+assert not t.is_alive()
```

Also applies to: 153-155
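The pitfall is easy to reproduce: `Thread.join(timeout=...)` returns silently whether or not the thread finished, so only an `is_alive()` check makes the test fail on a hang. A standalone illustration:

```python
import threading

blocker = threading.Event()
t = threading.Thread(target=blocker.wait, daemon=True)  # blocks until set
t.start()
t.join(timeout=0.2)   # returns with no error even though the thread is stuck
stuck = t.is_alive()  # True: join alone proved nothing
blocker.set()
t.join(timeout=5.0)
```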
Summary
- Adds an `_OutageState` dataclass to centralize outage tracking across `_heartbeat_loop` and `_claim_loop`
- Rate-limited retry warnings: `Server unreachable — retrying (30s/120s elapsed)`
- `logger.error()` instead of `logger.exception()` — no raw tracebacks for expected connection errors
- Exponential backoff in `_claim_loop` during sustained outage (2s → 4s → 8s → 10s cap)

Closes #908
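The warning format from the summary can be reproduced with a one-liner; the helper name is hypothetical, and only the message shape follows the log lines quoted in this PR.

```python
def retry_message(elapsed_s: int, limit_s: int) -> str:
    """Format the rate-limited retry warning shown in the PR's sample logs."""
    return f"Server unreachable — retrying ({elapsed_s}s/{limit_s}s elapsed)"

print(retry_message(30, 120))
```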
Test plan
- Unit tests for `_OutageState` (injectable clock, no server/threads needed)

🤖 Generated with Claude Code
Summary by CodeRabbit

- New Features
- Documentation
- Tests