feat(cloud): retry transient 5xx during device-code poll#16
Merged
Conversation
Real-world deploy hit a 502 on the third poll after admin approval (Forgejo SSH-key registration upstream blip). Server-side the device_code row was rolled back to approved + unconsumed so a retry would succeed, but the workstation surfaced ``CloudAPIError`` → ``Enrollment failed`` and bailed. Adds a small retry budget on transient errors inside the poll loop in ``WorkstationSetup.from_device_code``: a ``CloudNetworkError`` or ``CloudAPIError`` no longer terminates the flow — it sleeps the configured ``interval_seconds`` and re-polls. After ``_POLL_TRANSIENT_RETRY_LIMIT`` (= 6) consecutive transient failures the original exception propagates so a persistent outage doesn't spin forever. The terminal RFC 8628 states (denied / expired / fingerprint rejected) still propagate immediately. Pending state still resets the transient counter. Tests: - ``test_transient_502_during_poll_is_retried`` — pending → 502 → 502 → 200, three sleeps, success. - ``test_persistent_5xx_eventually_propagates`` — ``RETRY_LIMIT + 2`` consecutive 502s → CloudAPIError raised. 138 → 140 cloud tests, all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The retry path inside ``from_device_code`` already logs the full exception, but the outer worker only emitted ``type(exc).__name__`` per the security-review precedent for ``WorkstationSetupError`` — which can carry a keyring-recovery token in its message. ``CloudAPIError`` does NOT carry token material (no ``api_token`` is ever stuffed into a CloudAPIError; it only holds the upstream HTTP status + truncated response body). Splitting the except branches lets the worker log the full ``CloudAPIError`` message while keeping the ``WorkstationSetupError`` path type-only for safety. Useful for diagnosing transient backend failures like the 2026-05-10 17:03 UTC Forgejo upstream blip where workstation logs only said ``CloudAPIError`` with no detail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Real-world deploy hit a 502 on the third poll after admin approval (Forgejo SSH-key registration upstream blip — see Cloud Run logs around 2026-05-10 17:03 UTC).
Server side (in pychronAPI PR #39) the failure rolls back the device-code row to approved + unconsumed so a retry would re-attempt the mint. Client side (this PR) the worker treated `CloudAPIError` as terminal → "Enrollment failed" and bailed, leaving the admin's approval wasted.
This PR adds a retry budget inside the `WorkstationSetup.from_device_code` poll loop: `CloudNetworkError` / `CloudAPIError` (5xx) sleep the configured interval and re-poll. After `_POLL_TRANSIENT_RETRY_LIMIT` (= 6) consecutive transient failures the original exception propagates so a persistent outage surfaces clearly.
The terminal RFC 8628 states (denied / expired / fingerprint rejected) still propagate immediately. Pending state still resets the transient counter so a long approve wait doesn't burn the retry budget.
Test plan
Pairs with
PychronLabsLLC/pychronAPI#39 — server-side improves `logger.exception` so Cloud Run captures the upstream Forgejo error detail.
🤖 Generated with Claude Code