Skip to content

feat(cloud): retry transient 5xx during device-code poll#16

Merged
jirhiker merged 2 commits into
mainfrom
feat/device-code-retry-transient
May 10, 2026
Merged

feat(cloud): retry transient 5xx during device-code poll#16
jirhiker merged 2 commits into
mainfrom
feat/device-code-retry-transient

Conversation

@jirhiker
Copy link
Copy Markdown

Summary

Real-world deploy hit a 502 on the third poll after admin approval (Forgejo SSH-key registration upstream blip — see Cloud Run logs around 2026-05-10 17:03 UTC).

Server side (in pychronAPI PR #39) the failure rolls back the device-code row to approved + unconsumed so a retry would re-attempt the mint. Client side (this PR) the worker treated `CloudAPIError` as terminal → "Enrollment failed" and bailed, leaving the admin's approval wasted.

This PR adds a retry budget inside the `WorkstationSetup.from_device_code` poll loop: `CloudNetworkError` / `CloudAPIError` (5xx) sleep the configured interval and re-poll. After `_POLL_TRANSIENT_RETRY_LIMIT` (= 6) consecutive transient failures the original exception propagates so a persistent outage surfaces clearly.

The terminal RFC 8628 states (denied / expired / fingerprint rejected) still propagate immediately. Pending state still resets the transient counter so a long approve wait doesn't burn the retry budget.

Test plan

  • `test_transient_502_during_poll_is_retried` — pending → 502 → 502 → 200, three sleeps, success.
  • `test_persistent_5xx_eventually_propagates` — RETRY_LIMIT + 2 consecutive 502s → CloudAPIError raised.
  • 138 → 140 cloud tests, all green.

Pairs with

PychronLabsLLC/pychronAPI#39 — server-side improves `logger.exception` so Cloud Run captures the upstream Forgejo error detail.

🤖 Generated with Claude Code

jirhiker and others added 2 commits May 10, 2026 11:17
Real-world deploy hit a 502 on the third poll after admin approval
(Forgejo SSH-key registration upstream blip). Server-side the
device_code row was rolled back to approved + unconsumed so a retry
would succeed, but the workstation surfaced ``CloudAPIError`` →
``Enrollment failed`` and bailed.

Adds a small retry budget on transient errors inside the poll loop
in ``WorkstationSetup.from_device_code``: a ``CloudNetworkError`` or
``CloudAPIError`` no longer terminates the flow — it sleeps the
configured ``interval_seconds`` and re-polls. After
``_POLL_TRANSIENT_RETRY_LIMIT`` (= 6) consecutive transient failures
the original exception propagates so a persistent outage doesn't
spin forever.

The terminal RFC 8628 states (denied / expired / fingerprint
rejected) still propagate immediately. Pending state still resets
the transient counter.

Tests:
- ``test_transient_502_during_poll_is_retried`` — pending → 502 →
  502 → 200, three sleeps, success.
- ``test_persistent_5xx_eventually_propagates`` —
  ``RETRY_LIMIT + 2`` consecutive 502s → CloudAPIError raised.

138 → 140 cloud tests, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The retry path inside ``from_device_code`` already logs the full
exception, but the outer worker only emitted ``type(exc).__name__``
per the security-review precedent for ``WorkstationSetupError`` —
which can carry a keyring-recovery token in its message.

``CloudAPIError`` does NOT carry token material (no ``api_token`` is
ever stuffed into a CloudAPIError; it only holds the upstream HTTP
status + truncated response body). Splitting the except branches
lets the worker log the full ``CloudAPIError`` message while keeping
the ``WorkstationSetupError`` path type-only for safety.

Useful for diagnosing transient backend failures like the
2026-05-10 17:03 UTC Forgejo upstream blip where workstation logs
only said ``CloudAPIError`` with no detail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jirhiker jirhiker merged commit f54f5fb into main May 10, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant