[ci-flake] miniflare-remote-resources.test.ts shard 3 flakes on transient API 5xx #13831

@petebacondarwin

Summary

The e2e/remote-binding/miniflare-remote-resources.test.ts suite (Wrangler E2E shard 3/4 across macOS/Linux/Windows) regularly fails when the Cloudflare API returns a transient 5xx during the shared beforeAll setup. Because all ~16 test cases share a single beforeAll, a single transient 5xx skips every test in the suite and fails the job. Vitest's automatic retry re-runs the same beforeAll against the same flaky API window and usually fails again.

Recent occurrences

Run | Branch | Failed shards | First failing API call
--- | --- | --- | ---
25435066708 | changeset-release/main | macOS, Linux shard 3 | shared suite
25428017066 | pbd/agent-memory (#13610) | macOS, Linux, Windows shard 3 | POST /workers/scripts/{id}/edge-preview → 500 (Ray 9f78853b3ef09913-SJC); on retries also /r2/buckets → code 10001 and /mtls_certificates/{id} failures

Both runs were on the same day, ~2 hours apart. The branches share no relevant code change, which points to an environmental failure rather than a regression on either branch.

Root cause

packages/wrangler/e2e/remote-binding/miniflare-remote-resources.test.ts:670-720 packs ~16 unrelated test cases (KV, R2, D1, Vectorize, Browser, Service Binding, AI Search, Dispatch, Agent Memory, VPC, …) into one describe with one shared beforeAll:

beforeAll(async () => {
  for (const testCase of activeTestCases) {
    testConfigs.push(await testCase.setup(helper));   // hits real Cloudflare APIs
  }
  const remoteProxySession = await startRemoteProxySession(...);  // ← single call, no retry
});

Contributing factors:

  1. No retry on transient 5xx. startRemoteProxySession (packages/wrangler/src/api/remoteBindings/start-remote-proxy-session.ts:38-65) makes one startWorker(...) call; the catch just rethrows.
  2. Atomic failure across unrelated cases. A 500 on R2 bucket creation for the "AI Search Namespace" setup also fails the Agent Memory test, the Dispatch Namespace test, and 13 others.
  3. Failure probability grows with N. With ~16 sequential setup calls each hitting a different Cloudflare endpoint, the chance that at least one fails is 1 - (1 - p)^N for a per-call transient rate p (assuming independent calls), which is roughly N times the single-call rate when p is small. This is likely why shard 3 (the one with this suite) is the flakiest.
  4. Vitest's automatic retry doesn't help. It re-runs the whole file; the entire beforeAll runs again, often hitting the same degraded window.
  5. Error message obscures cause. The visible error is "Failed to start the remote proxy session. There is likely additional logging output above" — the actual 5xx is buried in stderr.
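
To put a rough number on factor 3, here is a small sketch of the arithmetic. The 1% per-call failure rate is an assumed value for illustration only, not a measured figure, and the calculation assumes the calls fail independently (which is optimistic during a shared degraded window).

```typescript
// Probability that at least one of `calls` independent API calls fails,
// given a per-call transient failure rate `p`.
function suiteFailureProbability(p: number, calls: number): number {
  return 1 - Math.pow(1 - p, calls);
}

const perCallRate = 0.01; // ASSUMPTION: 1% transient 5xx rate, illustrative only
const setupCalls = 16; // approximate number of sequential setup calls in the suite

const exact = suiteFailureProbability(perCallRate, setupCalls);
const linear = perCallRate * setupCalls; // the "N times the single-call rate" approximation

console.log(`exact: ${exact.toFixed(3)}, linear approximation: ${linear.toFixed(3)}`);
```

Under those assumptions the suite fails in roughly 15% of runs, close to the 16% that the linear approximation gives.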

There's already an e2e/helpers/retry.ts used in dev.test.ts and deployments.test.ts, but it isn't used in this test.
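
For reference, a retry wrapper of the kind discussed here could look roughly like the sketch below. This is not the contents of e2e/helpers/retry.ts, whose signature may differ; `retryTransient`, `RetryOptions` and `looksLike5xx` are hypothetical names, and matching on a 5xx in the error message is an assumed error shape.

```typescript
// HYPOTHETICAL helper for illustration; not the repository's e2e/helpers/retry.ts.
type RetryOptions = {
  attempts: number; // total attempts, including the first
  baseDelayMs: number; // delay before the second attempt; doubles on each retry
  isTransient: (error: unknown) => boolean; // decides whether an error is worth retrying
  sleep?: (ms: number) => Promise<void>; // injectable so tests do not have to wait
};

const defaultSleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryTransient<T>(
  action: () => Promise<T>,
  options: RetryOptions
): Promise<T> {
  const sleep = options.sleep ?? defaultSleep;
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.attempts; attempt++) {
    try {
      return await action();
    } catch (error) {
      lastError = error;
      // Give up immediately on non-transient errors or on the final attempt.
      if (!options.isTransient(error) || attempt === options.attempts) {
        throw error;
      }
      // Exponential backoff: baseDelayMs, 2x, 4x, ...
      await sleep(options.baseDelayMs * 2 ** (attempt - 1));
    }
  }
  throw lastError; // only reached if attempts < 1
}

// ASSUMED error shape: an Error whose message mentions a 5xx status code.
const looksLike5xx = (error: unknown): boolean =>
  error instanceof Error && /\b5\d\d\b/.test(error.message);
```

In the suite's beforeAll, the `startRemoteProxySession(...)` call could then be passed as the `action` callback. Backing off between attempts matters here, because an immediate retry is likely to land in the same degraded API window.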

Suggested fixes (smallest → largest)

  1. Wrap startRemoteProxySession(...) calls inside beforeAll in retry(). Test-only; targets the most common observed flake.
  2. Wrap individual setup helpers (helper.r2, helper.dispatchNamespace, helper.cert, helper.tunnel, …) in retry() so the dispatched API call is retried, not just the proxy worker upload.
  3. Wrap the whole beforeAll in a retry-with-full-teardown loop. Cleaner if individual setups aren't idempotent.
  4. Refactor: split each test case into its own describe/beforeAll (mTLS at line 736 already does this). Eliminates the blast-radius problem entirely; biggest change.
  5. Optionally: retry transient 5xx in createPreviewToken (packages/wrangler/src/dev/create-worker-preview.ts:251) for parity with how dev users would experience it. Production change; needs care.
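
As a rough illustration of option 3, the sketch below retries the whole setup sequence and tears down whatever was created before each new attempt. The `Resource` shape and `setupAllWithTeardown` are hypothetical and not taken from the test file; the real setups return test configs via `testCase.setup(helper)`, and a real version would probably also back off between attempts.

```typescript
// HYPOTHETICAL resource shape: something created by a setup step that can be removed again.
type Resource = { name: string; teardown: () => Promise<void> };

async function setupAllWithTeardown(
  setups: Array<() => Promise<Resource>>,
  maxAttempts: number
): Promise<Resource[]> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const created: Resource[] = [];
    try {
      // Run the setups sequentially, as the existing beforeAll does.
      for (const setup of setups) {
        created.push(await setup());
      }
      return created;
    } catch (error) {
      lastError = error;
      // Tear down in reverse creation order so the next attempt starts clean.
      // Teardown errors are swallowed so they do not mask the original failure.
      for (const resource of created.reverse()) {
        await resource.teardown().catch(() => undefined);
      }
    }
  }
  throw lastError;
}
```

The trade-off against option 2 is that one failing setup re-runs every setup, so each retry makes ~16 more API calls instead of one.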

Prior flake-reduction work

Labels: ci-flake (applied to PRs addressing CI flakiness), e2e (run wrangler + vite-plugin e2e tests on a PR)

Status: Done
