
fix(backend): retry manifest size lookup #2292

Merged: riderx merged 2 commits into main from codex/fix-manifest-file-size-retry on May 18, 2026

Conversation

@riderx
Member

@riderx riderx commented May 18, 2026

Summary (AI generated)

  • Retry manifest file-size lookups when trusted storage metadata is unavailable instead of deleting the queue message as successful.
  • Cap on_manifest_create queue batches/concurrency to reduce large-manifest storage bursts.
  • Add structured storage/queue logging and a dry-run-first backfill script for existing zero-size manifest rows.

Motivation (AI generated)

Manifest rows could stay at file_size = 0 because the worker returned success when R2/S3 metadata lookup returned no size. Large manifests amplified this by letting the queue consumer fan out hundreds of storage lookups at once.
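
The failure mode described above comes down to one decision: acknowledge the queue message, or leave it for retry. A minimal sketch of that decision follows, assuming a hypothetical `ManifestSizeLookup` shape and hypothetical helper names; the PR's actual `shouldRetryManifestSizeLookup` signature is not shown in this thread, and treating only positive finite numbers as valid sizes is also an assumption.

```typescript
// Hypothetical shape for illustration; not the PR's real types.
interface ManifestSizeLookup {
  // Size reported by storage (HEAD or range fallback), if any.
  size: number | null | undefined
  // file_size already persisted on the manifest row, if any.
  persistedFileSize: number | null | undefined
}

// Assumption: a usable size is a finite number greater than zero.
function isValidSize(value: number | null | undefined): value is number {
  return typeof value === 'number' && Number.isFinite(value) && value > 0
}

// Retry only when storage gave no usable size AND the row has nothing valid to keep.
function shouldRetry(lookup: ManifestSizeLookup): boolean {
  return !isValidSize(lookup.size) && !isValidSize(lookup.persistedFileSize)
}

// 503 keeps the message eligible for retry; 200 acknowledges it.
function responseStatus(lookup: ManifestSizeLookup): 200 | 503 {
  return shouldRetry(lookup) ? 503 : 200
}

console.log(responseStatus({ size: undefined, persistedFileSize: 0 }))
console.log(responseStatus({ size: 1024, persistedFileSize: 0 }))
```

The point of the sketch is that a missing size is no longer reported as success, so the queue's own retry mechanism gets a chance to resolve it later.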

Business Impact (AI generated)

This restores reliable manifest size metadata, improves billing/storage accounting accuracy, and gives support/ops a repair path for affected historical bundles without trusting client-provided file sizes.

Test Plan (AI generated)

  • bun run lint:backend
  • bun run lint
  • bun run cli:lint
  • bun run typecheck
  • bunx vitest run tests/backend-alert-resilience.unit.test.ts tests/queue-consumer-message-shape.unit.test.ts
  • bun scripts/backfill_manifest_file_sizes.mjs --help
  • bunx eslint --no-ignore scripts/backfill_manifest_file_sizes.mjs
  • git diff --check

Summary by CodeRabbit

  • New Features

    • Enhanced manifest size recovery with configurable retry behavior and queue-aware processing
    • New backfill script to scan and populate missing manifest file sizes (dry-run by default; can apply)
  • Bug Fixes

    • Improved handling of missing/unresolvable storage sizes to preserve retry semantics and surface actionable failures
    • Safer queue dispatch with bounded concurrency and batch-size capping
  • Tests

    • Added unit tests covering retry decisions and queue batch/concurrency behaviors
  • Chores

    • Better storage error reporting and diagnostic output


@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@coderabbitai
Contributor

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fa7d5637-afa5-4417-ab0c-5b870391bb0d

📥 Commits

Reviewing files that changed from the base of the PR and between c40dc07 and 1120a7a.

📒 Files selected for processing (3)
  • scripts/backfill_manifest_file_sizes.mjs
  • supabase/functions/_backend/triggers/queue_consumer.ts
  • supabase/functions/_backend/utils/s3.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • supabase/functions/_backend/utils/s3.ts
  • scripts/backfill_manifest_file_sizes.mjs
  • supabase/functions/_backend/triggers/queue_consumer.ts

📝 Walkthrough


Adds S3 sizing/fallback refactors, trigger retry logic for missing manifest sizes, queue consumer tuning (batch/concurrency and queue headers), a backfill CLI to populate missing sizes, and unit tests validating retry and queue behavior.

Changes

Manifest file size handling and queue optimization

| Layer / File(s) | Summary |
|---|---|
| S3 sizing refactor with error handling: `supabase/functions/_backend/utils/s3.ts` | `getSize` refactored: parses `Content-Range`/`Content-Length`, serializes storage errors, and performs a minimal `Range: bytes=0-0` GET fallback when HEAD is insufficient; logs structured outcomes and fallback usage. |
| Manifest size retry semantics in trigger: `supabase/functions/_backend/triggers/on_manifest_create.ts`, `tests/backend-alert-resilience.unit.test.ts` | Trigger captures queue metadata, `getManifestSizeWithRetry` now returns attempts, added `shouldRetryManifestSizeLookup`, and `updateManifestSize` throws 503 `manifest_size_not_found` when size unresolved and no valid persisted `file_size` exists. Test added for retry decision logic. |
| Queue consumer batch/concurrency tuning and metadata: `supabase/functions/_backend/triggers/queue_consumer.ts`, `tests/queue-consumer-message-shape.unit.test.ts` | Introduces per-queue batch and HTTP concurrency caps (e.g., manifest jobs reduced), bounded `mapWithConcurrency` worker pool, partitions messages by retry budget and archives exceeded-budget items, threads `QueueMessageMetadata` into `http_post_helper` as `x-capgo-queue-*` headers, cancels unused response bodies, and caps `/sync` requested batch size. Tests validate tuning and retry-budget behavior. |
| Manifest file size backfill CLI tool: `scripts/backfill_manifest_file_sizes.mjs` | New CLI that pages `public.manifest` rows with missing/invalid `file_size`, concurrently resolves S3 object sizes (HEAD + presigned range fallback), supports dry-run and `--apply`, scoping options, concurrency controls, and emits a timestamped JSON report; exits with code 1 when unresolved sizes are found. |
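
The bounded worker pool mentioned in the queue consumer row can be illustrated with a small sketch. This is one common way to write such a helper, not the PR's actual `mapWithConcurrency`; the signature, the clamping of `limit`, and the result ordering are all assumptions made for illustration.

```typescript
// Illustrative bounded-concurrency map. The signature is an assumption,
// not the implementation in queue_consumer.ts.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length)
  // Clamp the limit so a bad value can never produce zero workers.
  const safeLimit = Number.isFinite(limit) ? Math.max(1, Math.floor(limit)) : 1
  const workerCount = Math.min(safeLimit, Math.max(1, items.length))
  let next = 0

  // Each runner pulls the next unclaimed index until the list is exhausted.
  async function run(): Promise<void> {
    while (next < items.length) {
      const index = next++
      results[index] = await worker(items[index] as T, index)
    }
  }

  await Promise.all(Array.from({ length: workerCount }, () => run()))
  return results
}

// Usage: at most 2 lookups are in flight at any moment.
mapWithConcurrency(['a', 'b', 'c'], 2, async key => key.toUpperCase())
  .then(out => console.log(out))
```

Because each runner claims an index synchronously before awaiting, results keep the input order and no item is processed twice. A rejected worker rejects the whole call; a real consumer would likely catch per-message errors inside `worker` instead.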

Sequence Diagram

sequenceDiagram
  participant QC as Queue Consumer
  participant Tuning as getQueueBatchSize
  participant Pool as mapWithConcurrency
  participant Helper as http_post_helper
  participant Trigger as on_manifest_create
  participant S3 as S3/R2 Storage

  QC->>Tuning: getQueueBatchSize(manifest)
  Tuning-->>QC: capped batch (e.g., 100)
  QC->>Pool: process messages with bounded workers (e.g., 10)
  Pool->>Helper: dispatch POST with QueueMessageMetadata headers
  Helper->>Trigger: invoke on_manifest_create
  Trigger->>S3: getSize(s3_path) via statObject (HEAD)
  alt HEAD returns size
    S3-->>Trigger: object size
  else HEAD insufficient
    Trigger->>S3: Range GET fallback (bytes=0-0)
    S3-->>Trigger: size or 0
  end
  Trigger->>Trigger: shouldRetryManifestSizeLookup?
  alt missing and no valid persisted file_size
    Trigger-->>Helper: respond 503 manifest_size_not_found (retain for retry)
  else valid or existing fileSize present
    Trigger-->>Helper: respond 200 success (update or keep)
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through headers, bytes, and queues,
Fetching sizes, chasing tiny clues;
When HEAD failed, I peeked with a range,
Tuned the queue, so retries arrange;
Backfill reports and tests — nimble news.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.88% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title 'fix(backend): retry manifest size lookup' clearly and concisely summarizes the main change: addressing manifest file-size lookup failures by implementing retry logic instead of treating them as successful. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured with Summary, Motivation, Business Impact, and a detailed Test Plan section with checkmarks confirming completion of required linting, type-checking, and unit tests. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
supabase/functions/_backend/triggers/queue_consumer.ts (1)

220-225: ⚡ Quick win

Keep the new queue tuning logs structured.

These new cloudlog() calls drop requestId and the batch metrics into an unstructured string, which makes the new concurrency/budget behavior harder to trace in production.

Proposed fix
-  cloudlog(`[${queueName}] Processing ${messagesToProcess.length} messages and skipping ${messagesToSkip.length} messages with concurrency ${processConcurrency}.`)
+  cloudlog({
+    requestId: c.get('requestId'),
+    message: `[${queueName}] Processing queue batch.`,
+    processingCount: messagesToProcess.length,
+    skippedCount: messagesToSkip.length,
+    concurrency: processConcurrency,
+  })

   // Archive messages after the configured retry budget is exhausted.
   if (messagesToSkip.length > 0) {
-    cloudlog(`[${queueName}] Archiving ${messagesToSkip.length} messages that exceeded the retry budget.`)
+    cloudlog({
+      requestId: c.get('requestId'),
+      message: `[${queueName}] Archiving messages that exceeded the retry budget.`,
+      archiveCount: messagesToSkip.length,
+    })
     await archive_queue_messages(c, db, queueName, messagesToSkip.map(msg => msg.msg_id))
   }

As per coding guidelines, "All endpoints must receive Hono Context<MiddlewareKeyVariables> object and use c.get('requestId') for structured logging with cloudlog()."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@supabase/functions/_backend/triggers/queue_consumer.ts` around lines 220 -
225, The new cloudlog calls that log concurrency and archiving (using
getQueueHttpConcurrency, processConcurrency, messagesToProcess, messagesToSkip,
queueName) must be converted to structured logs that include the Hono requestId
and batch metrics: obtain requestId via c.get('requestId') and pass it plus
metrics (e.g., queueName, processConcurrency, messagesToProcess.length,
messagesToSkip.length, and any batch id) as structured fields to cloudlog
instead of interpolated strings; update the cloudlog invocations around where
processConcurrency is computed and where archiving is logged to use the
context-derived requestId and explicit key/value fields so production tracing
remains consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/backfill_manifest_file_sizes.mjs`:
- Around line 52-54: The current getArgValue logic returns the next argv token
even if it's another flag (e.g., "--apply"), causing silent no-ops; update
getArgValue to validate the candidate value: after finding const index =
process.argv.indexOf(name) ensure process.argv[index + 1] exists and that it
does NOT start with '--' (or otherwise indicate a flag), and if the check fails
return undefined or throw a clear error; reference the getArgValue function and
the process.argv[index + 1] lookup so you replace the unconditional return with
a guarded check that rejects missing/flag tokens.
- Around line 141-160: The current code always calls parseObjectSizeFromHeaders
even for failed (4xx/5xx) or partial responses, which lets error XML bodies or
206 responses without Content-Range produce bogus sizes; change the logic to
only derive size for successful range responses: check response.status first and
if status is >= 400 return without a size, and only call
parseObjectSizeFromHeaders when response.status === 206 AND contentRange exists
(do not accept Content-Length as the full object size for 206 if Content-Range
is missing); ensure you still cancel response.body and return
contentLength/contentRange/method/status/statusText but leave size undefined
when not safely derivable (use the existing parseObjectSizeFromHeaders,
contentRange, contentLength and response.status symbols to locate and adjust the
logic).
- Around line 24-31: The current loop calling config({ path: resolve(__dirname,
envPath) }) loads env files in an order where earlier files win; change the
loading so later files override earlier ones by either reversing the envPath
array order or (preferably) passing override: true for subsequent loads; update
the loop that iterates over envPath (the config call) to call config({ path:
resolve(__dirname, envPath), override: true }) for files that should take
precedence (e.g., .env.local and cloudflare .envs) so .local values override
base .env values.
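
The guarded lookup requested for `getArgValue` might look like the sketch below. It takes `argv` as a parameter instead of reading `process.argv` directly so it can be exercised in isolation, and it throws instead of returning `undefined` when the value is missing; both choices are assumptions, as is the `--limit` flag used in the usage lines (only `--apply` and `--help` appear in this PR).

```typescript
// Sketch of a guarded CLI value lookup. Not the script's actual code.
function getArgValue(argv: readonly string[], name: string): string | undefined {
  const index = argv.indexOf(name)
  // Flag not present at all: nothing to return.
  if (index === -1)
    return undefined

  const candidate = argv[index + 1]
  // Reject a missing token or another flag (e.g. "--apply") used as the value.
  if (candidate === undefined || candidate.startsWith('--'))
    throw new Error(`Missing value for ${name}`)

  return candidate
}

// Usage with a hypothetical --limit option.
console.log(getArgValue(['--limit', '50', '--apply'], '--limit'))
console.log(getArgValue(['--apply'], '--limit'))
```

Failing loudly on `--limit --apply` avoids the silent no-op described in the comment, where the next flag would otherwise be consumed as the option's value.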

In `@supabase/functions/_backend/utils/s3.ts`:
- Around line 218-242: When doing the ranged GET fallback, don't parse
Content-Length/Content-Range for non-success responses: after awaiting
fetch(url...) check res.ok (or res.status) and if the response is not
successful, cancel res.body, log the failure via the existing cloudlog call
(including status/statusText/requestId/fileId/reason) and return a sentinel
(e.g. undefined/null) instead of calling parseObjectSizeFromHeaders; only call
parseObjectSizeFromHeaders and return size when res.ok is true. Ensure you use
the same local symbols (res, parseObjectSizeFromHeaders, cloudlog, fileId,
reason, c.get('requestId')) so the change is localized.
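
Both range-fallback findings ask for the same guard: derive a size only from a response that can be trusted. A minimal sketch under stated assumptions follows. `sizeFromRangeResponse` is a hypothetical helper, not the existing `parseObjectSizeFromHeaders`; it applies the stricter rule from the script comment (status 206 and a `Content-Range` total) and treats a zero or unknown total as unresolved.

```typescript
// Hypothetical helper: returns the full object size from a ranged GET,
// or undefined when the size cannot be safely derived.
function sizeFromRangeResponse(status: number, contentRange: string | null): number | undefined {
  // Error responses (4xx/5xx) and non-partial responses carry no trustworthy total here.
  if (status !== 206 || contentRange === null)
    return undefined

  // Expected form: "bytes 0-0/12345". A total of "*" means the size is unknown.
  const match = /^bytes\s+\d+-\d+\/(\d+)$/i.exec(contentRange.trim())
  if (!match)
    return undefined

  const total = Number(match[1])
  // Assumption: a zero total is treated as unresolved, mirroring file_size = 0 rows.
  return Number.isSafeInteger(total) && total > 0 ? total : undefined
}

console.log(sizeFromRangeResponse(206, 'bytes 0-0/12345'))
console.log(sizeFromRangeResponse(403, null))
```

Note that `Content-Length` is deliberately ignored: for a `bytes=0-0` request it describes the one-byte slice, or the length of an error body, and not the object.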


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 68a850c4-d1ac-498f-b4a1-f16cbac57b41

📥 Commits

Reviewing files that changed from the base of the PR and between 70f5798 and c40dc07.

📒 Files selected for processing (6)
  • scripts/backfill_manifest_file_sizes.mjs
  • supabase/functions/_backend/triggers/on_manifest_create.ts
  • supabase/functions/_backend/triggers/queue_consumer.ts
  • supabase/functions/_backend/utils/s3.ts
  • tests/backend-alert-resilience.unit.test.ts
  • tests/queue-consumer-message-shape.unit.test.ts

@codspeed-hq
Contributor

codspeed-hq Bot commented May 18, 2026

Merging this PR will not alter performance

✅ 43 untouched benchmarks
⏩ 2 skipped benchmarks (see footnote 1)


Comparing codex/fix-manifest-file-size-retry (1120a7a) with main (22ea833)


Footnotes

  1. 2 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@riderx riderx merged commit f9982c6 into main May 18, 2026
54 of 55 checks passed
@riderx riderx deleted the codex/fix-manifest-file-size-retry branch May 18, 2026 16:12