fix(backend): retry manifest size lookup by riderx · Pull Request #2292 · Cap-go/capgo

riderx · 2026-05-18T15:15:26Z

Summary (AI generated)

Retry manifest file-size lookups when trusted storage metadata is unavailable instead of deleting the queue message as successful.
Cap on_manifest_create queue batches/concurrency to reduce large-manifest storage bursts.
Add structured storage/queue logging and a dry-run-first backfill script for existing zero-size manifest rows.

Motivation (AI generated)

Manifest rows could stay at file_size = 0 because the worker returned success when R2/S3 metadata lookup returned no size. Large manifests amplified this by letting the queue consumer fan out hundreds of storage lookups at once.

Business Impact (AI generated)

This restores reliable manifest size metadata, improves billing/storage accounting accuracy, and gives support/ops a repair path for affected historical bundles without trusting client-provided file sizes.

Test Plan (AI generated)

bun run lint:backend
bun run lint
bun run cli:lint
bun run typecheck
bunx vitest run tests/backend-alert-resilience.unit.test.ts tests/queue-consumer-message-shape.unit.test.ts
bun scripts/backfill_manifest_file_sizes.mjs --help
bunx eslint --no-ignore scripts/backfill_manifest_file_sizes.mjs
git diff --check

Summary by CodeRabbit

New Features
- Enhanced manifest size recovery with configurable retry behavior and queue-aware processing
- New backfill script to scan and populate missing manifest file sizes (dry-run by default; can apply)
Bug Fixes
- Improved handling of missing/unresolvable storage sizes to preserve retry semantics and surface actionable failures
- Safer queue dispatch with bounded concurrency and batch-size capping
Tests
- Added unit tests covering retry decisions and queue batch/concurrency behaviors
Chores
- Better storage error reporting and diagnostic output

chatgpt-codex-connector · 2026-05-18T15:15:33Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

coderabbitai · 2026-05-18T15:15:42Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fa7d5637-afa5-4417-ab0c-5b870391bb0d

📥 Commits

Reviewing files that changed from the base of the PR and between c40dc07 and 1120a7a.

📒 Files selected for processing (3)

scripts/backfill_manifest_file_sizes.mjs
supabase/functions/_backend/triggers/queue_consumer.ts
supabase/functions/_backend/utils/s3.ts

🚧 Files skipped from review as they are similar to previous changes (3)

supabase/functions/_backend/utils/s3.ts
scripts/backfill_manifest_file_sizes.mjs
supabase/functions/_backend/triggers/queue_consumer.ts

📝 Walkthrough

Walkthrough

Adds S3 sizing/fallback refactors, trigger retry logic for missing manifest sizes, queue consumer tuning (batch/concurrency and queue headers), a backfill CLI to populate missing sizes, and unit tests validating retry and queue behavior.

Changes

Manifest file size handling and queue optimization

Layer / File(s)	Summary
S3 sizing refactor with error handling `supabase/functions/_backend/utils/s3.ts`	`getSize` refactored: parses `Content-Range`/`Content-Length`, serializes storage errors, and performs a minimal `Range: bytes=0-0` GET fallback when HEAD is insufficient; logs structured outcomes and fallback usage.
Manifest size retry semantics in trigger `supabase/functions/_backend/triggers/on_manifest_create.ts`, `tests/backend-alert-resilience.unit.test.ts`	Trigger captures queue metadata, `getManifestSizeWithRetry` now returns `attempts`, added `shouldRetryManifestSizeLookup`, and `updateManifestSize` throws 503 `manifest_size_not_found` when size unresolved and no valid persisted `file_size` exists. Test added for retry decision logic.
Queue consumer batch/concurrency tuning and metadata `supabase/functions/_backend/triggers/queue_consumer.ts`, `tests/queue-consumer-message-shape.unit.test.ts`	Introduces per-queue batch and HTTP concurrency caps (e.g., manifest jobs reduced), bounded `mapWithConcurrency` worker pool, partitions messages by retry budget and archives exceeded-budget items, threads `QueueMessageMetadata` into `http_post_helper` as `x-capgo-queue-*` headers, cancels unused response bodies, and caps `/sync` requested batch size. Tests validate tuning and retry-budget behavior.
Manifest file size backfill CLI tool `scripts/backfill_manifest_file_sizes.mjs`	New CLI that pages `public.manifest` rows with missing/invalid `file_size`, concurrently resolves S3 object sizes (HEAD + presigned range fallback), supports dry-run and `--apply`, scoping options, concurrency controls, and emits a timestamped JSON report; exits with code `1` when unresolved sizes are found.
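
The bounded `mapWithConcurrency` worker pool named in the queue consumer row could look roughly like the sketch below. It is an illustration only: the signature, the fallback for an invalid `limit`, and the input-order results are assumptions, not the code merged in this PR.

```typescript
// Hypothetical sketch of a bounded worker pool; not the PR's actual code.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length)
  let next = 0 // index of the next unclaimed item

  // Each runner claims one item at a time until none remain.
  async function run(): Promise<void> {
    while (next < items.length) {
      const index = next++
      results[index] = await worker(items[index], index)
    }
  }

  // Assumption: an invalid limit falls back to a single runner.
  const safeLimit = Number.isInteger(limit) && limit > 0 ? limit : 1
  const runnerCount = Math.min(safeLimit, items.length)
  await Promise.all(Array.from({ length: runnerCount }, () => run()))
  return results
}
```

At most `limit` calls to `worker` are in flight at any time, and results keep the order of the input, so a large manifest batch cannot fan out into hundreds of simultaneous storage lookups.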

Sequence Diagram

sequenceDiagram
  participant QC as Queue Consumer
  participant Tuning as getQueueBatchSize
  participant Pool as mapWithConcurrency
  participant Helper as http_post_helper
  participant Trigger as on_manifest_create
  participant S3 as S3/R2 Storage

  QC->>Tuning: getQueueBatchSize(manifest)
  Tuning-->>QC: capped batch (e.g., 100)
  QC->>Pool: process messages with bounded workers (e.g., 10)
  Pool->>Helper: dispatch POST with QueueMessageMetadata headers
  Helper->>Trigger: invoke on_manifest_create
  Trigger->>S3: getSize(s3_path) via statObject (HEAD)
  alt HEAD returns size
    S3-->>Trigger: object size
  else HEAD insufficient
    Trigger->>S3: Range GET fallback (bytes=0-0)
    S3-->>Trigger: size or 0
  end
  Trigger->>Trigger: shouldRetryManifestSizeLookup?
  alt missing and no valid persisted file_size
    Trigger-->>Helper: respond 503 manifest_size_not_found (retain for retry)
  else valid or existing fileSize present
    Trigger-->>Helper: respond 200 success (update or keep)
  end
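
The last two steps of the diagram (range fallback, then the retry decision) can be sketched as below. The helper names and signatures are assumptions made for illustration. `sizeFromRangeResponse` is hypothetical, and the real `shouldRetryManifestSizeLookup` in `on_manifest_create.ts` may take different arguments. The sketch follows the stricter reading of the review comments, trusting only a 206 response that carries `Content-Range`.

```typescript
// Hypothetical helpers: names follow the walkthrough, signatures are assumed.

// Derive the total object size from the response to a `Range: bytes=0-0` GET.
// Only a 206 carrying `Content-Range: bytes 0-0/<total>` is trusted here.
function sizeFromRangeResponse(status: number, contentRange: string | null): number | undefined {
  if (status !== 206 || contentRange === null)
    return undefined
  const match = /^bytes \d+-\d+\/(\d+)$/.exec(contentRange.trim())
  if (!match)
    return undefined
  const total = Number(match[1])
  return Number.isSafeInteger(total) && total > 0 ? total : undefined
}

// A size counts as usable only if it is a finite, positive number.
function isValidSize(value: number | null | undefined): value is number {
  return typeof value === 'number' && Number.isFinite(value) && value > 0
}

// Retry only when storage yielded no usable size and the row has none either.
function shouldRetryManifestSizeLookup(
  resolvedSize: number | null | undefined,
  persistedFileSize: number | null | undefined,
): boolean {
  return !isValidSize(resolvedSize) && !isValidSize(persistedFileSize)
}
```

Under these assumptions, an error response or a 206 without `Content-Range` produces no size, so a row that still has `file_size = 0` is retried instead of being acknowledged as successful.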

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through headers, bytes, and queues,
Fetching sizes, chasing tiny clues;
When HEAD failed, I peeked with a range,
Tuned the queue, so retries arrange;
Backfill reports and tests — nimble news.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 5.88% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title 'fix(backend): retry manifest size lookup' clearly and concisely summarizes the main change—addressing manifest file-size lookup failures by implementing retry logic instead of treating them as successful.
Description check	✅ Passed	The PR description is comprehensive and well-structured with Summary, Motivation, Business Impact, and a detailed Test Plan section with checkmarks confirming completion of required linting, type-checking, and unit tests.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (1)

supabase/functions/_backend/triggers/queue_consumer.ts (1)

220-225: ⚡ Quick win

Keep the new queue tuning logs structured.

These new cloudlog() calls drop requestId and the batch metrics into an unstructured string, which makes the new concurrency/budget behavior harder to trace in production.

Proposed fix

-  cloudlog(`[${queueName}] Processing ${messagesToProcess.length} messages and skipping ${messagesToSkip.length} messages with concurrency ${processConcurrency}.`)
+  cloudlog({
+    requestId: c.get('requestId'),
+    message: `[${queueName}] Processing queue batch.`,
+    processingCount: messagesToProcess.length,
+    skippedCount: messagesToSkip.length,
+    concurrency: processConcurrency,
+  })

   // Archive messages after the configured retry budget is exhausted.
   if (messagesToSkip.length > 0) {
-    cloudlog(`[${queueName}] Archiving ${messagesToSkip.length} messages that exceeded the retry budget.`)
+    cloudlog({
+      requestId: c.get('requestId'),
+      message: `[${queueName}] Archiving messages that exceeded the retry budget.`,
+      archiveCount: messagesToSkip.length,
+    })
     await archive_queue_messages(c, db, queueName, messagesToSkip.map(msg => msg.msg_id))
   }

As per coding guidelines, "All endpoints must receive Hono Context<MiddlewareKeyVariables> object and use c.get('requestId') for structured logging with cloudlog()."

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/backfill_manifest_file_sizes.mjs`:
- Around line 52-54: The current getArgValue logic returns the next argv token
even if it's another flag (e.g., "--apply"), causing silent no-ops; update
getArgValue to validate the candidate value: after finding const index =
process.argv.indexOf(name) ensure process.argv[index + 1] exists and that it
does NOT start with '--' (or otherwise indicate a flag), and if the check fails
return undefined or throw a clear error; reference the getArgValue function and
the process.argv[index + 1] lookup so you replace the unconditional return with
a guarded check that rejects missing/flag tokens.
- Around line 141-160: The current code always calls parseObjectSizeFromHeaders
even for failed (4xx/5xx) or partial responses, which lets error XML bodies or
206 responses without Content-Range produce bogus sizes; change the logic to
only derive size for successful range responses: check response.status first and
if status is >= 400 return without a size, and only call
parseObjectSizeFromHeaders when response.status === 206 AND contentRange exists
(do not accept Content-Length as the full object size for 206 if Content-Range
is missing); ensure you still cancel response.body and return
contentLength/contentRange/method/status/statusText but leave size undefined
when not safely derivable (use the existing parseObjectSizeFromHeaders,
contentRange, contentLength and response.status symbols to locate and adjust the
logic).
- Around line 24-31: The current loop calling config({ path: resolve(__dirname,
envPath) }) loads env files in an order where earlier files win; change the
loading so later files override earlier ones by either reversing the envPath
array order or (preferably) passing override: true for subsequent loads; update
the loop that iterates over envPath (the config call) to call config({ path:
resolve(__dirname, envPath), override: true }) for files that should take
precedence (e.g., .env.local and cloudflare .envs) so .local values override
base .env values.

In `@supabase/functions/_backend/utils/s3.ts`:
- Around line 218-242: When doing the ranged GET fallback, don't parse
Content-Length/Content-Range for non-success responses: after awaiting
fetch(url...) check res.ok (or res.status) and if the response is not
successful, cancel res.body, log the failure via the existing cloudlog call
(including status/statusText/requestId/fileId/reason) and return a sentinel
(e.g. undefined/null) instead of calling parseObjectSizeFromHeaders; only call
parseObjectSizeFromHeaders and return size when res.ok is true. Ensure you use
the same local symbols (res, parseObjectSizeFromHeaders, cloudlog, fileId,
reason, c.get('requestId')) so the change is localized.

---

Nitpick comments:
In `@supabase/functions/_backend/triggers/queue_consumer.ts`:
- Around line 220-225: The new cloudlog calls that log concurrency and archiving
(using getQueueHttpConcurrency, processConcurrency, messagesToProcess,
messagesToSkip, queueName) must be converted to structured logs that include the
Hono requestId and batch metrics: obtain requestId via c.get('requestId') and
pass it plus metrics (e.g., queueName, processConcurrency,
messagesToProcess.length, messagesToSkip.length, and any batch id) as structured
fields to cloudlog instead of interpolated strings; update the cloudlog
invocations around where processConcurrency is computed and where archiving is
logged to use the context-derived requestId and explicit key/value fields so
production tracing remains consistent.
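
The `getArgValue` finding for `scripts/backfill_manifest_file_sizes.mjs` asks for a guard that rejects a missing value or a following flag. A minimal sketch of that guard follows, assuming the helper receives `argv` as a parameter (the script reads `process.argv` directly) and throws on a bad value. The `--limit` flag in the usage lines is a hypothetical example, not a documented option of the script.

```typescript
// Hypothetical version of getArgValue; argv is a parameter to keep the sketch testable.
function getArgValue(argv: readonly string[], name: string): string | undefined {
  const index = argv.indexOf(name)
  if (index === -1)
    return undefined // flag not supplied at all

  const candidate = argv[index + 1]
  // Reject a missing value or another flag such as "--apply".
  if (candidate === undefined || candidate.startsWith('--'))
    throw new Error(`Missing value for ${name}`)

  return candidate
}

// Usage with a hypothetical --limit flag:
const limit = getArgValue(['--limit', '10', '--apply'], '--limit')
const absent = getArgValue(['--apply'], '--limit')
```

Throwing rather than returning `undefined` makes a mistyped invocation fail loudly, which avoids the silent no-op described in the review comment.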

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 68a850c4-d1ac-498f-b4a1-f16cbac57b41

📥 Commits

Reviewing files that changed from the base of the PR and between 70f5798 and c40dc07.

📒 Files selected for processing (6)

scripts/backfill_manifest_file_sizes.mjs
supabase/functions/_backend/triggers/on_manifest_create.ts
supabase/functions/_backend/triggers/queue_consumer.ts
supabase/functions/_backend/utils/s3.ts
tests/backend-alert-resilience.unit.test.ts
tests/queue-consumer-message-shape.unit.test.ts

codspeed-hq · 2026-05-18T15:20:25Z

Merging this PR will not alter performance

✅ 43 untouched benchmarks
⏩ 2 skipped benchmarks¹

Comparing codex/fix-manifest-file-size-retry (1120a7a) with main (22ea833)

¹ 2 benchmarks were skipped, so the baseline results were used instead.

sonarqubecloud · 2026-05-18T15:51:39Z

Quality Gate passed

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

fix(backend): retry manifest size lookup

c40dc07

riderx temporarily deployed to deepsec-pr May 18, 2026 15:15 — with GitHub Actions Inactive

github-code-quality Bot found potential problems May 18, 2026

Comment thread scripts/backfill_manifest_file_sizes.mjs Fixed

coderabbitai Bot reviewed May 18, 2026

Comment thread scripts/backfill_manifest_file_sizes.mjs

Comment thread scripts/backfill_manifest_file_sizes.mjs Outdated

Comment thread scripts/backfill_manifest_file_sizes.mjs

Comment thread supabase/functions/_backend/utils/s3.ts

fix(backend): address manifest review comments

1120a7a

riderx temporarily deployed to deepsec-pr May 18, 2026 15:45 — with GitHub Actions Inactive

riderx merged commit f9982c6 into main May 18, 2026
54 of 55 checks passed

riderx deleted the codex/fix-manifest-file-size-retry branch May 18, 2026 16:12
