fix(script): fan out full manifest backfill batches by riderx · Pull Request #2318 · Cap-go/capgo

riderx · 2026-05-21T12:50:49Z

Summary (AI generated)

Remove the storage metadata concurrency throttle from the manifest file-size backfill.
Let each worker fan out the full fixed 1000-row candidate batch at once.
Keep --concurrency accepted as a backward-compatible no-op so older commands still run.

Motivation (AI generated)

The backfill needs to process millions of manifest rows quickly. The DB/API candidate read remains bounded at 1000 rows, but storage metadata checks should use the full batch capacity per worker instead of a smaller extra throttle.

Business Impact (AI generated)

This increases production backfill throughput so missing manifest file sizes can be corrected faster, while preserving indexed manifest.id scans and bulk DB updates.

Test Plan (AI generated)

node --check scripts/backfill_manifest_file_sizes.mjs
bun scripts/backfill_manifest_file_sizes.mjs --help
bunx eslint --no-ignore scripts/backfill_manifest_file_sizes.mjs
git diff --check origin/main...HEAD
bun lint

Summary by CodeRabbit

Chores
- Updated manifest file size backfill script's concurrency model to improve parallel processing efficiency of storage operations.

coderabbitai · 2026-05-21T12:52:08Z

Warning

Rate limit exceeded

@riderx has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 8 minutes and 32 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c04b2093-7f8d-4b4b-8238-108b43bf0877

📥 Commits

Reviewing files that changed from the base of the PR and between 5e3686e and 42567da.

📒 Files selected for processing (1)

scripts/backfill_manifest_file_sizes.mjs

📝 Walkthrough

Walkthrough

scripts/backfill_manifest_file_sizes.mjs refactors its storage concurrency model to use a fan-out-based approach derived from batch size and worker count, replacing the --concurrency option with derived storageFanoutPerWorker. The core processCandidates function now fetches S3 object sizes in parallel via Promise.all instead of limiting concurrency per worker. The obsolete mapWithConcurrency helper is removed, and all progress reporting and logs are updated to use fan-out terminology.

Changes

Storage concurrency refactor

Layer / File(s)	Summary
CLI interface and concurrency model `scripts/backfill_manifest_file_sizes.mjs`	Help text marks `--concurrency` as backward-compatible no-op; option parsing removes concurrency reading and computes `storageFanoutPerWorker` from `batchSize` and `effectiveStorageFanout` from worker count; usage example updated to `--workers=16`; per-worker storage concurrency option removed from `processCandidates` call.
Storage fetch refactoring `scripts/backfill_manifest_file_sizes.mjs`	`mapWithConcurrency` helper removed; `processCandidates` refactored to fetch all candidate object sizes in parallel using `Promise.all` instead of per-worker concurrency limiting.
Report and logging updates `scripts/backfill_manifest_file_sizes.mjs`	Report object fields and startup log updated to record `storageFanoutPerWorker` and `effectiveStorageFanout` instead of previous concurrency fields.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Cap-go/capgo#2313: Refactors the same backfill_manifest_file_sizes.mjs script—reworking worker/concurrency handling and bulk backfill flow in parallel with this PR's concurrency model changes.
Cap-go/capgo#2292: Earlier PR that modified scripts/backfill_manifest_file_sizes.mjs; this PR extends that work by refactoring the concurrency and parallel fetch model.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main change: removing storage concurrency throttling to allow full batch fan-out during manifest backfill.
Description check	✅ Passed	The description covers the main summary, motivation, and business impact well, with a comprehensive test plan included. However, the Screenshots and Checklist sections required by the template are missing or incomplete.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codspeed-hq · 2026-05-21T12:58:26Z

Merging this PR will not alter performance

✅ 43 untouched benchmarks
⏩ 2 skipped benchmarks¹

_{Comparing codex/full-batch-manifest-backfill (42567da) with main (a114796)}

2 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

chatgpt-codex-connector · 2026-05-21T13:06:33Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/backfill_manifest_file_sizes.mjs`:
- Line 654: Update the help text for the --concurrency option (the string
currently saying "Backward-compatible no-op; each worker checks its full
1000-row batch at once.") to also explain the rationale for its removal—e.g.,
note that storage requests now use full-batch parallelism for maximum throughput
and that concurrency was previously limited per-worker—so the help message
clarifies why --concurrency is a no-op.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 74782cf1-1b67-4e27-a7c4-d5051b9e01fd

📥 Commits

Reviewing files that changed from the base of the PR and between 9523f13 and 5e3686e.

📒 Files selected for processing (1)

scripts/backfill_manifest_file_sizes.mjs

sonarqubecloud · 2026-05-21T13:41:52Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

hadessunxy-code

I checked the backfill change and the decoded-path fallback looks sensible, but I would be careful with the new fan-out behavior before merging.

--all defaults to 8 workers and each worker now Promise.alls a full 1000-row batch, so the default path can issue up to ~8,000 storage metadata requests at once. If R2/S3 starts returning transient 429/5xx responses, the retry loop will also sleep and retry those failed calls in large synchronized waves. That is a pretty big behavior change from the previous --concurrency cap, and it also makes the still-accepted --concurrency flag unable to protect production runs.

I would either keep a real storage concurrency limiter around the per-row getObjectSizeWithRetry calls, or make the full-batch fan-out opt-in / bounded by a new explicit high-watermark. That keeps the legacy backfill safe for production operators while still allowing the faster mode when the storage backend can absorb it.

fix(script): fan out full manifest backfill batches

5e3686e

riderx temporarily deployed to deepsec-pr May 21, 2026 12:51 — with GitHub Actions Inactive

riderx marked this pull request as ready for review May 21, 2026 13:06

coderabbitai Bot requested changes May 21, 2026

View reviewed changes

Comment thread scripts/backfill_manifest_file_sizes.mjs Outdated

docs(script): clarify manifest backfill fanout

45c7507

riderx temporarily deployed to deepsec-pr May 21, 2026 13:12 — with GitHub Actions Inactive

fix(script): retry decoded manifest storage keys

42567da

riderx temporarily deployed to deepsec-pr May 21, 2026 13:38 — with GitHub Actions Inactive

hadessunxy-code reviewed May 21, 2026

View reviewed changes

riderx merged commit bb60c38 into main May 21, 2026
42 checks passed

riderx deleted the codex/full-batch-manifest-backfill branch May 21, 2026 14:10

coderabbitai Bot mentioned this pull request May 21, 2026

fix(manifest): check upload-encoded storage keys #2320

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(script): fan out full manifest backfill batches#2318

fix(script): fan out full manifest backfill batches#2318
riderx merged 3 commits into
mainfrom
codex/full-batch-manifest-backfill

riderx commented May 21, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 21, 2026 •

edited

Loading

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

codspeed-hq Bot commented May 21, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot commented May 21, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

sonarqubecloud Bot commented May 21, 2026

Uh oh!

hadessunxy-code left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

riderx commented May 21, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary (AI generated)

Motivation (AI generated)

Business Impact (AI generated)

Test Plan (AI generated)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

codspeed-hq Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

Footnotes

Uh oh!

chatgpt-codex-connector Bot commented May 21, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sonarqubecloud Bot commented May 21, 2026

Quality Gate passed

Uh oh!

hadessunxy-code left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

riderx commented May 21, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 21, 2026 •

edited

Loading

codspeed-hq Bot commented May 21, 2026 •

edited

Loading