Skip to content

[codex] Chunk translation queue jobs#615

Merged
riderx merged 7 commits into
mainfrom
codex/chunk-translation-queue
May 1, 2026
Merged

[codex] Chunk translation queue jobs#615
riderx merged 7 commits into
mainfrom
codex/chunk-translation-queue

Conversation

@riderx
Copy link
Copy Markdown
Member

@riderx riderx commented May 1, 2026

Summary

  • split translation queue processing into small persisted batches so slow model calls do not make the whole page job fail before cache write
  • store partial translation progress in the server-side cache and enqueue continuation jobs until the full localized page is cached
  • add queue CPU headroom and clearer serialized error logging for future Cloudflare logs

Root cause

The live /id/ page was redirecting to English because the Indonesian translation cache was empty. The queue consumer was failing after ~108s wall time while translating the full page in one invocation, so it never wrote the cached localized HTML. The logged CPU time was only 69ms, so increasing CPU alone would not fix it.

Validation

  • bun run ci:verify:translation
  • bunx prettier --check apps/translation-worker/src/index.ts apps/translation-worker/wrangler.jsonc
  • cd apps/translation-worker && bunx wrangler deploy --dry-run

Summary by CodeRabbit

  • New Features

    • Incremental/resumable HTML translation that persists intermediate batches and continues across background jobs until a final translated page is produced.
  • Refactor

    • Safer, more efficient batch processing: smaller batches, one-batch-per-job progression, validation of partial state, fresher-serving logic (prefer stored results and repopulate cache), and standardized error logging.
  • Chores

    • Added persistent translation storage bindings, increased worker execution limits, and CI steps to ensure translation storage buckets exist for deployment.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds incremental, batch-by-batch HTML translation using a deterministic sourceHash and R2-backed storage for partial and final translations; enqueues reason: 'continue' jobs to translate one missing batch per job, persists/merges partial state in R2, and restores final results to Cache API. Error logging unified via errorMessage(...).

Changes

Cohort / File(s) Summary
Translation worker core
apps/translation-worker/src/index.ts
Implements incremental continuation with optional sourceHash, deterministic source hashing, R2-backed PartialTranslationState and StoredTranslatedResponse, reduced batching limits and TRANSLATION_BATCHES_PER_QUEUE_JOB = 1, per-job translation of a single missing batch, read/merge/persist partial state to R2, enqueue reason: 'continue' until completion, final render/write to R2 and Cache API, and standardized errorMessage(...) logging.
Worker configuration
apps/translation-worker/wrangler.jsonc
Adds limits.cpu_ms: 300000, binds TRANSLATION_STORE R2 bucket for production and development, and sets ai.binding plus vars.TRANSLATION_MODEL in env.development.
CI / infra workflow
.github/workflows/deploy-translation.yml
Adds pre-deploy step to create/ensure R2 buckets (capgo-translation-cache, capgo-translation-cache-development) with tolerant handling of “already exists” output.
Helpers / logging
(in worker)
Adds and uses errorMessage(...) helper for consistent error serialization across worker flows.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Queue as Queue Service
    participant Worker as Translation Worker
    participant R2 as R2 (Partial / Final)
    participant Origin as Origin HTML

    Client->>Queue: Enqueue translation job
    Queue-->>Worker: Deliver job (may include reason: 'continue', sourceHash)
    Worker->>Origin: Fetch source HTML
    Origin-->>Worker: Return HTML or redirect/non-HTML

    Worker->>R2: Lookup PartialTranslationState by sourceHash
    alt Partial exists
        R2-->>Worker: PartialTranslationState
    else No partial
        Worker->>Worker: Compute sourceHash, split into batches
    end

    Worker->>Worker: Translate up to one missing batch
    Worker->>R2: Persist updated PartialTranslationState

    alt More batches remain
        Worker->>Queue: Enqueue continuation job (reason: 'continue', sourceHash)
    else All batches complete
        Worker->>Worker: Render final translated HTML
        Worker->>R2: Write StoredTranslatedResponse (final) and delete PartialTranslationState
        Worker->>Cache: Populate Cache API with final result
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰 I hop through HTML, one batch at a time,
I stash small crumbs in R2 like rhyme,
With a hash I know where to pick up the trace,
I queue one more job and tidy the place,
Hooray — translations finished, neat and sublime!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Chunk translation queue jobs' accurately reflects the main objective of splitting translation queue processing into batches and implementing continuation jobs.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/chunk-translation-queue

Comment @coderabbitai help to get the list of available commands and usage tips.

@riderx riderx marked this pull request as ready for review May 1, 2026 17:05
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
apps/translation-worker/src/index.ts (2)

1177-1186: 💤 Low value

Consider using nullish coalescing assignment.

Static analysis suggests using ??= operator for simpler code.

♻️ Optional refactor using ??=
   let state = await readPartialTranslationState(requestUrl, locale, sourceHash)
-  if (!state) {
-    state = {
-      cacheVersion: TRANSLATION_CACHE_VERSION,
-      sourceHash,
-      locale,
-      translatedBatches: [],
-      updatedAt: Date.now(),
-    }
-  }
+  state ??= {
+    cacheVersion: TRANSLATION_CACHE_VERSION,
+    sourceHash,
+    locale,
+    translatedBatches: [],
+    updatedAt: Date.now(),
+  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/translation-worker/src/index.ts` around lines 1177 - 1186, The code
initializes state from readPartialTranslationState(...) and then assigns a
default object if falsy; replace that pattern with a nullish coalescing
assignment to simplify it: use the existing variable name state with the ??=
operator so that state ??= { cacheVersion: TRANSLATION_CACHE_VERSION,
sourceHash, locale, translatedBatches: [], updatedAt: Date.now() } after the
call to readPartialTranslationState(requestUrl, locale, sourceHash) — keep
references to readPartialTranslationState, state, and TRANSLATION_CACHE_VERSION
to locate and update the assignment.

669-677: 💤 Low value

Consider simplifying nested template literal.

Static analysis flagged the nested template literal at line 670. While functional, this can be simplified for readability.

♻️ Optional refactor to simplify
 function errorMessage(error: unknown): string {
-  if (error instanceof Error) return `${error.name}: ${error.message}${error.stack ? `\n${error.stack}` : ''}`
+  if (error instanceof Error) {
+    const base = `${error.name}: ${error.message}`
+    return error.stack ? `${base}\n${error.stack}` : base
+  }
   if (typeof error === 'string') return error
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/translation-worker/src/index.ts` around lines 669 - 677, The nested
template literal inside errorMessage makes the string construction harder to
read; simplify by building the message in steps instead of embedding a template
inside a template. In the errorMessage function, create a base string like
`${error.name}: ${error.message}` and, if error.stack exists, append '\n' +
error.stack (or use string concatenation) before returning; update the branch
for error instanceof Error accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@apps/translation-worker/src/index.ts`:
- Around line 1177-1186: The code initializes state from
readPartialTranslationState(...) and then assigns a default object if falsy;
replace that pattern with a nullish coalescing assignment to simplify it: use
the existing variable name state with the ??= operator so that state ??= {
cacheVersion: TRANSLATION_CACHE_VERSION, sourceHash, locale, translatedBatches:
[], updatedAt: Date.now() } after the call to
readPartialTranslationState(requestUrl, locale, sourceHash) — keep references to
readPartialTranslationState, state, and TRANSLATION_CACHE_VERSION to locate and
update the assignment.
- Around line 669-677: The nested template literal inside errorMessage makes the
string construction harder to read; simplify by building the message in steps
instead of embedding a template inside a template. In the errorMessage function,
create a base string like `${error.name}: ${error.message}` and, if error.stack
exists, append '\n' + error.stack (or use string concatenation) before
returning; update the branch for error instanceof Error accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 86d2cdb4-8aa9-4a62-8bc0-f28dac913e50

📥 Commits

Reviewing files that changed from the base of the PR and between 9d3650f and 0a5bfea.

📒 Files selected for processing (2)
  • apps/translation-worker/src/index.ts
  • apps/translation-worker/wrangler.jsonc

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0a5bfea484

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread apps/translation-worker/src/index.ts Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
apps/translation-worker/src/index.ts (1)

605-613: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

MAX_BATCH_CHARS is still bypassed by a single oversized segment.

Because Line 607 only rolls the batch when currentBatch.length > 0, the first segment in a batch can be arbitrarily large. A long text node will still go through as one huge model call, which keeps the original timeout/wall-time failure mode alive for pages with large paragraphs or blobs.

Please split oversize segments before batching, or at minimum detect and fail them explicitly with a dedicated path instead of silently violating the cap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/translation-worker/src/index.ts` around lines 605 - 613, The batching
loop allows a single segment to exceed MAX_BATCH_CHARS because it only rolls the
batch when currentBatch.length > 0; update the logic in the loop over segments
to detect oversized segment.text and either (A) split segment.text into chunks
no larger than MAX_BATCH_CHARS and feed those chunks into the existing batching
logic (ensuring currentBatch/currentLength are updated per chunk), or (B)
explicitly surface a clear failure path when a single segment exceeds
MAX_BATCH_CHARS; modify the code around the for (const segment of segments) loop
and the usage of MAX_BATCH_CHARS, currentBatch, and currentLength so no
individual pushed item can bypass the character cap.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/translation-worker/src/index.ts`:
- Around line 1132-1133: The localized redirect created by
localizeRedirect(originResponse, requestUrl, locale) is never persisted to
cacheKey before the job is marked complete, causing perpetual MISSes; update the
isRedirect handling in the function that uses isRedirect/localizeRedirect to
store the localized redirect into the cache (using the same cacheKey used for
this job) and only then return the response object (or true where applicable).
Apply the same persistence step for the other early-return path referenced
around the 1162–1163 spot so that any 30x canonicalizations write the localized
redirect to cacheKey before completing the job.

---

Outside diff comments:
In `@apps/translation-worker/src/index.ts`:
- Around line 605-613: The batching loop allows a single segment to exceed
MAX_BATCH_CHARS because it only rolls the batch when currentBatch.length > 0;
update the logic in the loop over segments to detect oversized segment.text and
either (A) split segment.text into chunks no larger than MAX_BATCH_CHARS and
feed those chunks into the existing batching logic (ensuring
currentBatch/currentLength are updated per chunk), or (B) explicitly surface a
clear failure path when a single segment exceeds MAX_BATCH_CHARS; modify the
code around the for (const segment of segments) loop and the usage of
MAX_BATCH_CHARS, currentBatch, and currentLength so no individual pushed item
can bypass the character cap.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 25b2d30a-7de1-4266-ad27-70536c4154fa

📥 Commits

Reviewing files that changed from the base of the PR and between 0a5bfea and b697e08.

📒 Files selected for processing (1)
  • apps/translation-worker/src/index.ts

Comment thread apps/translation-worker/src/index.ts
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fe8b46ede5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread apps/translation-worker/src/index.ts
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/translation-worker/src/index.ts`:
- Around line 1284-1290: The branch handling source.type === 'response'
currently only persists redirects and always clears the pending marker and
returns true; update it to distinguish stable terminal responses (e.g.,
localized 404s) from transient failures and persist stable non-redirect
responses exactly like redirects: detect non-redirect but stable responses, call
toCachedResponse(source.response.clone()) and await cacheTranslatedResponse(env,
requestUrl, locale, cacheKey, cachedResponse) before calling
deletePartialTranslationStateSafely(env, requestUrl, locale) and returning true;
keep transient failures from being marked complete (do not clear the pending
marker or return true for them). Ensure you use the existing helpers isRedirect,
toCachedResponse, cacheTranslatedResponse, and
deletePartialTranslationStateSafely and preserve requestUrl, locale, cacheKey,
and env in the implementation.
- Around line 1321-1337: When scheduling a continuation job, refresh the pending
TTL so the partial-translation pending marker doesn't expire while continuation
batches are running; modify the continuation branch around
writePartialTranslationState and env.TRANSLATION_QUEUE.send (the block that sets
cacheVersion/locale/sourceHash/translatedBatches/updatedAt and then calls
env.TRANSLATION_QUEUE.send) to also set/extend a pending expiry (e.g.,
pendingUntil or pendingExpires = Date.now() + PENDING_PENDING_TTL_MS) and
persist that field in writePartialTranslationState before enqueuing the
continuation; ensure the same constant used for pending TTL (create or reuse
PENDING_PENDING_TTL_MS or a descriptive name) is applied so fresh requests see
the marker and do not enqueue duplicate miss/stale jobs.
- Around line 1467-1475: The current code can throw when calling
caches.default.put(...) (or when enqueueTranslationSafely(...) runs) and that
propagates, turning a valid R2 hit into a fallback; wrap the cache repopulation
and the enqueue call in their own try/catch blocks so any errors are
swallowed/logged but do not prevent returning the storedResponse. Specifically,
in the block after readStoredTranslatedResponse and readTranslatedAt, call await
caches.default.put(cacheKey, storedResponse.clone()) inside a try { ... } catch
(err) { /* log */ } and likewise call enqueueTranslationSafely(env, requestUrl,
locale, 'stale') inside a separate try/catch so failures there don’t throw;
finally always return withResponseHeaders(storedResponse, isStale ? 'STALE' :
'HIT', isHead).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b48b2e01-8511-42ca-9cd4-d89f8bbaff45

📥 Commits

Reviewing files that changed from the base of the PR and between b697e08 and f663f2b.

📒 Files selected for processing (3)
  • .github/workflows/deploy-translation.yml
  • apps/translation-worker/src/index.ts
  • apps/translation-worker/wrangler.jsonc
🚧 Files skipped from review as they are similar to previous changes (1)
  • apps/translation-worker/wrangler.jsonc

Comment thread apps/translation-worker/src/index.ts Outdated
Comment thread apps/translation-worker/src/index.ts
Comment thread apps/translation-worker/src/index.ts
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4e8852dcda

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread apps/translation-worker/src/index.ts Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/translation-worker/src/index.ts`:
- Around line 350-355: canonicalTranslationStoreUrl currently drops the entire
query string which causes different query variants to alias to the same R2 store
key (while fetchEnglishOrigin preserves the search), so change
canonicalTranslationStoreUrl to use a deterministic normalized-query policy
instead of blanking search: parse requestUrl.searchParams, filter out
volatile/tracking params (e.g., utm_*, session ids) or allowlist only the query
keys that affect translation, sort the remaining keys and values
deterministically, rebuild storeUrl.search from that canonicalized string, and
return storeUrl.toString(); reference canonicalTranslationStoreUrl and ensure
its normalization policy is consistent with how fetchEnglishOrigin treats
queries.
- Around line 1316-1320: The current branch calls cacheTranslatedResponse() for
any stable terminal Response which will call writeStoredTranslatedResponse()
that uses response.text(), corrupting non-text/binary bodies; update the logic
in the branch that calls toCachedResponse/cacheTranslatedResponse to first
inspect source.response.headers.get('content-type') and only use the existing
text-path for textual content types (e.g., text/*, application/json,
application/xml, or those with charset=utf-8); for non-text/binary types, change
toCachedResponse and writeStoredTranslatedResponse to read the body as raw bytes
(response.arrayBuffer()), encode it (e.g., base64) and persist a flag/metadata
indicating binary encoding so reads decode properly, or simply skip caching
binary responses—ensure the change references isStableTerminalResponse,
toCachedResponse, cacheTranslatedResponse and writeStoredTranslatedResponse so
the text vs binary handling is implemented consistently.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f584e077-ebbb-4dd4-b905-d7302abbf7df

📥 Commits

Reviewing files that changed from the base of the PR and between 4e8852d and c9fec9d.

📒 Files selected for processing (1)
  • apps/translation-worker/src/index.ts

Comment thread apps/translation-worker/src/index.ts
Comment thread apps/translation-worker/src/index.ts
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented May 1, 2026

@riderx riderx merged commit bbfb26f into main May 1, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant