Skip to content

fix(router): coalesce uses unique jobIds + name-as-coalesceKey to stop silently dropping events#1230

Merged
zbigniewsobiecki merged 1 commit intodevfrom
fix/coalesce-unique-jobids-dedup-key
Apr 29, 2026
Merged

fix(router): coalesce uses unique jobIds + name-as-coalesceKey to stop silently dropping events#1230
zbigniewsobiecki merged 1 commit intodevfrom
fix/coalesce-unique-jobids-dedup-key

Conversation

@zbigniewsobiecki
Copy link
Copy Markdown
Member

🚨 Live regression fix

PR #1226's coalesce flow silently drops PM status-changed events when a prior coalesced job for the same (projectId, workItemId) is in any state other than 'delayed' / 'waiting'. Verified live on prod 2026-04-29 16:13:25 UTC: user moved MNG-422 from planning to splitting; webhook decision logged Coalesced dispatch scheduled: splitting agent for work item MNG-422; no worker ever spawned, splitting agent silently dropped.

Root cause

scheduleCoalescedJob reused a deterministic jobId = coalesce:${coalesceKey}. BullMQ's add(name, data, { jobId }) is a silent no-op when a job with that id already exists in the completed/failed/active set, and the queue keeps completed jobs for 24h via removeOnComplete: { age: 86400 }. The if/else-if chain handled 'delayed' | 'waiting' | 'active' but FELL THROUGH for 'completed' / 'failed' / 'paused' / 'waiting-children' — silently dropping the new event. And even the 'active' early return (activeExists: true) caused the caller to drop the event with a "Coalesced dispatch skipped: active job already running" decision reason — also wrong, because the user's status change is real new intent.

Fix

  • Each schedule produces a UNIQUE jobId (coalesce_<safeKey>_<ts>_<rand>). BullMQ never silently drops it.
  • coalesceKey is stored as the BullMQ "job name" — the supersede pass reads only delayed/waiting jobs and filters by name to find prior pending events to remove.
  • Active/completed/failed prior jobs are NOT consulted: an active job is busy with the previous unit of work and the new event becomes its own delayed dispatch behind it; completed/failed jobs are done and the new event is real new intent.
  • The new jobId format is colon-free because BullMQ rejects custom ids with colons unless they have exactly 3 colon-separated parts (legacy repeatable-job compatibility). The colon-free form also avoids the Docker-container-name issue patched earlier today.
  • Caller (webhook-processor.ts) drops the activeExists branch entirely. Supersede + lock-cleanup loop preserved for delayed/waiting case.

Test plan

  • Unit: tests/unit/router/queue.test.ts — 8 tests covering no-prior-job, delayed/waiting supersede, COMPLETED/FAILED/ACTIVE non-blocking (the headline MNG-422 regression pins), name-filter-isolation, unique-id-per-call, colon-free-format.
  • Unit: tests/unit/router/webhook-processor.test.ts — existing activeExists test rewritten to pin the new contract: active jobs do NOT block, locks ARE marked, decisionReason is "scheduled" not "skipped".
  • Integration: tests/integration/coalesce-bullmq.test.ts — real-Redis tests updated for the unique-id + name-filter contract; supersede on delayed/waiting verified end-to-end. Completed/failed scenarios documented as covered by unit suite (moving real BullMQ jobs to completed/failed requires worker-lock-token plumbing not worth duplicating).
  • Full unit suite: 475 files / 8715 tests green.
  • Integration: coalesce-bullmq.test.ts 4/4 green.
  • Typecheck + lint clean.
  • Post-deploy: move a Linear issue through planning → splitting on prod ucho; verify both agents fire (planning runs; while it's still running, move to splitting; splitting fires once the planning worker exits OR fires its own delayed dispatch independently after the 10s window).

🤖 Generated with Claude Code

…p silently dropping events

Live regression on prod 2026-04-29: user moved Linear issue MNG-422 from
planning to splitting at 16:13:25. Webhook decision logged "Coalesced
dispatch scheduled: splitting agent for work item MNG-422" — but no
worker ever spawned. Splitting agent silently dropped.

Root cause: `scheduleCoalescedJob` in `src/router/queue.ts` reused a
deterministic `jobId = coalesce:${coalesceKey}`. BullMQ's
`add(name, data, { jobId })` is a silent no-op when a job with that id
already exists in the completed/failed/active set. Since the queue
keeps completed jobs for 24h via `removeOnComplete: { age: 86400 }`,
ANY new event for the same coalesceKey within 24h after a previous
coalesced job either (a) collided with the prior 'completed'/'failed'
entry and was silently dropped, or (b) collided with an 'active' entry
and was either silently dropped (if the if/else-if chain fell through)
or explicitly rejected via the `activeExists: true` early return —
which the caller then logged as "Coalesced dispatch skipped: active
job already running" and dropped the event entirely. The user's
splitting webhook hit one of these paths.

Fix: switch to UNIQUE jobIds per call, store the coalesceKey as the
BullMQ "job name" instead. Supersede pass reads only delayed/waiting
jobs (the dedup target — multiple webhooks within the 10s window for
the same `(projectId, workItemId)`). Active/completed/failed prior
jobs are NOT consulted: an active job is busy doing the previous unit
of work and the new event becomes its own delayed dispatch behind it;
completed/failed jobs are done and the new event is real new intent.

The new jobId format is colon-free (`coalesce_<safeKey>_<ts>_<rand>`)
because BullMQ rejects custom ids with colons unless they have exactly
3 colon-separated parts (legacy repeatable-job compatibility); a
4-part timestamp-suffixed id would be rejected. The colon-free form
is also Docker-container-name-safe — earlier today's hotfix at
`src/router/container-manager.ts:485` had to sanitize `:` from the
deterministic ids for the same reason.

Caller (`src/router/webhook-processor.ts`) drops the `activeExists`
branch — active jobs no longer block new schedules. The supersede
branch + lock-cleanup loop continues to handle the
in-memory-lock-orphan case for any superseded delayed/waiting job.

Test coverage:
- `tests/unit/router/queue.test.ts`: 8 tests covering no-prior-job,
  delayed supersede, waiting supersede, COMPLETED-doesn't-block (the
  headline MNG-422 regression pin), FAILED-doesn't-block, ACTIVE-
  doesn't-block, name-filter-isolation, unique-jobId-per-call,
  colon-free-format.
- `tests/unit/router/webhook-processor.test.ts`: existing
  `activeExists` test rewritten to pin the new contract — active
  jobs do NOT block, locks ARE marked, decisionReason is "scheduled"
  not "skipped".
- `tests/integration/coalesce-bullmq.test.ts`: real-Redis tests
  updated to mirror the unique-id + name-filter contract; supersede
  on delayed/waiting verified end-to-end. Completed/failed scenarios
  documented as covered by the unit suite (moving real BullMQ jobs to
  completed/failed requires worker-lock-token plumbing not worth
  duplicating here).

Full unit suite green (475 files / 8715 tests). Integration suite
green for coalesce-bullmq.test.ts (4/4). Typecheck + lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@zbigniewsobiecki zbigniewsobiecki merged commit 096fd40 into dev Apr 29, 2026
9 checks passed
@zbigniewsobiecki zbigniewsobiecki deleted the fix/coalesce-unique-jobids-dedup-key branch April 29, 2026 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant