fix(router): coalesce uses unique jobIds + name-as-coalesceKey to stop silently dropping events #1230
Merged
zbigniewsobiecki merged 1 commit into dev on Apr 29, 2026
Live regression on prod 2026-04-29: user moved Linear issue MNG-422 from
planning to splitting at 16:13:25 UTC. Webhook decision logged "Coalesced
dispatch scheduled: splitting agent for work item MNG-422", but no
worker ever spawned. The splitting agent was silently dropped.
Root cause: `scheduleCoalescedJob` in `src/router/queue.ts` reused a
deterministic `jobId = coalesce:${coalesceKey}`. BullMQ's
`add(name, data, { jobId })` is a silent no-op when a job with that id
already exists in the completed/failed/active set. Because the queue
keeps completed jobs for 24h via `removeOnComplete: { age: 86400 }`,
ANY new event for the same coalesceKey within 24h of a previous
coalesced job took one of two paths:
(a) it collided with the prior 'completed'/'failed' entry and was
silently dropped; or
(b) it collided with an 'active' entry and was either silently dropped
(if the if/else-if chain fell through) or explicitly rejected via the
`activeExists: true` early return, which the caller then logged as
"Coalesced dispatch skipped: active job already running" before
dropping the event entirely.
The user's splitting webhook hit one of these paths.
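The failure mode can be reproduced with a toy in-memory model. This is a sketch, not BullMQ: `ToyQueue`, its methods, the boolean return value, and the `proj-1:MNG-422` key shape are all hypothetical. They only mimic the two behaviours described above: `add` ignores an id that already exists, and completed jobs stay retained.

```typescript
// Toy model of the failure mode. NOT BullMQ: every name here is hypothetical.
type JobState = "delayed" | "waiting" | "active" | "completed" | "failed";

interface ToyJob {
  id: string;
  name: string;
  state: JobState;
}

class ToyQueue {
  private jobs = new Map<string, ToyJob>();

  // Mimics the behaviour described above: if the id is already known
  // (in any state, including retained completed jobs), nothing is enqueued.
  // The boolean return exists only so the sketch can show what happened.
  add(name: string, jobId: string): boolean {
    if (this.jobs.has(jobId)) return false; // silent no-op
    this.jobs.set(jobId, { id: jobId, name, state: "delayed" });
    return true;
  }

  // Simulates a worker finishing a job whose entry is then retained.
  complete(jobId: string): void {
    const job = this.jobs.get(jobId);
    if (job) job.state = "completed";
  }

  // Number of jobs that would still be dispatched to a worker.
  countPending(): number {
    let n = 0;
    this.jobs.forEach((job) => {
      if (job.state === "delayed" || job.state === "waiting") n++;
    });
    return n;
  }
}

// Old scheme: deterministic id derived from the coalesce key.
const oldJobId = (coalesceKey: string): string => `coalesce:${coalesceKey}`;

const queue = new ToyQueue();
const key = "proj-1:MNG-422"; // assumed key shape, for illustration only

const firstAccepted = queue.add("planning", oldJobId(key));
queue.complete(oldJobId(key)); // planning finished; entry is retained
const secondAccepted = queue.add("splitting", oldJobId(key));

console.log({ firstAccepted, secondAccepted, pending: queue.countPending() });
```

In this model the second `add` is ignored and nothing is left pending, which matches the observed symptom: a "scheduled" log line with no worker behind it.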
Fix: switch to UNIQUE jobIds per call, store the coalesceKey as the
BullMQ "job name" instead. Supersede pass reads only delayed/waiting
jobs (the dedup target — multiple webhooks within the 10s window for
the same `(projectId, workItemId)`). Active/completed/failed prior
jobs are NOT consulted: an active job is busy doing the previous unit
of work and the new event becomes its own delayed dispatch behind it;
completed/failed jobs are done and the new event is real new intent.
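The new contract can be sketched on the same kind of toy queue. This is an illustrative model under stated assumptions, not the code in `src/router/queue.ts`: `ToyQueue`, `scheduleCoalesced`, the synchronous API, and the counter used in place of the `<ts>_<rand>` suffix are all hypothetical (real BullMQ calls are asynchronous).

```typescript
// Sketch of unique ids + name-as-coalesceKey. NOT BullMQ: names are hypothetical.
type JobState = "delayed" | "waiting" | "active" | "completed" | "failed";

interface ToyJob {
  id: string;
  name: string; // holds the coalesceKey under the new scheme
  state: JobState;
}

class ToyQueue {
  readonly jobs = new Map<string, ToyJob>();

  add(name: string, jobId: string): boolean {
    if (this.jobs.has(jobId)) return false; // duplicate ids are still ignored
    this.jobs.set(jobId, { id: jobId, name, state: "delayed" });
    return true;
  }

  getJobs(states: JobState[]): ToyJob[] {
    const out: ToyJob[] = [];
    this.jobs.forEach((job) => {
      if (states.indexOf(job.state) !== -1) out.push(job);
    });
    return out;
  }

  remove(jobId: string): void {
    this.jobs.delete(jobId);
  }
}

let seq = 0; // stand-in for the <ts>_<rand> suffix; only uniqueness matters here

function scheduleCoalesced(
  queue: ToyQueue,
  coalesceKey: string
): { jobId: string; superseded: string[] } {
  // Supersede pass: only pending jobs with the same name are candidates.
  // Active/completed/failed jobs are never consulted.
  const superseded: string[] = [];
  for (const job of queue.getJobs(["delayed", "waiting"])) {
    if (job.name === coalesceKey) {
      queue.remove(job.id);
      superseded.push(job.id);
    }
  }
  // Unique, colon-free id per call, so add() can never collide with a prior job.
  const safeKey = coalesceKey.replace(/[^A-Za-z0-9_-]/g, "_");
  const jobId = `coalesce_${safeKey}_${++seq}`;
  queue.add(coalesceKey, jobId);
  return { jobId, superseded };
}

const queue = new ToyQueue();
const key = "proj-1:MNG-422"; // assumed key shape, for illustration only

const a = scheduleCoalesced(queue, key); // no prior job
const b = scheduleCoalesced(queue, key); // supersedes a (still delayed)
queue.jobs.get(b.jobId)!.state = "completed"; // worker finished, entry retained
const c = scheduleCoalesced(queue, key); // completed job does not block
queue.jobs.get(c.jobId)!.state = "active";
const d = scheduleCoalesced(queue, key); // active job does not block either
```

Only `b` reports a superseded job (`a`); `c` and `d` are both enqueued even though earlier jobs with the same name are completed or active.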
The new jobId format is colon-free (`coalesce_<safeKey>_<ts>_<rand>`)
because BullMQ rejects custom ids with colons unless they have exactly
3 colon-separated parts (legacy repeatable-job compatibility); a
4-part timestamp-suffixed id would be rejected. The colon-free form
is also Docker-container-name-safe — earlier today's hotfix at
`src/router/container-manager.ts:485` had to sanitize `:` from the
deterministic ids for the same reason.
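A minimal sketch of an id builder in the `coalesce_<safeKey>_<ts>_<rand>` shape follows. The helper name `buildCoalesceJobId`, the sanitisation rule, and the length of the random suffix are assumptions for illustration; the actual implementation may differ. The character set is chosen to stay within what Docker accepts for container names (`[a-zA-Z0-9][a-zA-Z0-9_.-]*`).

```typescript
// Hypothetical helper: the sanitisation rule and suffix length are assumptions.
function buildCoalesceJobId(coalesceKey: string, now: number = Date.now()): string {
  // Replace anything outside [A-Za-z0-9_.-] (including ':') with '_'.
  const safeKey = coalesceKey.replace(/[^A-Za-z0-9_.-]/g, "_");
  // Base-36 random suffix, padded so it is always 8 characters long.
  const rand = (Math.random().toString(36).slice(2) + "00000000").slice(0, 8);
  return `coalesce_${safeKey}_${now}_${rand}`;
}

// Two calls with the same key and the same (arbitrary, fixed) timestamp
// still differ in the random suffix, barring an unlikely collision.
const id1 = buildCoalesceJobId("proj-1:MNG-422", 1700000000000);
const id2 = buildCoalesceJobId("proj-1:MNG-422", 1700000000000);

console.log(id1);
console.log(id2);
```

The timestamp alone would not separate two webhooks arriving in the same millisecond, which is why the sketch keeps a random component as well.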
Caller (`src/router/webhook-processor.ts`) drops the `activeExists`
branch: active jobs no longer block new schedules. The supersede
branch and lock-cleanup loop continue to handle the
in-memory-lock-orphan case for any superseded delayed/waiting job.
Test coverage:
- `tests/unit/router/queue.test.ts`: 8 tests covering no-prior-job,
delayed supersede, waiting supersede, COMPLETED-doesn't-block (the
headline MNG-422 regression pin), FAILED-doesn't-block, ACTIVE-
doesn't-block, name-filter-isolation, unique-jobId-per-call,
colon-free-format.
- `tests/unit/router/webhook-processor.test.ts`: existing
`activeExists` test rewritten to pin the new contract — active
jobs do NOT block, locks ARE marked, decisionReason is "scheduled"
not "skipped".
- `tests/integration/coalesce-bullmq.test.ts`: real-Redis tests
updated to mirror the unique-id + name-filter contract; supersede
on delayed/waiting verified end-to-end. Completed/failed scenarios
documented as covered by the unit suite (moving real BullMQ jobs to
completed/failed requires worker-lock-token plumbing not worth
duplicating here).
Full unit suite green (475 files / 8715 tests). Integration suite
green for coalesce-bullmq.test.ts (4/4). Typecheck + lint clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report: ✅ All modified and coverable lines are covered by tests.
🚨 Live regression fix
PR #1226's coalesce flow silently drops PM `status-changed` events when a prior coalesced job for the same `(projectId, workItemId)` is in any state other than `'delayed'`/`'waiting'`. Verified live on prod 2026-04-29 16:13:25 UTC: user moved `MNG-422` from planning to splitting; webhook decision logged `Coalesced dispatch scheduled: splitting agent for work item MNG-422`; no worker ever spawned, splitting agent silently dropped.

Root cause

`scheduleCoalescedJob` reused a deterministic `jobId = coalesce:${coalesceKey}`. BullMQ's `add(name, data, { jobId })` is a silent no-op when a job with that id already exists in the completed/failed/active set, and the queue keeps completed jobs for 24h via `removeOnComplete: { age: 86400 }`. The if/else-if chain handled `'delayed' | 'waiting' | 'active'` but FELL THROUGH for `'completed'`/`'failed'`/`'paused'`/`'waiting-children'`, silently dropping the new event. Even the `'active'` early return (`activeExists: true`) caused the caller to drop the event with a "Coalesced dispatch skipped: active job already running" decision reason. That is also wrong, because the user's status change is real new intent.

Fix

- Each call uses a unique jobId (`coalesce_<safeKey>_<ts>_<rand>`). BullMQ never silently drops it.
- `coalesceKey` is stored as the BullMQ "job name"; the supersede pass reads only `delayed`/`waiting` jobs and filters by name to find prior pending events to remove.
- Caller (`webhook-processor.ts`) drops the `activeExists` branch entirely. Supersede + lock-cleanup loop preserved for delayed/waiting case.

Test plan

- `tests/unit/router/queue.test.ts`: 8 tests covering no-prior-job, delayed/waiting supersede, COMPLETED/FAILED/ACTIVE non-blocking (the headline MNG-422 regression pins), name-filter-isolation, unique-id-per-call, colon-free-format.
- `tests/unit/router/webhook-processor.test.ts`: existing `activeExists` test rewritten to pin the new contract: active jobs do NOT block, locks ARE marked, decisionReason is "scheduled" not "skipped".
- `tests/integration/coalesce-bullmq.test.ts`: real-Redis tests updated for the unique-id + name-filter contract; supersede on delayed/waiting verified end-to-end. Completed/failed scenarios documented as covered by unit suite (moving real BullMQ jobs to completed/failed requires worker-lock-token plumbing not worth duplicating).
- `coalesce-bullmq.test.ts` 4/4 green.
- … ucho; verify both agents fire (planning runs; while it's still running, move to splitting; splitting fires once the planning worker exits OR fires its own delayed dispatch independently after the 10s window).

🤖 Generated with Claude Code