From dd88672b682545e77c777f3a6d65734a5a468789 Mon Sep 17 00:00:00 2001 From: Zbigniew Sobiecki Date: Sun, 26 Apr 2026 16:20:04 +0000 Subject: [PATCH 1/7] docs(spec/plans): add spec 015 + plans for router dispatch failure recovery Spec captures the silent black-hole bug class verified live on 2026-04-26 (ucho/MNG-350): a transient capacity miss or Docker error during worker spawn turns a webhook-driven job into a permanently failed BullMQ entry while stranding the work-item / agent-type locks for up to 30 minutes, silently rejecting subsequent webhooks for the same work item. Decomposed into two plans with safety-net-first sequencing: plan 1 hooks the BullMQ failed event to release locks on every dispatch failure path; plan 2 replaces the throw-on-capacity with a wait-for-slot semaphore, adds bounded retry with exponential backoff, and a transient/terminal error classifier. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../1-failed-event-lock-compensation.md | 219 ++++++++++++++++ .../2-wait-for-slot-and-retry-classifier.md | 246 ++++++++++++++++++ .../_coverage.md | 39 +++ ...15-router-job-dispatch-failure-recovery.md | 130 +++++++++ 4 files changed, 634 insertions(+) create mode 100644 docs/plans/015-router-job-dispatch-failure-recovery/1-failed-event-lock-compensation.md create mode 100644 docs/plans/015-router-job-dispatch-failure-recovery/2-wait-for-slot-and-retry-classifier.md create mode 100644 docs/plans/015-router-job-dispatch-failure-recovery/_coverage.md create mode 100644 docs/specs/015-router-job-dispatch-failure-recovery.md diff --git a/docs/plans/015-router-job-dispatch-failure-recovery/1-failed-event-lock-compensation.md b/docs/plans/015-router-job-dispatch-failure-recovery/1-failed-event-lock-compensation.md new file mode 100644 index 00000000..6606b36f --- /dev/null +++ b/docs/plans/015-router-job-dispatch-failure-recovery/1-failed-event-lock-compensation.md @@ -0,0 +1,219 @@ +--- +id: 015 +slug: router-job-dispatch-failure-recovery +plan: 1 +plan_slug: failed-event-lock-compensation +level: plan +parent_spec: docs/specs/015-router-job-dispatch-failure-recovery.md +depends_on: [] +status: pending +--- + +# 015/1: Failed-event lock compensation + decision-reason taxonomy + +> Part 1 of 2 in the 015-router-job-dispatch-failure-recovery plan. See [parent spec](../../specs/015-router-job-dispatch-failure-recovery.md). + +## Summary + +This plan closes the **stranded-lock** half of the spec's bug class. It hooks BullMQ's existing `worker.on('failed')` event on both queues to release every in-memory lock entry that the webhook → enqueue path acquires (work-item lock, agent-type lock, recently-dispatched dedup mark) when dispatch ultimately fails. After this plan ships, no dispatch failure — capacity throw, Docker spawn error, or any future throw site — can leave a wedged lock that mis-rejects subsequent webhooks for the same work item. + +It also splits the webhook decision-reason vocabulary that Step 8's lock check emits into three distinguishable states: **Job queued** (success path, unchanged wording aside from existing `Job queued: ...`), **Awaiting worker slot** (lock held *and* a corresponding active worker / queued job is reachable — the healthy in-flight case), and **Work item locked (no active dispatch)** (lock held but no active worker and no waiting job — the wedged-lock canary). 
The third reason is a regression invariant: after this plan ships, its presence in webhook logs means a code path acquired a lock without registering its compensation — which the plan's tests prevent. + +This plan does NOT change the dispatch contract — `guardedSpawn` still throws on capacity, `attempts: 1` is still set, jobs still die in the failed set after one bad dispatch. The user-visible improvement is "no more wedged locks" — half the original incident is gone. The other half (lost jobs) is delivered by Plan 2. + +**Components delivered:** +- New compensator that maps `(jobData) → (projectId, workItemId, agentType)` and releases all three locks. Lives in a new module under `src/router/`. +- `worker.on('failed')` handler in `src/router/bullmq-workers.ts` extended to invoke the compensator for both `cascade-jobs` and `cascade-dashboard-jobs` queues. +- `src/router/webhook-processor.ts` Step 8 (work-item lock check) emits one of three decision reasons by correlating the lock count with the active-worker registry + the BullMQ queue's waiting/active counts. +- New tests: failed-event compensator releases locks for each lock kind across both queues; decision-reason taxonomy covers all three states. + +**Deferred to later plans in this spec:** +- Wait-for-slot replacement of the capacity throw — Plan 2. +- `attempts > 1` + retry classifier — Plan 2. +- CLAUDE.md update covering both halves of the new contract — Plan 2. +- Replacement of the existing `'processFn throws when at capacity'` test — Plan 2 (this plan adds new tests around the failed-event hook; the throw-on-capacity test stays green here because the throw still happens). + +--- + +## Spec ACs satisfied by this plan + +- Spec AC #4 (no stranded lock after any dispatch failure path) — **full** +- Spec AC #5 (decision reason distinguishes in-flight from wedged-lock) — **full** +- Spec AC #6 (wedged-lock canary never fires under normal operation) — **full** +- Spec AC #7 (manual-run path continues to bypass the lock and dispatch successfully) — **full** (regression test pinning the existing bypass) +- Spec AC #8 (clean-exit path identical to today) — **full** (regression test pinning the existing exit path) +- Spec AC #9 (test coverage updated) — **partial** (adds new failed-event tests; old throw-on-capacity test is replaced by Plan 2) + +--- + +## Depends On + +None. This plan is a strict additive layer on existing dispatch paths. It can ship independently of Plan 2. + +--- + +## Detailed Task List (TDD) + +### 1. Lock compensator module + +**Tests first** (`tests/unit/router/dispatch-compensator.test.ts`): + +- `releaseLocksForFailedJob — releases work-item, agent-type, and recently-dispatched marks for a CascadeJob with all three identifiers` — unit — call compensator with `{ type: 'linear', projectId: 'p1', workItemId: 'w1', /* …enough payload that extractAgentType returns 'implementation' */ }`; assert all three lock-module spies were called with `('p1', 'w1', 'implementation')` (or `('p1', 'implementation')` for agent-type / recently-dispatched). Expected red: `Error: Cannot find module './dispatch-compensator'`. +- `releaseLocksForFailedJob — no-ops cleanly when projectId is null` — unit — pass a job whose `extractProjectIdFromJob` resolves to `null` (e.g. a foreign-provider payload); assert no lock-module spies were called and the function resolved without throw. 
Expected red: same module-not-found error first; after creating the module with a stub that always calls clearWorkItemEnqueued, this test fails with `expect(mockClearWorkItemEnqueued).not.toHaveBeenCalled()`. +- `releaseLocksForFailedJob — releases agent-type-lock + recently-dispatched even when workItemId is undefined` — unit — pass a manual-run job that has `projectId` and `agentType` but no `workItemId`; assert work-item-lock spy was NOT called, agent-type-lock and recently-dispatched spies WERE called. Expected red: as above; or `expect(mockClearWorkItemEnqueued).not.toHaveBeenCalled()` if the implementation calls it unconditionally. +- `releaseLocksForFailedJob — handles a DashboardJob (manual-run) without throwing` — unit — pass `{ type: 'manual-run', projectId, workItemId, agentType }`; assert spies as appropriate. Expected red: type narrowing failure or `extractAgentType` returning undefined. +- `releaseLocksForFailedJob — captureException when an extractor throws` — unit — mock `extractProjectIdFromJob` to throw; assert `captureException` is called and the function still resolves (does not propagate). Expected red: unhandled rejection from the mocked extractor. + +**Implementation** (`src/router/dispatch-compensator.ts`): +- Export `async function releaseLocksForFailedJob(jobData: CascadeJob | DashboardJobData): Promise`. +- Resolve the trio via existing extractors imported from `src/router/worker-env.ts`: `extractProjectIdFromJob`, `extractWorkItemId`, `extractAgentType`. +- Call `clearWorkItemEnqueued(projectId, workItemId, agentType)` only when all three are defined. +- Call `clearAgentTypeEnqueued(projectId, agentType)` whenever `projectId && agentType`. +- Call `clearRecentlyDispatched(projectId, agentType, workItemId)` whenever `projectId && agentType` (we'll add this exported helper to `agent-type-lock.ts` — see section 2). +- Wrap each call in a try/catch that funnels to `captureException` with `tags: { source: 'dispatch_compensator' }` and a structured log; never propagate so a compensator failure doesn't crash the BullMQ worker. + +### 2. New `clearRecentlyDispatched` exported helper + +**Tests first** (`tests/unit/router/agent-type-lock.test.ts` — extend existing file): + +- `clearRecentlyDispatched — removes the dedup entry for a (projectId, agentType, dedupScope) key set by markRecentlyDispatched` — unit — call `markRecentlyDispatched('p1', 'implementation', 'w1')`; assert `wasRecentlyDispatched('p1', 'implementation', 'w1')` returns `true`; call new `clearRecentlyDispatched('p1', 'implementation', 'w1')`; assert `wasRecentlyDispatched(...)` returns `false`. Expected red: `clearRecentlyDispatched is not a function`. +- `clearRecentlyDispatched — no-op when key was not previously marked` — unit — call clear without prior mark; assert no throw. Expected red: same as above. +- `clearRecentlyDispatched — leaves entries for other (agentType, scope) keys untouched` — unit — set marks for two distinct keys, clear one, assert the other still present. Expected red: same as above. + +**Implementation** (`src/router/agent-type-lock.ts`): +- Export a new function `clearRecentlyDispatched(projectId, agentType, dedupScope?)` that deletes the corresponding key from `dedupMap`. +- Do NOT change the existing `markRecentlyDispatched` semantics or the `DEDUP_TTL_MS` value — this helper is purely additive, used only by the failed-event compensator. + +### 3. 
Failed-event hook in BullMQ worker factory + +**Tests first** (`tests/unit/router/bullmq-workers.test.ts` — extend existing file): + +- `worker.on('failed') invokes releaseLocksForFailedJob with job.data` — unit — construct a Worker via `createQueueWorker` with a `processFn` that throws; emit `'failed'` synthetically (or invoke the registered handler directly); assert the compensator spy was called with the job data. Expected red: spy not called (existing `worker.on('failed')` only logs + Sentries today). +- `worker.on('failed') still logs and Sentries on top of compensating` — unit — assert both the existing log + `captureException` calls happen AND the compensator spy is invoked. Expected red: compensator spy not called. +- `worker.on('failed') swallows compensator throws` — unit — mock the compensator to reject; emit `'failed'`; assert the BullMQ worker factory does not propagate the rejection (no unhandled rejection in the test). Expected red: unhandled rejection, or factory under test crashes. +- `worker.on('failed') is wired for both cascade-jobs and cascade-dashboard-jobs queues` — unit — call `startWorkerProcessor()` with two distinct mock `createQueueWorker` returns; assert both registered a `failed` handler that calls the compensator. Expected red: only one handler wired (or none). + +**Implementation** (`src/router/bullmq-workers.ts`): +- Inside `createQueueWorker`'s existing `worker.on('failed', ...)` handler, after the existing logger + `captureException` calls, invoke `releaseLocksForFailedJob(job.data)` if `job` is defined. Wrap in try/catch with `captureException` so a compensator throw does not poison the worker. + +### 4. Three-way decision-reason taxonomy in webhook-processor + +**Tests first** (`tests/unit/router/webhook-processor.test.ts` — extend existing file): + +- `Step 8 — emits 'Awaiting worker slot' when lock count >= 1 AND at least one active worker is registered for (projectId, workItemId, agentType)` — unit — set up `enqueuedMap` to register one entry; mock `getActiveWorkers()` to include a matching worker; expect the returned `decisionReason` to start with `Awaiting worker slot:`. Expected red: today's reason starts with `Work item locked:`. +- `Step 8 — emits 'Awaiting worker slot' when lock count >= 1 AND a matching job is in BullMQ waiting/active state` — integration — push a real BullMQ job (mocked Redis or test connection) into the waiting state; set the in-memory lock; assert the decision reason. Expected red: today's reason as above. +- `Step 8 — emits 'Work item locked (no active dispatch)' when lock count >= 1 AND neither active worker nor queued job exists for the trio` — unit — set the lock entry but ensure both `getActiveWorkers()` returns an empty list and the queue's `getJobs(['waiting','active'])` returns an empty list. Expected red: today's reason wording. +- `Step 8 — preserves existing log fields (source, projectId, workItemId, blockedAgentType, reason)` — unit — capture the log call; assert all fields present. Expected red: log structure changed unintentionally. +- `Step 8 — does not call queue.getJobs when lock count is 0` — unit — assert no queue lookup happens on the happy path (no perf regression). Expected red: an unexpected queue call (which is a sign the implementation does the lookup unconditionally). + +**Implementation** (`src/router/webhook-processor.ts` and a new helper): +- Add a helper, e.g. 
`classifyLockState({ projectId, workItemId, agentType }): Promise<'awaiting-slot' | 'wedged' | 'unknown'>`, in a new module `src/router/lock-state-classifier.ts` (or co-located helper file). The helper: + - Returns `'awaiting-slot'` if `getActiveWorkers()` includes a worker whose `(projectId, workItemId, agentType)` matches OR `jobQueue.getJobs(['waiting','active'])` returns at least one matching job (matched by the same trio extracted via the existing extractors). + - Returns `'wedged'` if the lock count is non-zero per the existing in-memory map and neither correlation matches. + - Returns `'unknown'` only on classifier error (queue lookup throws); falls back to `'awaiting-slot'` for behavioral safety (do NOT mis-emit the wedged-lock canary on an error path). +- In `webhook-processor.ts:182-198`, when `lockStatus.locked` is true, call the classifier and pick one of: + - `Awaiting worker slot: ${lockStatus.reason}` for `awaiting-slot` and `unknown`. + - `Work item locked (no active dispatch): ${lockStatus.reason}` for `wedged`. **Also call `captureException` with a synthetic Error tagged `source: 'wedged_lock_canary'` and structured `extra` so the regression invariant is loud.** +- Keep the existing `result.onBlocked?.()` call path untouched. +- Augment `getActiveWorkers()` in `src/router/active-workers.ts` to return `projectId / workItemId / agentType` for each entry (currently returns only `{jobId, startedAt}`). Keep callers unchanged (extend the shape; don't break existing readers). + +### 5. Active-workers shape extension + +**Tests first** (`tests/unit/router/active-workers.test.ts` — extend existing file): + +- `getActiveWorkers — returns projectId, workItemId, agentType for each tracked worker` — unit — register a worker via `activeWorkers.set` with all three identifiers; assert `getActiveWorkers()` includes them. Expected red: returned shape currently lacks the trio (only `jobId` + `startedAt` today). +- `getActiveWorkers — backwards-compatible callers (existing tests pinning jobId+startedAt) still pass` — unit — existing assertions on `jobId` and `startedAt` continue to pass. Expected red: only fails if the migration removes the old fields. + +**Implementation** (`src/router/active-workers.ts`): +- Extend `getActiveWorkers()`'s return type to `Array<{ jobId: string; startedAt: Date; projectId?: string; workItemId?: string; agentType?: string }>`. Map from the existing `ActiveWorker` shape (the trio is already stored there per the spec context). + +--- + +## Test Plan + +### Unit tests +- [ ] `tests/unit/router/dispatch-compensator.test.ts`: 5 tests covering compensator behavior across job kinds + extractor-throw path +- [ ] `tests/unit/router/agent-type-lock.test.ts`: +3 tests for `clearRecentlyDispatched` +- [ ] `tests/unit/router/bullmq-workers.test.ts`: +4 tests for failed-event hook (logging + compensation + swallowing + both queues) +- [ ] `tests/unit/router/webhook-processor.test.ts`: +5 tests for the three-way decision-reason taxonomy +- [ ] `tests/unit/router/active-workers.test.ts`: +2 tests for the extended `getActiveWorkers` shape +- [ ] `tests/unit/router/lock-state-classifier.test.ts` (new): 4 tests covering `awaiting-slot` / `wedged` / `unknown` branches plus the fallback safety on classifier error + +### Integration tests +- [ ] `tests/integration/router/dispatch-failure-compensation.test.ts` (new): exercises the real BullMQ in-memory + the real lock modules. 
Scenarios: + - Enqueue a job whose processFn throws → assert all three locks are released by `worker.on('failed')` before the test resolves; assert a follow-up webhook for the same trio is NOT blocked. + - Enqueue a job, let it succeed → assert locks are released by the existing exit path (regression test for AC #8). + - Enqueue a manual-run job whose processFn throws → assert the manual-run lock-bypass (no work-item lock acquired in the first place) and that agent-type-lock + recently-dispatched are released. + +### Acceptance tests +- [ ] AC #4: integration test "subsequent webhook not rejected after failed dispatch" +- [ ] AC #5: unit tests for the three decision-reason branches +- [ ] AC #6: wedged-lock branch emits the `wedged_lock_canary` Sentry tag — captured via spy +- [ ] AC #7: manual-run integration test pins the lock-bypass behavior +- [ ] AC #8: clean-exit integration test pins identical behavior +- [ ] AC #9 (partial): new tests added; existing throw-on-capacity test stays green (Plan 2 replaces it) + +--- + +## Manual Verification (for `[manual]`-tagged ACs only) + +n/a — all ACs auto-tested. + +--- + +## Acceptance Criteria (per-plan, testable) + +1. The failed-event compensator releases work-item-lock, agent-type-lock, and recently-dispatched marks for any job whose dispatch fails — for both `cascade-jobs` and `cascade-dashboard-jobs` queues. +2. A webhook for `(projectId, workItemId, agentType)` arriving immediately after a dispatch failure for the same trio is NOT rejected with a `Work item locked` decision reason caused by stranded in-memory state. +3. Webhook decision logs distinguish three states for a held lock: `Job queued` (success), `Awaiting worker slot: …` (in-flight, healthy), `Work item locked (no active dispatch): …` (wedged-lock canary). Each is emitted from the appropriate branch of the lock-state classifier. +4. The wedged-lock branch additionally fires a Sentry capture tagged `source: 'wedged_lock_canary'` so the invariant is observable in production. +5. The dashboard manual-run path continues to bypass the work-item lock and dispatch normally even when the lock count for the trio is non-zero. +6. A clean-exit successful run continues to release locks via the existing `cleanupWorker` path; no double-release race appears. +7. All new/modified code has corresponding tests written before the implementation. +8. `npm run build` passes. +9. `npm test` passes (unit projects). +10. `npm run test:integration` passes for the new integration test. +11. `npm run lint` passes. +12. `npm run typecheck` passes. +13. All documentation listed in this plan's Documentation Impact has been updated. + +--- + +## Documentation Impact (this plan only) + +| File | Change | +|---|---| +| `CHANGELOG.md` | Entry under the next release: "Router: dispatch failures now release in-memory work-item / agent-type / dedup locks via BullMQ failed-event compensation. Webhook decision reasons split into three states (Job queued / Awaiting worker slot / Work item locked — no active dispatch). The third reason is a regression canary." | + +CLAUDE.md is intentionally NOT updated by this plan; Plan 2 ships the unified passage covering both halves of the new contract. Updating it twice would risk in-flight inconsistency. + +--- + +## Out of Scope (this plan) + +- Replacing `guardedSpawn`'s capacity throw with a wait-for-slot semaphore — Plan 2. +- Changing `attempts: 1` defaults on either queue or adding an exponential-backoff config — Plan 2. 
+- The dispatch-error classifier (`UnrecoverableError` vs transient retry-worthy) — Plan 2. +- Replacing the `'processFn throws when at capacity'` test in `tests/unit/router/worker-manager.test.ts` — Plan 2 (this plan leaves it green; it remains a true assertion of current behavior until Plan 2 changes the contract). +- CLAUDE.md update — Plan 2. +- A failed-set inspection UI — out of scope per spec. +- Cleanup of the dead `linear-1777217350854-2qvhjo` job in prod Redis — operational, out of scope per spec. + +--- + +## Progress + + +- [ ] AC #1 (compensator releases all 3 locks on both queues) +- [ ] AC #2 (subsequent webhook not blocked by stranded state) +- [ ] AC #3 (three decision-reason branches) +- [ ] AC #4 (wedged-lock Sentry canary) +- [ ] AC #5 (manual-run bypass regression) +- [ ] AC #6 (clean-exit regression) +- [ ] AC #7 (TDD discipline) +- [ ] AC #8 (build) +- [ ] AC #9 (unit tests) +- [ ] AC #10 (integration test) +- [ ] AC #11 (lint) +- [ ] AC #12 (typecheck) +- [ ] AC #13 (docs) diff --git a/docs/plans/015-router-job-dispatch-failure-recovery/2-wait-for-slot-and-retry-classifier.md b/docs/plans/015-router-job-dispatch-failure-recovery/2-wait-for-slot-and-retry-classifier.md new file mode 100644 index 00000000..37f3e24c --- /dev/null +++ b/docs/plans/015-router-job-dispatch-failure-recovery/2-wait-for-slot-and-retry-classifier.md @@ -0,0 +1,246 @@ +--- +id: 015 +slug: router-job-dispatch-failure-recovery +plan: 2 +plan_slug: wait-for-slot-and-retry-classifier +level: plan +parent_spec: docs/specs/015-router-job-dispatch-failure-recovery.md +depends_on: [1-failed-event-lock-compensation.md] +status: pending +--- + +# 015/2: Wait-for-slot capacity, retry budget, error classifier + +> Part 2 of 2 in the 015-router-job-dispatch-failure-recovery plan. See [parent spec](../../specs/015-router-job-dispatch-failure-recovery.md). + +## Summary + +This plan closes the **lost-job** half of the spec's bug class. It replaces the synchronous "throw on capacity" pattern in the dispatch path with an in-process slot-waiter that suspends until a worker slot frees up (bounded by a timeout that surfaces a real Sentry-visible error). It also adds bounded retry-with-backoff semantics to both queues, gated by an error classifier that decides whether a thrown error is transient (Docker daemon hiccup, image-pull rate-limit, container-name race — retry up to a small budget) or terminal (validation failure, missing credentials, fallback image not found — fail fast with no retry). + +After this plan ships, a card moved to a triggering state while the worker pool is at capacity reliably runs the agent as soon as capacity frees, with no further user action and no permanently failed BullMQ entry. A briefly unreachable Docker daemon resolves on retry. A sustained Docker outage exhausts the retry budget within minutes and surfaces a clear failed-run record + Sentry capture — never silent loss. + +This plan also delivers the AC #9 contract change: the existing `'processFn throws when at capacity'` test (`tests/unit/router/worker-manager.test.ts:179`) is **replaced**, not deleted — its assertion flips from "throws" to "awaits a slot, then dispatches when one frees." Plan 1's failed-event compensator stays as the safety net underneath. + +The CLAUDE.md update covering both halves of the new contract (decision-reason taxonomy from Plan 1 + dispatch retry contract from Plan 2) lands here so the documented invariant is consistent. 
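+
+To make the wait-for-slot idea concrete, here is a minimal sketch of the semaphore-style waiter this plan proposes. It is illustrative only, not the shipped module: the capacity check is injected so the snippet stands alone (the real module would consult `getActiveWorkerCount()` and `routerConfig.maxWorkers` directly), and the `SlotWaiter` class shape is an assumption of this sketch, since the plan itself specifies plain module-level functions.
+
+```typescript
+type Waiter = {
+  resolve: () => void;
+  reject: (err: Error) => void;
+  timeoutHandle: ReturnType<typeof setTimeout>;
+};
+
+export class SlotWaiter {
+  private pending: Waiter[] = [];
+
+  // `hasCapacity` stands in for `getActiveWorkerCount() < routerConfig.maxWorkers`.
+  constructor(private readonly hasCapacity: () => boolean) {}
+
+  acquireSlot(opts: { timeoutMs: number }): Promise<void> {
+    if (this.hasCapacity()) return Promise.resolve(); // a slot is free: dispatch immediately
+    return new Promise<void>((resolve, reject) => {
+      const waiter: Waiter = {
+        resolve,
+        reject,
+        timeoutHandle: setTimeout(() => {
+          // Give up: drop this waiter and surface a tagged, Sentry-visible error.
+          this.pending = this.pending.filter((w) => w !== waiter);
+          reject(Object.assign(new Error('Timed out waiting for a worker slot'), {
+            code: 'SLOT_WAIT_TIMEOUT',
+          }));
+        }, opts.timeoutMs),
+      };
+      this.pending.push(waiter); // FIFO queue of suspended dispatches
+    });
+  }
+
+  // Called when a worker container exits; wakes exactly one waiter, oldest first.
+  slotReleased(): void {
+    const next = this.pending.shift();
+    if (!next) return; // no-op when nothing is waiting
+    clearTimeout(next.timeoutHandle);
+    next.resolve();
+  }
+
+  // Shutdown / test cleanup: reject everything still waiting.
+  clearAllWaiters(): void {
+    for (const w of this.pending.splice(0)) {
+      clearTimeout(w.timeoutHandle);
+      w.reject(Object.assign(new Error('Router shutting down'), { code: 'SHUTDOWN' }));
+    }
+  }
+}
+```
+
+Under this sketch, `guardedSpawn` would start with something like `await waiter.acquireSlot({ timeoutMs: routerConfig.slotWaitTimeoutMs })` before calling `spawnWorker`, and `slotReleased()` would be invoked from the container-exit path rather than by the dispatcher (see the Decision note in section 2 of the task list below).
+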
+ +**Components delivered:** +- A small in-house slot-waiter (semaphore-style counter + queued resolvers, ~30 lines, zero new deps) that integrates with `getActiveWorkerCount()` and `routerConfig.maxWorkers`. +- `guardedSpawn` rewritten: instead of throwing on capacity, awaits the slot (bounded timeout); on timeout, throws an explicitly-tagged error that the retry path can act on. +- A dispatch-error classifier `classifyDispatchError(err)` that returns `'transient' | 'terminal'`; terminal errors are wrapped in BullMQ's `UnrecoverableError` so they bypass the retry budget. +- `attempts` raised on both queue defaults with exponential backoff configured (concrete numbers below); per-job overrides stay possible for future use. +- Replacement of the existing capacity-throw test; new tests for slot-waiter + classifier + retry-on-transient + fail-fast-on-terminal + sustained-outage-surfaces-error. +- CLAUDE.md passage updated to document the new dispatch-retry + decision-reason contract. + +**Deferred (out of spec):** +- Worker-pool autoscaling, BullMQ migration, lock-semantics changes, snapshot-fallback redesign, manual-run bypass changes, failed-set UI, startup re-enqueue sweep — all explicit non-goals in the spec. + +--- + +## Spec ACs satisfied by this plan + +- Spec AC #1 (capacity wait → run starts when capacity frees, no permanently failed job) — **full** +- Spec AC #2 (transient Docker error → retry succeeds, no permanently failed job) — **full** +- Spec AC #3 (sustained outage → run marked failed, Sentry captured, no silent loss) — **full** +- Spec AC #9 (test contract change — old throw-on-capacity replaced) — **full** (Plan 1 added new tests; Plan 2 replaces the existing one) + +Regression-tested but not newly satisfied: +- Spec AC #4, #5, #6 (lock-leak / decision reason — already delivered by Plan 1; Plan 2 adds tests confirming the new failure modes don't reintroduce leaks) +- Spec AC #7, #8 (manual-run bypass + clean-exit — Plan 1 pinned these; Plan 2's tests must keep them green) + +--- + +## Depends On + +- Plan 1 (`failed-event-lock-compensation`) — provides the failed-event compensation hook on both queues. Plan 2 introduces new failure modes (slot-wait timeout, retry exhaustion, terminal errors via `UnrecoverableError`); each of those failure paths flows through BullMQ's `'failed'` event and relies on Plan 1's compensator releasing locks. Without Plan 1, Plan 2's new exhaustion path would re-create the wedged-lock symptom. + +--- + +## Detailed Task List (TDD) + +### 1. Slot-waiter primitive + +**Tests first** (`tests/unit/router/slot-waiter.test.ts` — new file): + +- `acquire — resolves immediately when capacity is below max` — unit — set `maxWorkers = 3`; mock `getActiveWorkerCount` to return `1`; call `acquireSlot()`; assert it resolves within the same microtask. Expected red: `Error: Cannot find module './slot-waiter'`. +- `acquire — suspends when at capacity, resolves when a slot frees` — unit — `maxWorkers = 1`, count starts at `1`; call `acquireSlot()`; assert promise is pending; emit a `slotReleased()` notification; assert promise resolves. Expected red: module-not-found, then "promise never resolves" timeout. +- `acquire — bounded timeout rejects with a tagged error if no slot frees in time` — unit — at capacity, call `acquireSlot({ timeoutMs: 50 })`; assert it rejects with `error.code === 'SLOT_WAIT_TIMEOUT'` after ~50ms. Expected red: never rejects (no timeout implemented). 
+- `acquire — multiple waiters resolve FIFO as slots free` — unit — at capacity, queue 3 waiters; emit `slotReleased()` once; assert exactly one waiter resolves and the other two stay pending; emit again; assert the second resolves. Expected red: all three resolve simultaneously, or none. +- `acquire — slotReleased called with no waiters is a no-op` — unit — call `slotReleased()` when nothing is waiting; assert no throw. Expected red: same as first test. +- `slotReleased — does not double-release waiters when called multiple times` — unit — queue 1 waiter; call `slotReleased()` twice; assert the waiter resolves exactly once (no `unhandledRejection` from a double-resolution path). Expected red: waiter rejected, or never resolves. + +**Implementation** (`src/router/slot-waiter.ts`): +- Internal queue `pendingResolvers: Array<{ resolve: () => void; reject: (e: Error) => void; timeoutHandle: NodeJS.Timeout }>`. +- Export `acquireSlot(opts: { timeoutMs: number }): Promise` — checks `getActiveWorkerCount() < routerConfig.maxWorkers`; if true, resolves immediately; otherwise pushes a resolver onto the queue with a `setTimeout` that rejects with a tagged `Error & { code: 'SLOT_WAIT_TIMEOUT' }`. +- Export `slotReleased(): void` — pops + resolves the head of the queue, clearing its timeout. No-op if empty. +- Export `clearAllWaiters()` for test cleanup + router shutdown (rejects pending waiters with a tagged `code: 'SHUTDOWN'`). +- Use a `code` field instead of error-message string-matching so the retry classifier doesn't depend on free-form text. + +### 2. `guardedSpawn` rewrite — wait-for-slot + +**Tests first** (`tests/unit/router/worker-manager.test.ts` — REPLACE the existing line-179 test): + +- (replace) `processFn awaits a slot when at capacity, then dispatches when one frees` — unit — `maxWorkers = 1`, mock `getActiveWorkerCount` to return `1` initially, `0` after a tick; start `processFn(fakeJob)`; assert `mockSpawnWorker` not called yet; advance fake timers + emit `slotReleased`; assert `mockSpawnWorker` is now called. Expected red: today's contract throws synchronously — assertion `mockSpawnWorker.toHaveBeenCalled()` fails. (Removes the existing `'processFn throws when at capacity'` assertion.) +- `processFn rejects with code: 'SLOT_WAIT_TIMEOUT' when the wait exceeds the configured ceiling` — unit — at-capacity for the entire wait window; assert promise rejects with the tagged error. Expected red: today rejects with the untagged `Error('No worker slots available')`. +- `processFn calls slotReleased after spawnWorker resolves` — unit — happy path; assert `mockSlotReleased` is invoked once after a successful spawn. (Note: today, slot accounting is implicit via container-exit; we add an explicit `slotReleased` so the waiter queue advances even when many spawns happen back-to-back without container exits in between.) Expected red: spy not called. +- `processFn — when spawnWorker throws, the slot is still released so other waiters proceed` — unit — `mockSpawnWorker` rejects; assert `mockSlotReleased` is still invoked (try/finally semantics) and the rejection propagates. Expected red: `slotReleased` not called → next waiter never proceeds. + +**Implementation** (`src/router/worker-manager.ts`): +- Replace `guardedSpawn`'s body with: `await acquireSlot({ timeoutMs: routerConfig.slotWaitTimeoutMs })`; then `try { await spawnWorker(job); } finally { /* note: existing exit path handles slotReleased on container exit; the dispatcher does NOT call it here, see decision below */ }`. 
+- **Decision**: do NOT call `slotReleased()` from `guardedSpawn` itself. The slot is conceptually held by the running container, not by the dispatcher. `slotReleased()` is called once from `cleanupWorker` (in `active-workers.ts`) at container-exit time. This keeps a single source of truth for "slot freed = container exited." The test "calls slotReleased after spawnWorker resolves" above is therefore reframed: it asserts slot accounting is wired through container-exit, not through dispatcher exit. Update that test accordingly. (Stated explicitly so /implement does not get confused by the initial draft.) +- New config field `slotWaitTimeoutMs` in `routerConfig` (default e.g. `5 * 60 * 1000` = 5min — enough for typical worker runs to complete, short enough that a stuck pool surfaces within a webhook receiver's life). Document the default in the config module. + +### 3. Hook `slotReleased` into `cleanupWorker` + +**Tests first** (`tests/unit/router/active-workers.test.ts` — extend): + +- `cleanupWorker — calls slotReleased exactly once per cleanup` — unit — register a worker; call `cleanupWorker(jobId)`; assert spy called once. Expected red: spy not called. +- `cleanupWorker — calls slotReleased even on the crash path (exitCode != 0)` — unit — call with `cleanupWorker(jobId, 137, { oomKilled: true })`; assert spy still called once. Expected red: spy not called. +- `cleanupWorker — does NOT double-call slotReleased on duplicate cleanup invocations` — unit — call `cleanupWorker(jobId)` twice (e.g. timeout-then-exit race); assert spy called exactly once. Expected red: spy called twice (the existing `if (worker)` guard handles this if we put `slotReleased` inside the same block). + +**Implementation** (`src/router/active-workers.ts`): +- Inside the existing `if (worker) { ... }` block in `cleanupWorker`, after the existing lock-release calls and before the `activeWorkers.delete(jobId)` line, call `slotReleased()`. The existing `if (worker)` guard already ensures idempotence. + +### 4. Dispatch-error classifier + +**Tests first** (`tests/unit/router/dispatch-error-classifier.test.ts` — new file): + +- `classifyDispatchError — Docker daemon unreachable error → 'transient'` — unit — synthesize an error with `err.code === 'ECONNREFUSED'` and a Dockerode-shaped message; assert classifier returns `'transient'`. Expected red: module-not-found. +- `classifyDispatchError — image-pull rate-limit (HTTP 429 from registry) → 'transient'` — unit — synthesize an error matching the dockerode rate-limit shape; assert `'transient'`. Expected red: as above. +- `classifyDispatchError — container name collision race ("name already in use") → 'transient'` — unit — synthesize the dockerode 409 conflict error; assert `'transient'`. Expected red: as above. +- `classifyDispatchError — image not found after fallback (the `isImageNotFoundError` path that already throws today) → 'terminal'` — unit — pass an error matching `isImageNotFoundError()`; assert `'terminal'`. Expected red: as above. +- `classifyDispatchError — validation error (e.g. missing credentials, malformed config) → 'terminal'` — unit — pass a non-Docker `TypeError`-shaped error; assert `'terminal'`. Expected red: as above. +- `classifyDispatchError — slot-wait timeout (code: 'SLOT_WAIT_TIMEOUT') → 'transient'` — unit — pass the slot-waiter timeout error; assert `'transient'`. The slot-wait timeout itself is treated as transient because the next retry will likely find a slot; only repeated timeouts within the retry budget surface as a real outage. 
Expected red: as above. +- `classifyDispatchError — unknown error (no recognizable shape) → 'transient'` — unit — pass `new Error('something weird')`; assert `'transient'` (default-to-retry is the safer choice; a true bug surfaces via attempt exhaustion). Expected red: as above. + +**Implementation** (`src/router/dispatch-error-classifier.ts`): +- Export `classifyDispatchError(err: unknown): 'transient' | 'terminal'`. +- Recognize: `ECONNREFUSED`, `ECONNRESET`, `ENOTFOUND` on the Docker socket → transient; HTTP 429 from registry → transient; HTTP 409 "name already in use" → transient; the existing `isImageNotFoundError(err)` shape (after fallback already exhausted) → terminal; explicit `TypeError` / `ZodError` shapes → terminal; `code: 'SLOT_WAIT_TIMEOUT'` → transient; default → transient. +- Re-use the existing `isImageNotFoundError` predicate from `container-manager.ts` (export it if not already exported). + +### 5. Wire classifier into `guardedSpawn` + retry config on both queues + +**Tests first** (`tests/unit/router/worker-manager.test.ts` — extend): + +- `processFn — when spawnWorker rejects with a transient error, propagates the rejection unchanged so BullMQ retries via attempts/backoff` — unit — mock spawnWorker to reject with an `ECONNREFUSED`-shaped error; assert processFn rejects with the same error (not wrapped in `UnrecoverableError`). Expected red: today's processFn does not differentiate, so this test should pass as a baseline; it pins behavior so Plan 2 doesn't accidentally swallow transient errors. +- `processFn — when spawnWorker rejects with a terminal error, wraps in BullMQ's UnrecoverableError so retries are skipped` — unit — mock spawnWorker to reject with a `ZodError`-shaped error (or a wrapped image-not-found); assert processFn rejects with `instanceof UnrecoverableError`. Expected red: rejection is the original error, not `UnrecoverableError`. + +**Tests first** (`tests/integration/router/dispatch-retry.test.ts` — new file): + +- `cascade-jobs queue — retries a transient dispatch failure with backoff and eventually succeeds` — integration — push a real job; mock spawnWorker to reject with an `ECONNREFUSED`-shaped error on attempt 1, succeed on attempt 2; assert the job ultimately moves to `completed` state, not `failed`. Expected red: job moves to `failed` after attempt 1 (current `attempts: 1` behavior). +- `cascade-jobs queue — terminal dispatch failure does NOT retry` — integration — mock spawnWorker to reject with a `ZodError`; assert exactly one attempt, job moves to `failed`. Expected red: with `attempts > 1` but no UnrecoverableError wrap, BullMQ would retry — this test catches that mistake. +- `cascade-jobs queue — sustained transient outage exhausts retries within the configured window and surfaces failure` — integration — mock spawnWorker to always reject with `ECONNREFUSED`; assert exactly N attempts (matching the configured budget), final state `failed`, and that the failed-event compensator from Plan 1 runs once at exhaustion (verifies the compose with Plan 1 works). Expected red: missing retry, or compensator running multiple times during retries. +- `cascade-dashboard-jobs queue — same retry behavior on a manual-run job` — integration — mirror of the first test against the dashboard queue. Expected red: dashboard queue still has `attempts: 1`, no retries. 
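+
+The tests above pin down the classifier's taxonomy; a rough sketch of what that split could look like follows. It is a hedged illustration rather than the final code: the `statusCode` and `code` checks are assumptions about how dockerode-shaped errors surface, and `isImageNotFoundError` below is only a local stand-in for the existing predicate in `container-manager.ts`.
+
+```typescript
+type DispatchErrorClass = 'transient' | 'terminal';
+
+// Local stand-in for the predicate the plan reuses from container-manager.ts (assumption).
+const isImageNotFoundError = (err: unknown): boolean =>
+  err instanceof Error && /no such image/i.test(err.message);
+
+// Socket-level codes treated as transient, plus the slot-waiter's own timeout tag.
+const TRANSIENT_CODES = new Set(['ECONNREFUSED', 'ECONNRESET', 'ENOTFOUND', 'SLOT_WAIT_TIMEOUT']);
+
+export function classifyDispatchError(err: unknown): DispatchErrorClass {
+  const e = err as { code?: string; statusCode?: number; name?: string };
+
+  // Terminal: retrying can never succeed, so fail fast and skip the retry budget.
+  if (isImageNotFoundError(err)) return 'terminal'; // fallback image genuinely missing
+  if (err instanceof TypeError || e?.name === 'ZodError') return 'terminal'; // validation / config bugs
+
+  // Transient: Docker-side hiccups a later attempt may not hit.
+  if (e?.code && TRANSIENT_CODES.has(e.code)) return 'transient'; // daemon unreachable, slot-wait timeout
+  if (e?.statusCode === 429 || e?.statusCode === 409) return 'transient'; // registry rate-limit, name race
+
+  // Unknown shapes default to retry; a real bug still surfaces via attempt exhaustion.
+  return 'transient';
+}
+```
+
+In `guardedSpawn`, a `'terminal'` classification would then be wrapped in BullMQ's `UnrecoverableError` so the retry budget is skipped, exactly as the implementation notes below spell out.
+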
**Implementation**: +- In `src/router/worker-manager.ts`'s `guardedSpawn`, wrap the `spawnWorker` call in `try { await spawnWorker(job); } catch (err) { if (classifyDispatchError(err) === 'terminal') throw new UnrecoverableError(String(err)); throw err; }`. (The slot is released by the container-exit path, not here.) +- In `src/router/queue.ts` and `src/queue/client.ts`, change `defaultJobOptions.attempts` from `1` to **4** and add `backoff: { type: 'exponential', delay: 5_000 }`. Concrete budget: attempt 1 (immediate) + retries after ~5s + ~10s + ~20s = ~35s of backoff in total. Sustained-outage detection, including per-attempt spawn time, still lands within ~90s. Tunable via env or config in a future patch; keep them as plain literals for now. +- The retry attempts apply to the entire processFn invocation, including the slot-wait. A retry attempt that re-acquires the slot is normal behavior; a retry attempt whose slot-wait also times out classifies as transient and proceeds to the next retry. + +### 6. CLAUDE.md update + CHANGELOG entry + +**Tests first**: n/a — documentation change (no tests, but `/implement`'s Phase 6 verifies the file changed). + +**Implementation**: +- Edit the existing CLAUDE.md "Worker exit diagnostics" / "Work-item concurrency lock" section. Add a new short subsection titled "Dispatch failure semantics" capturing: + - Capacity miss now waits for a slot (bounded timeout, default 5min). + - Transient Docker errors retry with exponential backoff up to 4 attempts (~35s of backoff). + - Terminal errors (validation, image-not-found-after-fallback) skip retries via `UnrecoverableError`. + - Failed-event compensation releases work-item / agent-type / dedup locks on every dispatch failure path. + - Webhook decision reasons split: `Job queued` / `Awaiting worker slot: …` / `Work item locked (no active dispatch): …`. The third is a regression canary — its presence in webhook logs means a code path acquired a lock without registering its compensation. +- `CHANGELOG.md`: entry under the next release: "Router: dispatch capacity miss now waits for a worker slot instead of throwing; transient Docker errors retry with exponential backoff; terminal errors skip retries. Combined with plan 015/1, the original silent black-hole failure mode is closed." + +--- + +## Test Plan + +### Unit tests +- [ ] `tests/unit/router/slot-waiter.test.ts` (new): 6 tests +- [ ] `tests/unit/router/worker-manager.test.ts` (replace + extend): 1 replacement + 2 new tests +- [ ] `tests/unit/router/active-workers.test.ts` (extend): 3 new tests for `slotReleased` integration with `cleanupWorker` +- [ ] `tests/unit/router/dispatch-error-classifier.test.ts` (new): 7 tests covering the transient/terminal taxonomy + +### Integration tests +- [ ] `tests/integration/router/dispatch-retry.test.ts` (new): 4 tests covering `cascade-jobs` + `cascade-dashboard-jobs` × transient retry succeeds + terminal skips + sustained exhausts. + +### Acceptance tests +- [ ] AC #1: integration test "transient capacity → wait then dispatch when slot frees" (in slot-waiter integration scenarios) +- [ ] AC #2: integration test "transient Docker → retry → success" +- [ ] AC #3: integration test "sustained outage → retry exhaustion → marked failed + Sentry" +- [ ] AC #9: the existing capacity-throw test is replaced; CI fails if both old and new assertions are present. + +### Regression coverage (Plan 1 ACs) +- [ ] All Plan 1 integration tests in `tests/integration/router/dispatch-failure-compensation.test.ts` continue to pass against the new dispatch contract.
The existing tests already cover the lock-release path; Plan 2's new failure modes (slot-wait timeout, retry exhaustion, UnrecoverableError) all flow through the same `'failed'` event hook, so the existing tests should pass without modification. If they need adjustment, that's a sign the compensator needs broader extractor coverage — handle it then. + +--- + +## Manual Verification (for `[manual]`-tagged ACs only) + +n/a — all ACs auto-tested. + +--- + +## Acceptance Criteria (per-plan, testable) + +1. The slot-waiter primitive resolves immediately when capacity is below max, suspends when at capacity, resolves on `slotReleased`, and times out cleanly with a tagged error. +2. `guardedSpawn` no longer throws on capacity; it waits for a slot up to `slotWaitTimeoutMs`. +3. `cleanupWorker` calls `slotReleased` exactly once per cleanup, including on the crash path. +4. `classifyDispatchError` correctly partitions transient vs terminal errors per the taxonomy above; the slot-wait timeout itself classifies as transient. +5. Both `cascade-jobs` and `cascade-dashboard-jobs` queue defaults specify `attempts: 4` with exponential backoff `delay: 5000`. +6. Terminal errors are wrapped in `UnrecoverableError` so BullMQ skips retries. +7. A transient dispatch failure on either queue retries and eventually succeeds; a terminal failure does not retry; a sustained outage exhausts the budget within ~90s and surfaces a failed run + Sentry capture. +8. The previously-existing `'processFn throws when at capacity'` test is replaced (not deleted) with the wait-for-slot contract. CI must not have both assertions present. +9. CLAUDE.md is updated with the new dispatch-failure-semantics passage covering both halves of the contract (Plan 1 + Plan 2). +10. CHANGELOG.md has the Plan 2 entry. +11. Plan 1's integration tests continue to pass against the new dispatch contract (regression). +12. All new/modified code has corresponding tests written before the implementation. +13. `npm run build` passes. +14. `npm test` passes. +15. `npm run test:integration` passes. +16. `npm run lint` passes. +17. `npm run typecheck` passes. + +--- + +## Documentation Impact (this plan only) + +| File | Change | +|---|---| +| `CLAUDE.md` | New subsection (under the existing "Worker exit diagnostics" / lock section) titled "Dispatch failure semantics" — captures wait-for-slot contract, retry budget + classifier, failed-event compensation, three-way decision-reason taxonomy. | +| `CHANGELOG.md` | Entry under the next release describing wait-for-slot + retry-with-backoff + classifier. References the spec ID 015. | + +--- + +## Out of Scope (this plan) + +- Worker-pool autoscaling / `maxWorkers` adjustments — explicit non-goal in spec. +- BullMQ → another queue migration — explicit non-goal in spec. +- Work-item-lock semantics — explicit non-goal in spec (one-per-type stays). +- Snapshot reuse / fallback-to-base-image logic — explicit non-goal in spec. +- Manual-run lock-bypass behavior changes — explicit non-goal in spec. +- Failed-set inspection UI — explicit non-goal in spec. +- Startup re-enqueue sweep that picks up jobs already in the failed set — explicit non-goal in spec. +- Cleanup of the existing dead `linear-1777217350854-2qvhjo` job — operational, out of scope per spec. +- Cross-router-instance lock coordination — out of scope per spec; existing DB-fallback continues to handle restarts. 
+ +--- + +## Progress + + +- [ ] AC #1 (slot-waiter primitive) +- [ ] AC #2 (guardedSpawn awaits) +- [ ] AC #3 (cleanupWorker calls slotReleased) +- [ ] AC #4 (classifier taxonomy) +- [ ] AC #5 (queue retry config) +- [ ] AC #6 (UnrecoverableError wrap) +- [ ] AC #7 (end-to-end retry behavior) +- [ ] AC #8 (capacity-throw test replaced) +- [ ] AC #9 (CLAUDE.md updated) +- [ ] AC #10 (CHANGELOG) +- [ ] AC #11 (Plan 1 regression) +- [ ] AC #12 (TDD discipline) +- [ ] AC #13 (build) +- [ ] AC #14 (unit tests) +- [ ] AC #15 (integration) +- [ ] AC #16 (lint) +- [ ] AC #17 (typecheck) diff --git a/docs/plans/015-router-job-dispatch-failure-recovery/_coverage.md b/docs/plans/015-router-job-dispatch-failure-recovery/_coverage.md new file mode 100644 index 00000000..18ffb128 --- /dev/null +++ b/docs/plans/015-router-job-dispatch-failure-recovery/_coverage.md @@ -0,0 +1,39 @@ +# Coverage map for spec 015-router-job-dispatch-failure-recovery + +Auto-generated by /plan. Tracks which plans satisfy which spec ACs. + +## Spec ACs + +| # | Spec AC (short) | Satisfied by | Status | +|---|---|---|---| +| 1 | Capacity-busy → run starts when capacity frees, no permanently failed job | plan 2 (wait-for-slot-and-retry-classifier) | full | +| 2 | Transient Docker error → retry succeeds, no permanently failed job | plan 2 | full | +| 3 | Sustained outage → run marked failed, Sentry capture, no silent loss | plan 2 | full | +| 4 | No stranded lock after any dispatch failure | plan 1 (failed-event-lock-compensation) | full | +| 5 | Decision reason distinguishes in-flight from wedged-lock | plan 1 | full | +| 6 | Wedged-lock canary never fires under normal operation | plan 1 | full | +| 7 | Manual-run path bypass + dispatch unchanged | plan 1 (full) + plan 2 (regression) | full | +| 8 | Clean-exit path identical to today | plan 1 (full) + plan 2 (regression) | full | +| 9 | Existing throw-on-capacity test replaced (not deleted) with wait-for-slot contract | plan 1 (adds new failed-event tests) + plan 2 (replaces the existing test) | partial chain | + +## Coverage summary + +- **9 spec ACs** mapped to **2 plans** +- **8 plans-x-AC pairs full coverage** (each AC fully satisfied by its assigned plan(s)) +- **1 spec AC** with partial-chain coverage (AC #9 — Plan 1 grows test coverage; Plan 2 replaces the contract-bearing assertion) +- All non-goals from the spec's Out of Scope are mirrored in each plan's Out of Scope section. + +## Plan dependency graph + +``` +1-failed-event-lock-compensation ──→ 2-wait-for-slot-and-retry-classifier +``` + +Plan 1 ships the safety-net (compensation + decision-reason taxonomy) independently. Plan 2 introduces the dispatch-contract change (wait-for-slot + retry/classifier) and depends on Plan 1's compensator being live so any new failure modes don't re-create the wedged-lock symptom. After Plan 1 lands, prod's user-visible damage is half-fixed (no more wedged locks); after Plan 2 lands, the spec's bug class is closed (no more lost jobs). 
+ +## Documentation impact distribution + +| Spec doc | Plan | +|---|---| +| `CLAUDE.md` | Plan 2 (single passage covering both halves of the new contract — avoids in-flight inconsistency from a Plan-1-only update) | +| `CHANGELOG.md` | Plan 1 + Plan 2 (one entry per shipping plan; each describes its half of the fix and references spec ID 015) | diff --git a/docs/specs/015-router-job-dispatch-failure-recovery.md b/docs/specs/015-router-job-dispatch-failure-recovery.md new file mode 100644 index 00000000..cbcf519c --- /dev/null +++ b/docs/specs/015-router-job-dispatch-failure-recovery.md @@ -0,0 +1,130 @@ +--- +id: 015 +slug: router-job-dispatch-failure-recovery +level: spec +title: Router job dispatch failure recovery +created: 2026-04-26 +status: draft +--- + +# 015: Router job dispatch failure recovery + +## Problem & Motivation + +CASCADE's router has a silent black-hole failure mode. When a webhook-driven job is pulled off the queue and the dispatcher cannot immediately spawn a worker — because every worker slot is already occupied by a prior run, or because the Docker daemon hiccups, or because an image-pull rate-limit fires — the dispatcher throws synchronously. With the queue's current "attempts: 1, no retry" defaults, that single throw moves the job straight to the failed set in Redis. It is never picked up again, even after capacity frees up seconds later. The user gets no error in the UI; the agent never runs; the work item appears to have been silently ignored. + +The damage doesn't stop at one lost job. The webhook handler establishes in-memory bookkeeping — a same-type-per-work-item concurrency lock, an agent-type concurrency counter, and a recently-dispatched dedup mark — *between* the moment it puts the job on the queue and the moment a worker container actually starts. The lock-release logic only fires from the worker-exit path, which itself only runs when a worker container was actually started. A spawn-time throw never reaches that path, so all three lock entries leak. The work-item lock in particular has a 30-minute TTL and only auto-clears on router restart. For 30 minutes after a transient capacity miss, every subsequent webhook for that work item + agent type is silently rejected with the misleading reason `"Work item locked: 1 enqueued (max 1 per type)"`. The user, seeing no agent run after dragging a card to Todo, drags it again — and again — with no feedback that the lock is wedged. + +This was hit in production on 2026-04-26 against the ucho project (Linear card MNG-350). A user moved the card to Todo at 15:29:07 UTC. The job was enqueued and pulled by BullMQ; the only worker slot was held by an unrelated MNG-354 implementation that had spawned ~2 seconds earlier. The MNG-350 dispatcher threw "No worker slots available" and the job was failed. The user re-moved the card to Backlog→Todo three more times across the next 18 minutes — every webhook returned the misleading "locked" decision. The card sat dead until a manual CLI trigger force-dispatched through a separate path that bypasses the lock. This is a credibility-class incident for an automation product whose entire value proposition is "PM card moves automatically translate into agent runs." Unattended workflows must not silently break. 
+ +The fix needs three contract changes that work in concert: (1) transient dispatch failures must retry, with bounded attempts and backoff so a real outage isn't masked, (2) any in-memory lock state acquired at enqueue time must be released the moment dispatch fails, before any subsequent webhook can be misled by it, and (3) the webhook decision-reason taxonomy must let users (and operators reading webhook logs) distinguish a healthy in-flight run from a wedged lock that needs intervention. + +--- + +## Goals + +1. A transient over-capacity condition during dispatch never causes a job to be permanently lost. The job either backpressures naturally until a slot is available, or is retried with bounded attempts and is only declared dead after the bound is exhausted. +2. A transient Docker-side spawn error (daemon unreachable, image-pull rate-limit, container-name race) is retried within a bounded attempt budget. A terminal error (e.g. validation failure, missing credentials, a fallback image that genuinely doesn't exist) fails fast without burning the retry budget. +3. Every in-memory lock entry acquired during the webhook → enqueue path is released when the dispatch ultimately fails, before the next webhook for the same work item arrives. No 30-minute wedge is possible regardless of which exception path the dispatcher took. +4. Webhook decision logs distinguish healthy in-flight state ("queued, awaiting slot") from wedged-lock state ("lock count says enqueued but no active dispatch can be found"). Users moving a card see decision reasons that describe what is actually happening. +5. Both the webhook-driven job queue and the dashboard manual-run queue benefit from the same retry symmetry. The work-item lock bypass on manual-run remains untouched (intentional escape hatch), but neither queue should permanently lose a job to a single transient dispatch failure. +6. Existing healthy paths (successful spawn, successful exit, manual-run, snapshot fallback, agent-type concurrency limits) continue to behave identically. The fix is failure-path symmetry, not a redesign of dispatch. + +--- + +## Non-goals + +- Redesigning worker-pool sizing or how `maxWorkers` is computed. +- Overhauling the BullMQ concurrency model or moving off BullMQ. +- Changing the work-item-lock semantics (one same-type agent per work item per project remains the rule, per spec 007). +- Modifying the snapshot reuse / fallback-to-base-image logic. +- Changing the dashboard manual-run lock-bypass behavior — manual runs are the user's escape hatch and stay that way. +- Building a UI surface for failed-set inspection. Webhook log decision reasons are the user-facing surface for now. +- Backfilling a re-dispatch sweep at router startup that picks up jobs already dead in the failed set (operational cleanup; out of scope for the contract change). + +--- + +## Constraints + +- The fix must not introduce new race windows. The point at which lock state is acquired and the point at which it is released must remain consistent across all dispatch outcomes (success, transient failure with eventual retry success, transient failure exhausting retries, terminal failure). +- Retry budgets must be small enough that a true Docker outage is surfaced within minutes, not hours. A transient blip should be invisible; a sustained outage should still page someone through Sentry within a few minutes. +- The dispatcher must not block forever waiting for a worker slot if the system is genuinely stuck. 
Any wait-for-slot mechanism needs a timeout that surfaces a Sentry-visible error, not a silent hang. +- Job-data carried into the retry path must remain identical across attempts. A retry must dispatch the same work item, same agent type, same ack info — not a re-derived version that could disagree with what the first attempt registered locks for. +- The change is restricted to the router's dispatch surface. Trigger-handler logic, agent input building, and the worker container's runtime contract are unchanged. + +--- + +## User stories / Requirements + +1. **As a CASCADE user**, when I move a Linear/Trello/JIRA card to a state that triggers an agent, the agent run starts even if the worker pool is briefly at capacity. I never have to drag the card a second time to "wake the system up." +2. **As a CASCADE user**, when something is genuinely wrong with the worker fleet (Docker down, host out of memory), I get a clear failure signal — either an ack comment that the run failed or an error visible in the dashboard — rather than a silently dropped job. +3. **As an operator reading webhook logs during an incident**, I can tell the difference between "this work item has 1 active run in progress, the new webhook was correctly deduped" and "this work item's lock count is non-zero but nothing is actually running — the lock is wedged." +4. **As a router process**, when I throw during dispatch, I leave no in-memory lock entry behind that could mislead a future webhook for the same work item + agent type. +5. **As a router process**, when a transient dispatch failure resolves on retry, the job runs to completion exactly once. The lock state at run-start matches the lock state at run-end. +6. **As a router process restarting after a crash**, I do not strand work-item locks across the restart. (Already true today via in-memory locks living only in process memory; this spec must not regress that.) + +--- + +## Research Notes + +- BullMQ's worker model treats `processFn` rejection as "job failed" and consults the job's `attempts` and `backoff` options to decide whether to retry. With `attempts: 1` and no backoff, any rejection moves the job to the failed set permanently. ([BullMQ docs — Retrying failing jobs](https://docs.bullmq.io/guide/retrying-failing-jobs)) +- BullMQ exposes a `worker.on('failed', ...)` event that fires after a job's final retry exhaustion. This is the natural seam for compensating in-memory state established outside BullMQ's transactional view (locks, dedup marks). The event receives the job and the error, sufficient to identify what to release. +- Semaphore-style backpressure (P-Queue, Throat, Bottleneck) is the standard pattern for "wait-for-slot" semantics that integrate cleanly with async functions. The trade-off vs throw-and-retry is fewer failed-set entries at the cost of holding more open async tasks. For our use case (≤10 concurrent slots, sub-second wait granularity, retry attempts cost a Redis round-trip each), wait-for-slot is the cleaner fit. +- BullMQ's `concurrency` setting controls how many `processFn` invocations are in flight, not how many "work items" are running. Because our `processFn` resolves on container *start*, not container *exit*, BullMQ's concurrency does not function as a worker-pool cap. This is the root cause of the throw-on-capacity pattern existing in the first place: someone's safety net for a mismatch the queue isn't aware of. The spec restores correctness by closing that mismatch on the dispatch side. 
+- "Compensating actions" (release locks on dispatch failure) is a long-standing pattern in transaction processing — see Garcia-Molina & Salem's "Sagas" (1987). The applicable rule: every action that acquires durable state outside the local transaction must register a compensator that runs on rollback. In our case, the rollback trigger is the BullMQ failed event. + +--- + +## Open Source Decisions + +| Tool | Solves | Decision | Reason | +|------|--------|----------|--------| +| [BullMQ retry/backoff](https://docs.bullmq.io/guide/retrying-failing-jobs) | Bounded retry of transient dispatch failures | **Use** | Already in use; configuring `attempts` + `backoff` on the existing queue is the smallest possible change for the retry contract. | +| [BullMQ Worker `failed` event](https://docs.bullmq.io/guide/events) | Cleanup hook for in-memory lock state on dispatch failure | **Use** | The only reliable place to compensate state for *any* dispatch-path exception, including ones that don't exist yet. Already wired for logging — we extend the existing handler. | +| Semaphore library (e.g. [async-sema](https://github.com/vercel/async-sema), [p-limit](https://github.com/sindresorhus/p-limit)) | Wait-for-slot backpressure inside the dispatch processFn | **Skip (default), revisit during /plan** | A small in-house counter with a queue of resolvers is sufficient and has zero new deps. Plan can reverse this if it finds a real reason to adopt a library. | + +--- + +## Strategic decisions + +1. **Capacity miss is handled by wait-for-slot, not throw-and-retry.** The dispatcher awaits a slot up to a bounded timeout instead of throwing. Reason: eliminates the failed-set churn for capacity entirely, leaving retries reserved for genuine Docker errors. If the wait-for-slot timeout itself trips, it surfaces as a real error (Sentry-visible) and a single retry attempt — not silent loss. +2. **Retries are reserved for transient Docker-side errors.** A small bounded retry budget with exponential backoff applies to Docker daemon unreachable, image-pull rate-limit, and container-name collision races. Terminal errors (validation, missing credentials, fallback image not found) fail fast on the first attempt. The classifier lives at the boundary of the dispatcher; the spec defines the dichotomy, the plan picks the implementation. +3. **Lock compensation runs from the queue's failed event.** Cleanup hooks attached to the BullMQ failed event read the original job's `(projectId, workItemId, agentType)` payload and release every lock entry that the enqueue path established. This catches every dispatch-path exception, including paths that don't exist today. We do not rely on per-throw try/finally inside individual dispatch functions — that approach is fragile and depends on each new code path remembering to compensate. +4. **Webhook decision reasons are split into three states.** "Job queued" (success), "Awaiting worker slot" (queued behind N active runs, healthy), and "Work item locked (no active dispatch)" (wedged-lock state, requires correlating in-memory lock count with active worker registry + BullMQ queue state). The third reason exists specifically as a diagnostic for operators and as a safety net to confirm the compensation path is working — if it ever fires after this spec ships, that's a regression. +5. **Both queues get the retry treatment; only one keeps the lock-release path.** The dashboard manual-run queue and the webhook job queue both get the retry-on-transient-failure contract. 
The manual-run path's existing lock-bypass behavior is preserved unchanged — manual runs are the user's escape hatch when something else is wedged. +6. **The currently-stuck job in production is operational cleanup, not in scope.** The dead `linear-1777217350854-2qvhjo` job in the failed set is removed by hand. This spec ensures *future* failures don't strand jobs; an automatic startup sweep that re-enqueues failed-set entries is a deliberate non-goal — surfacing those errors loudly is more valuable than silently re-attempting them. + +--- + +## Acceptance Criteria (outcome-level) + +1. A user moves a PM card to a triggering state while the worker pool is at capacity. The agent run starts when capacity frees, with no further user action. The same flow does not produce a permanently failed job in BullMQ. +2. A user moves a PM card to a triggering state while the Docker daemon is briefly unreachable (e.g. socket reconnects within a few seconds). The agent run starts on retry. The same flow does not produce a permanently failed job in BullMQ. +3. A user moves a PM card to a triggering state while Docker is genuinely down (sustained outage exceeding the retry budget). The webhook is acknowledged, the run is marked failed in the dashboard, and a Sentry error is captured. No silent loss. +4. After any dispatch failure path — capacity-wait timeout, transient retry exhaustion, terminal error — a webhook for the same work item + agent type that arrives immediately afterward is **not** rejected with a "locked" decision reason caused by a stranded in-memory lock. The lock state matches the actual worker registry state. +5. While a worker is actively dispatching or running for a work item, a duplicate webhook for the same work item + agent type is rejected with a decision reason that distinguishes "in-flight, healthy" from "wedged lock." An operator inspecting webhook logs can tell which is which without reading source. +6. The wedged-lock decision reason never fires under normal operation. Its presence in webhook logs is itself a signal that compensation has missed a path — i.e. it is intended as a diagnostic invariant, not a routine state. +7. The dashboard manual-run path continues to bypass the work-item lock and continues to dispatch successfully even when the lock count for that work item + agent type is non-zero. +8. A worker container that successfully dispatches and exits cleanly behaves identically to today: locks released on exit, snapshot committed (where applicable), run record in the dashboard. +9. Existing test coverage for the worker-manager dispatch path is updated to reflect the new contract (specifically, the assertion that "processFn throws when at capacity" is replaced with the new wait-for-slot contract). No test silently passes that previously failed. + +--- + +## Documentation Impact (high-level) + +- `CLAUDE.md` — the project root entry already documents the work-item concurrency lock and the worker exit diagnostics; needs an updated paragraph capturing the new dispatch retry contract and the three-way webhook decision-reason taxonomy. This is load-bearing cross-cutting behavior with no other natural home and is exactly the kind of invariant that decays in CLAUDE.md if not updated alongside the change. +- `CHANGELOG.md` — entry under the next release noting the silent black-hole fix and the new decision-reason vocabulary. + +--- + +## Out of Scope + +- Worker-pool sizing or `maxWorkers` autoscaling. +- BullMQ → another queue migration. 
+- Work-item-lock semantics (one same-type agent per work item, per spec 007). +- Snapshot reuse and snapshot-image fallback logic. +- Manual-run lock-bypass behavior (intentional, stays). +- Failed-set inspection UI in the dashboard. +- Startup sweep that re-enqueues already-failed jobs from the failed set. +- Cleanup of the existing dead `linear-1777217350854-2qvhjo` entry — operational, handled out-of-band. +- Cross-router-instance lock coordination (locks remain in-process; spec 007 lock semantics already handle this via DB fallback). From e2fb2f8f127b05f679f2d03f09a50bc0c924c40f Mon Sep 17 00:00:00 2001 From: Zbigniew Sobiecki Date: Sun, 26 Apr 2026 16:21:27 +0000 Subject: [PATCH 2/7] chore(plan): lock 015/1 failed-event-lock-compensation Co-Authored-By: Claude Opus 4.7 (1M context) --- ...-compensation.md => 1-failed-event-lock-compensation.md.wip} | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename docs/plans/015-router-job-dispatch-failure-recovery/{1-failed-event-lock-compensation.md => 1-failed-event-lock-compensation.md.wip} (99%) diff --git a/docs/plans/015-router-job-dispatch-failure-recovery/1-failed-event-lock-compensation.md b/docs/plans/015-router-job-dispatch-failure-recovery/1-failed-event-lock-compensation.md.wip similarity index 99% rename from docs/plans/015-router-job-dispatch-failure-recovery/1-failed-event-lock-compensation.md rename to docs/plans/015-router-job-dispatch-failure-recovery/1-failed-event-lock-compensation.md.wip index 6606b36f..8d53025f 100644 --- a/docs/plans/015-router-job-dispatch-failure-recovery/1-failed-event-lock-compensation.md +++ b/docs/plans/015-router-job-dispatch-failure-recovery/1-failed-event-lock-compensation.md.wip @@ -6,7 +6,7 @@ plan_slug: failed-event-lock-compensation level: plan parent_spec: docs/specs/015-router-job-dispatch-failure-recovery.md depends_on: [] -status: pending +status: wip --- # 015/1: Failed-event lock compensation + decision-reason taxonomy From f3139a503aa9dfd3018f0f0429f2aae1f85bc811 Mon Sep 17 00:00:00 2001 From: Zbigniew Sobiecki Date: Sun, 26 Apr 2026 16:33:03 +0000 Subject: [PATCH 3/7] =?UTF-8?q?feat(router):=20plan=20015/1=20done=20?= =?UTF-8?q?=E2=80=94=20release=20locks=20on=20dispatch=20failure?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes the stranded-lock half of spec 015's bug class verified live in prod on 2026-04-26 (ucho/MNG-350). When a webhook-driven job's dispatch fails — capacity throw, Docker spawn error, or any future throw site — the work-item lock, agent-type concurrency counter, and recently-dispatched dedup mark established by the webhook → enqueue path are now released by a compensator hooked to BullMQ's `worker.on('failed')` event. What landed: - `src/router/dispatch-compensator.ts` (new) — `releaseLocksForFailedJob` wraps `extractProjectIdFromJob` / `extractWorkItemId` / `extractAgentType` and calls into `clearWorkItemEnqueued` / `clearAgentTypeEnqueued` / `clearRecentlyDispatched`. Never propagates errors; captures to Sentry with `tags: { source: 'dispatch_compensator' }`. - `src/router/agent-type-lock.ts` — exports new `clearRecentlyDispatched` for the compensator. The existing `markRecentlyDispatched` semantics are unchanged (60s TTL, NOT cleared on completion); this helper exists solely so a permanently-failed dispatch doesn't keep deduping a fresh webhook for ~60s while the user retries. 
- `src/router/bullmq-workers.ts` — extends the existing `worker.on('failed')` handler to invoke `releaseLocksForFailedJob` alongside the existing logger + Sentry calls. Wraps the call in a defensive `.catch` so a future regression in the compensator can't poison the worker. - `src/router/lock-state-classifier.ts` (new) — `classifyLockState` returns `'awaiting-slot'` when an active worker or queued/waiting job matches the trio, `'wedged'` when neither correlation matches. Defaults to `'awaiting-slot'` on classifier error so a Redis blip doesn't mis-emit the wedged canary. - `src/router/active-workers.ts` — `getActiveWorkers()` now exposes `(projectId, workItemId, agentType)` so the classifier can correlate. Backwards-compatible (existing callers work unchanged; new fields are additive optional). - `src/router/webhook-processor.ts` — Step 8 (work-item lock check) now splits the decision-reason vocabulary into three states: * `Job queued: ...` (success path) * `Awaiting worker slot: ...` (lock held + dispatch in flight; healthy) * `Work item locked (no active dispatch): ...` (wedged-lock canary) The wedged branch additionally fires `captureException` with `tags: { source: 'wedged_lock_canary' }` so any regression in compensation is loud in production. What this does NOT change (intentional, all in plan 015/2): - `guardedSpawn` still throws on capacity (BullMQ marks the job failed, the compensator now releases the locks, but the job itself is still lost). Plan 2 replaces the throw with a wait-for-slot semaphore. - Both queues still default to `attempts: 1`. Plan 2 raises this with exponential backoff and adds a transient/terminal error classifier. - CLAUDE.md is intentionally not updated by this plan — the unified passage describing both halves of the new contract lands in plan 015/2. Tests: - 5 new unit tests in `dispatch-compensator.test.ts` - 3 new unit tests in `agent-type-lock.test.ts` for `clearRecentlyDispatched` - 4 new unit tests in `bullmq-workers.test.ts` for the failed-event seam - 5 new unit tests in `lock-state-classifier.test.ts` - 2 new unit tests in `active-workers.test.ts` for the extended shape - 4 new unit tests in `webhook-processor.test.ts` for the three-way taxonomy - 3 new module-integration tests in `tests/integration/router/dispatch-failure-compensation.test.ts` exercise the real lock modules + real bullmq-workers.ts failed-event handler + real compensator end-to-end (only BullMQ's Worker constructor + the worker-env extractors are mocked). Full suite: 8515 passed / 23 skipped / 0 failed. Lint + typecheck clean. Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 1 + ... 
1-failed-event-lock-compensation.md.done} | 28 +-- src/router/active-workers.ts | 17 +- src/router/agent-type-lock.ts | 14 ++ src/router/bullmq-workers.ts | 20 ++ src/router/dispatch-compensator.ts | 49 +++++ src/router/lock-state-classifier.ts | 70 +++++++ src/router/webhook-processor.ts | 38 +++- .../dispatch-failure-compensation.test.ts | 193 ++++++++++++++++++ tests/unit/router/active-workers.test.ts | 36 +++- tests/unit/router/agent-type-lock.test.ts | 23 +++ tests/unit/router/bullmq-workers.test.ts | 75 +++++++ .../unit/router/dispatch-compensator.test.ts | 148 ++++++++++++++ .../unit/router/lock-state-classifier.test.ts | 122 +++++++++++ tests/unit/router/webhook-processor.test.ts | 97 ++++++++- 15 files changed, 911 insertions(+), 20 deletions(-) rename docs/plans/015-router-job-dispatch-failure-recovery/{1-failed-event-lock-compensation.md.wip => 1-failed-event-lock-compensation.md.done} (97%) create mode 100644 src/router/dispatch-compensator.ts create mode 100644 src/router/lock-state-classifier.ts create mode 100644 tests/integration/router/dispatch-failure-compensation.test.ts create mode 100644 tests/unit/router/dispatch-compensator.test.ts create mode 100644 tests/unit/router/lock-state-classifier.test.ts diff --git a/CHANGELOG.md b/CHANGELOG.md index d8bdc5de..66ea8f33 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,6 +6,7 @@ All notable user-visible changes to CASCADE are documented here. The format is l ### Changed +- **Router dispatch failures now release in-memory locks via the BullMQ failed event** (spec 015, plan 1 of 2). Hooks `worker.on('failed')` on both `cascade-jobs` and `cascade-dashboard-jobs` queues to call a new `releaseLocksForFailedJob` compensator that releases the work-item lock, agent-type concurrency counter, and recently-dispatched dedup mark for any job whose dispatch fails. Closes the stranded-lock half of the prod incident verified on 2026-04-26 (ucho/MNG-350): a transient capacity miss was leaving the in-memory work-item lock wedged for 30 minutes, silently rejecting subsequent webhooks for the same trio. Also splits the webhook decision-reason vocabulary into three states — `Job queued` (success), `Awaiting worker slot: …` (in-flight, healthy), `Work item locked (no active dispatch): …` (wedged-lock canary, fires a Sentry capture tagged `wedged_lock_canary` so any regression in compensation is loud). Plan 2 closes the lost-job half (wait-for-slot, retry budget, error classifier). See [spec 015](docs/specs/015-router-job-dispatch-failure-recovery.md). - **`cascade-tools scm create-pr-review`: `--comment` alias + `--comments-file` escape hatch** (spec 014, plan 2 of 2). The command now accepts `--comment` (singular) as an alias for `--comments` — the exact muscle-memory mistake from prod run 5d993b04 now resolves correctly. Added `--comments-file ` (and `-` for stdin) as a JSON-parsed file alternative for long payloads that don't survive shell quoting. Zero edits to shared infrastructure (cliCommandFactory, manifestGenerator, nativeToolPrompts, errorEnvelope) — the two declarative fields on `createPRReviewDef.parameters.comments.cliAliases` + `createPRReviewDef.cli.fileInputAlternatives` are everything. Proves spec 014's single-entrypoint invariant: a new or evolved gadget should never need to touch shared machinery. See [spec 014](docs/specs/014-cascade-tools-agent-ergonomics.md). - **`cascade-tools` agent ergonomics: truthful system prompt, runnable `--help`, structured error envelope** (spec 014, plan 1 of 2). 
The system-prompt renderer that describes every cascade-tools command to agents now tells the truth about array-shaped parameters — no more silent `s`-stripping of names, no more ` (repeatable)` claim for array-of-object flags (they correctly render as `-- ''` now, with aliases appended via `|` and a one-line runnable JSON example inlined from the tool definition's `examples` block). Every CLI failure — flag-parse, JSON-parse, missing-required, enum-mismatch, unknown-flag, auth, runtime — emits a single structured envelope on stdout (`{"success":false,"error":{type,flag?,message,got?,expected?,hint?,example?}}`) plus a short prose summary on stderr for humans, replacing the ad-hoc mix of `this.error()` prose and `{success:false,error:""}` flat shapes. Mistyped flags get a "did you mean" suggestion via Levenshtein match against declared canonical names + aliases. `--help` now renders `def.examples` as copy-pasteable shell invocations under an `EXAMPLES` section. Root-caused by prod run 5d993b04-6e05-4ae1-b7de-8c274cf3496b where a review agent wasted ~2½ min fighting the prior pre-014 surface and ultimately dropped an inline PR comment. See [spec 014](docs/specs/014-cascade-tools-agent-ergonomics.md) + authoring guide at [`src/gadgets/README.md`](src/gadgets/README.md). - **`cascade-tools` now streams subprocess output live** (spec 013). The shared subprocess helper (on top of `execa` + `tree-kill`) forwards child stdout/stderr to the parent's stderr line-by-line as it arrives, emits a heartbeat line on stderr every 30 seconds of child silence (configurable), enforces both an idle-silence timeout (default 120s) and a wall-clock timeout (default 600s) with SIGTERM→SIGKILL escalation, and kills the full process tree on timeout. `git push` and `git commit` invoked by `scm create-pr` pass tighter per-caller timeouts and now return captured hook output in the result on success (previously discarded). Result shape is backward-compatible — `{ stdout, stderr, exitCode }` preserved; new optional `reason: 'idle-timeout' | 'wall-timeout'` surfaces when the helper killed the child. Motivation: LLM-driven CASCADE agents watching an output file could not distinguish a slow pre-push hook (~60s of silence) from a hung process, leading to retry loops that burned 5–10+ minutes of run budget. See [spec 013](docs/specs/013-subprocess-output-streaming.md). 
diff --git a/docs/plans/015-router-job-dispatch-failure-recovery/1-failed-event-lock-compensation.md.wip b/docs/plans/015-router-job-dispatch-failure-recovery/1-failed-event-lock-compensation.md.done similarity index 97% rename from docs/plans/015-router-job-dispatch-failure-recovery/1-failed-event-lock-compensation.md.wip rename to docs/plans/015-router-job-dispatch-failure-recovery/1-failed-event-lock-compensation.md.done index 8d53025f..02a42e30 100644 --- a/docs/plans/015-router-job-dispatch-failure-recovery/1-failed-event-lock-compensation.md.wip +++ b/docs/plans/015-router-job-dispatch-failure-recovery/1-failed-event-lock-compensation.md.done @@ -6,7 +6,7 @@ plan_slug: failed-event-lock-compensation level: plan parent_spec: docs/specs/015-router-job-dispatch-failure-recovery.md depends_on: [] -status: wip +status: done --- # 015/1: Failed-event lock compensation + decision-reason taxonomy @@ -204,16 +204,16 @@ CLAUDE.md is intentionally NOT updated by this plan; Plan 2 ships the unified pa ## Progress -- [ ] AC #1 (compensator releases all 3 locks on both queues) -- [ ] AC #2 (subsequent webhook not blocked by stranded state) -- [ ] AC #3 (three decision-reason branches) -- [ ] AC #4 (wedged-lock Sentry canary) -- [ ] AC #5 (manual-run bypass regression) -- [ ] AC #6 (clean-exit regression) -- [ ] AC #7 (TDD discipline) -- [ ] AC #8 (build) -- [ ] AC #9 (unit tests) -- [ ] AC #10 (integration test) -- [ ] AC #11 (lint) -- [ ] AC #12 (typecheck) -- [ ] AC #13 (docs) +- [x] AC #1 (compensator releases all 3 locks on both queues) +- [x] AC #2 (subsequent webhook not blocked by stranded state) +- [x] AC #3 (three decision-reason branches) +- [x] AC #4 (wedged-lock Sentry canary) +- [x] AC #5 (manual-run bypass regression) +- [x] AC #6 (clean-exit regression) +- [x] AC #7 (TDD discipline) +- [x] AC #8 (build) +- [x] AC #9 (unit tests) +- [x] AC #10 (integration test) +- [x] AC #11 (lint) +- [x] AC #12 (typecheck) +- [x] AC #13 (docs) diff --git a/src/router/active-workers.ts b/src/router/active-workers.ts index 4836e6f9..44ccd34a 100644 --- a/src/router/active-workers.ts +++ b/src/router/active-workers.ts @@ -62,11 +62,26 @@ export function getActiveWorkerCount(): number { /** * Get summary info for currently active workers. + * + * Includes the resolved `(projectId, workItemId, agentType)` trio so callers + * (specifically the lock-state classifier added in spec 015/1) can correlate + * an in-memory lock count against actual dispatch state. The fields are + * `undefined` for workers whose job data didn't carry the corresponding + * identifier — never synthesized. */ -export function getActiveWorkers(): Array<{ jobId: string; startedAt: Date }> { +export function getActiveWorkers(): Array<{ + jobId: string; + startedAt: Date; + projectId?: string; + workItemId?: string; + agentType?: string; +}> { return Array.from(activeWorkers.values()).map((w) => ({ jobId: w.jobId, startedAt: w.startedAt, + projectId: w.projectId, + workItemId: w.workItemId, + agentType: w.agentType, })); } diff --git a/src/router/agent-type-lock.ts b/src/router/agent-type-lock.ts index 2320fff5..177056cd 100644 --- a/src/router/agent-type-lock.ts +++ b/src/router/agent-type-lock.ts @@ -149,6 +149,20 @@ export function wasRecentlyDispatched( return true; } +/** + * Compensating action for `markRecentlyDispatched` — used by the BullMQ + * `failed`-event handler so a permanently-failed dispatch doesn't keep + * deduping a fresh webhook for ~60s while the user re-tries. 
+ */ +export function clearRecentlyDispatched( + projectId: string, + agentType: string, + dedupScope?: string, +): void { + const key = makeDedupKey(projectId, agentType, dedupScope); + dedupMap.delete(key); +} + /** * Mark an agent type as recently dispatched for a project/scope. * The mark expires after DEDUP_TTL_MS and is NOT cleared on completion. diff --git a/src/router/bullmq-workers.ts b/src/router/bullmq-workers.ts index 603b3cba..c02a4ce2 100644 --- a/src/router/bullmq-workers.ts +++ b/src/router/bullmq-workers.ts @@ -10,6 +10,7 @@ import { type ConnectionOptions, type Job, Worker } from 'bullmq'; import { captureException } from '../sentry.js'; import { logger } from '../utils/logging.js'; import { parseRedisUrl } from '../utils/redis.js'; +import { releaseLocksForFailedJob } from './dispatch-compensator.js'; // Re-export so existing callers (worker-manager.ts) don't need to change imports. export { parseRedisUrl }; @@ -53,6 +54,25 @@ export function createQueueWorker(config: QueueWorkerConfig): Wo tags: { source: 'bullmq_dispatch', queue: queueName }, extra: { jobId: job?.id }, }); + // Compensate in-memory state (work-item lock, agent-type lock, + // recently-dispatched dedup mark) acquired by the webhook → enqueue + // path. Without this, dispatch failures wedge the locks until their + // TTLs expire — see spec 015. Compensator never throws, but we still + // guard so a future regression in it can't poison the worker. + if (job) { + void releaseLocksForFailedJob(job.data).catch((compErr) => { + logger.error( + '[WorkerManager] compensator threw — already swallowed by it; logging defensively', + { + jobId: job.id, + error: String(compErr), + }, + ); + captureException(compErr instanceof Error ? compErr : new Error(String(compErr)), { + tags: { source: 'dispatch_compensator_uncaught', queue: queueName }, + }); + }); + } }); worker.on('error', (err) => { diff --git a/src/router/dispatch-compensator.ts b/src/router/dispatch-compensator.ts new file mode 100644 index 00000000..826c9f5c --- /dev/null +++ b/src/router/dispatch-compensator.ts @@ -0,0 +1,49 @@ +/** + * Compensating action for dispatch failures. + * + * Released by BullMQ's `worker.on('failed')` handler so that any in-memory + * lock state acquired during the webhook → enqueue path (work-item lock, + * agent-type lock, recently-dispatched dedup mark) is freed the moment a + * dispatch attempt is declared dead. Without this, the lock entries leak + * for up to their TTL (work-item: 30 min) and silently reject every + * follow-up webhook for the same `(projectId, workItemId, agentType)`. + * + * The compensator NEVER propagates errors. A failure here would poison the + * BullMQ worker; instead we capture to Sentry and log, then resolve. + */ + +import { captureException } from '../sentry.js'; +import { logger } from '../utils/logging.js'; +import { clearAgentTypeEnqueued, clearRecentlyDispatched } from './agent-type-lock.js'; +import type { CascadeJob } from './queue.js'; +import { clearWorkItemEnqueued } from './work-item-lock.js'; +import { extractAgentType, extractProjectIdFromJob, extractWorkItemId } from './worker-env.js'; + +// Compensator accepts `unknown` because it runs from BullMQ's `failed` event +// where the job payload type is the queue's generic and not directly +// assignable to CascadeJob (manual-run / debug-analysis jobs come through +// `cascade-dashboard-jobs`). The extractors handle type-narrowing. 
+export async function releaseLocksForFailedJob(data: unknown): Promise<void> {
+  try {
+    const projectId = await extractProjectIdFromJob(data as CascadeJob);
+    if (!projectId) return;
+
+    const workItemId = extractWorkItemId(data as CascadeJob);
+    const agentType = extractAgentType(data as CascadeJob);
+
+    if (workItemId && agentType) {
+      clearWorkItemEnqueued(projectId, workItemId, agentType);
+    }
+    if (agentType) {
+      clearAgentTypeEnqueued(projectId, agentType);
+      clearRecentlyDispatched(projectId, agentType, workItemId);
+    }
+  } catch (err) {
+    logger.error('[dispatch-compensator] failed to release locks for failed job', {
+      error: String(err),
+    });
+    captureException(err instanceof Error ? err : new Error(String(err)), {
+      tags: { source: 'dispatch_compensator' },
+    });
+  }
+}
diff --git a/src/router/lock-state-classifier.ts b/src/router/lock-state-classifier.ts
new file mode 100644
index 00000000..65975001
--- /dev/null
+++ b/src/router/lock-state-classifier.ts
@@ -0,0 +1,70 @@
+/**
+ * Lock-state classifier.
+ *
+ * Given a held in-memory work-item lock for `(projectId, workItemId, agentType)`,
+ * decide whether the lock corresponds to actual dispatch state or whether it
+ * has been stranded by a dispatch failure that didn't compensate.
+ *
+ * Returns one of:
+ * - 'awaiting-slot' — an active worker or a queued/waiting job matches the
+ *   trio; the lock is healthy and the user-visible message should reflect
+ *   "queued behind another run."
+ * - 'wedged' — neither correlation matches; the lock is stranded. After
+ *   spec 015/1's compensator landed, this should never happen under normal
+ *   operation. Its presence is a regression invariant: the caller is
+ *   expected to capture it to Sentry as a canary.
+ *
+ * On classifier error (e.g. Redis hiccup during queue lookup), the result
+ * defaults to 'awaiting-slot' to avoid mis-emitting the wedged canary on a
+ * transient infrastructure blip.
+ */
+
+import { logger } from '../utils/logging.js';
+import { getActiveWorkers } from './active-workers.js';
+import type { CascadeJob } from './queue.js';
+import { jobQueue } from './queue.js';
+import { extractAgentType, extractProjectIdFromJob, extractWorkItemId } from './worker-env.js';
+
+export type LockStateClassification = 'awaiting-slot' | 'wedged';
+
+export interface LockStateInput {
+  projectId: string;
+  workItemId: string;
+  agentType: string;
+}
+
+export async function classifyLockState(input: LockStateInput): Promise<LockStateClassification> {
+  const { projectId, workItemId, agentType } = input;
+
+  // 1. Active worker correlation — fast in-memory map lookup.
+  const activeMatch = getActiveWorkers().some(
+    (w) => w.projectId === projectId && w.workItemId === workItemId && w.agentType === agentType,
+  );
+  if (activeMatch) return 'awaiting-slot';
+
+  // 2. BullMQ queue correlation — only if the lock is held without a
+  //    matching active worker. Limited to waiting/active states (jobs that
+  //    BullMQ might still pick up).
+  try {
+    const jobs = await jobQueue.getJobs(['waiting', 'active']);
+    for (const job of jobs) {
+      // `getJobs` returns `Job` per the queue's generic.
+ const data = job.data as CascadeJob; + const jobProjectId = await extractProjectIdFromJob(data); + if (jobProjectId !== projectId) continue; + if (extractWorkItemId(data) !== workItemId) continue; + if (extractAgentType(data) !== agentType) continue; + return 'awaiting-slot'; + } + } catch (err) { + logger.warn('[lock-state-classifier] queue lookup failed; defaulting to awaiting-slot', { + error: String(err), + projectId, + workItemId, + agentType, + }); + return 'awaiting-slot'; + } + + return 'wedged'; +} diff --git a/src/router/webhook-processor.ts b/src/router/webhook-processor.ts index aff014ff..d2c9d486 100644 --- a/src/router/webhook-processor.ts +++ b/src/router/webhook-processor.ts @@ -15,6 +15,7 @@ import { getCoalesceWindowMs, registerPendingCreate, } from '../pm/create-coalesce-window.js'; +import { captureException } from '../sentry.js'; import type { TriggerRegistry } from '../triggers/registry.js'; import { logger } from '../utils/logging.js'; import { isDuplicateAction, markActionProcessed } from './action-dedup.js'; @@ -23,6 +24,7 @@ import { markAgentTypeEnqueued, markRecentlyDispatched, } from './agent-type-lock.js'; +import { classifyLockState } from './lock-state-classifier.js'; import type { RouterPlatformAdapter } from './platform-adapter.js'; import { addJob } from './queue.js'; import { isWorkItemLocked, markWorkItemEnqueued } from './work-item-lock.js'; @@ -190,10 +192,44 @@ export async function processRouterWebhook( blockedAgentType: result.agentType, reason: lockStatus.reason, }); + // Spec 015/1: distinguish "queued behind a real active dispatch" from + // "lock leaked by a prior dispatch failure". Defaults to awaiting-slot + // on classifier error so a transient infra blip doesn't mis-fire the + // canary. + const classification = await classifyLockState({ + projectId: project.id, + workItemId: result.workItemId, + agentType: result.agentType, + }); + const reasonSuffix = lockStatus.reason ?? 'active run exists'; + if (classification === 'wedged') { + // Regression invariant: after spec 015/1 ships, this should never + // fire under normal operation. Capture loudly so any leak is + // observable in production. + captureException( + new Error( + `wedged work-item lock: projectId=${project.id} workItemId=${result.workItemId} agentType=${result.agentType}`, + ), + { + tags: { source: 'wedged_lock_canary' }, + extra: { + projectId: project.id, + workItemId: result.workItemId, + agentType: result.agentType, + reason: lockStatus.reason, + }, + }, + ); + return { + shouldProcess: true, + projectId: project.id, + decisionReason: `Work item locked (no active dispatch): ${reasonSuffix}`, + }; + } return { shouldProcess: true, projectId: project.id, - decisionReason: `Work item locked: ${lockStatus.reason ?? 'active run exists'}`, + decisionReason: `Awaiting worker slot: ${reasonSuffix}`, }; } } diff --git a/tests/integration/router/dispatch-failure-compensation.test.ts b/tests/integration/router/dispatch-failure-compensation.test.ts new file mode 100644 index 00000000..c9a945fe --- /dev/null +++ b/tests/integration/router/dispatch-failure-compensation.test.ts @@ -0,0 +1,193 @@ +/** + * Module-integration test for spec 015/1. 
+ * + * Wires the REAL `releaseLocksForFailedJob` compensator + REAL + * `bullmq-workers.ts` failed-event handler + REAL `agent-type-lock.ts` + * and `work-item-lock.ts` modules, mocking only BullMQ's `Worker` + * constructor (so we can drive the `failed` event synthetically) and the + * `worker-env.ts` extractors (so we don't need the manifest registry + + * DB lookups). This is the load-bearing seam from spec 015/1: when + * BullMQ declares a job failed, the lock state acquired during the + * webhook → enqueue path must be released — and a follow-up webhook + * for the same trio must NOT be blocked. + */ + +import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest'; + +vi.mock('bullmq', () => ({ + Worker: vi.fn().mockImplementation((_queueName, _processFn, _opts) => ({ + on: vi.fn(), + })), +})); + +vi.mock('../../../src/sentry.js', () => ({ + captureException: vi.fn(), +})); + +vi.mock('../../../src/utils/logging.js', () => ({ + logger: { info: vi.fn(), warn: vi.fn(), error: vi.fn(), debug: vi.fn() }, +})); + +vi.mock('../../../src/router/worker-env.js', () => ({ + extractProjectIdFromJob: vi.fn(), + extractWorkItemId: vi.fn(), + extractAgentType: vi.fn(), +})); + +import { Worker } from 'bullmq'; +import { + clearAllAgentTypeLocks, + markAgentTypeEnqueued, + markRecentlyDispatched, + wasRecentlyDispatched, +} from '../../../src/router/agent-type-lock.js'; +import { createQueueWorker } from '../../../src/router/bullmq-workers.js'; +import { + clearAllWorkItemLocks, + isWorkItemLocked, + markWorkItemEnqueued, +} from '../../../src/router/work-item-lock.js'; +import { + extractAgentType, + extractProjectIdFromJob, + extractWorkItemId, +} from '../../../src/router/worker-env.js'; + +const MockWorker = vi.mocked(Worker); +const mockExtractProjectIdFromJob = vi.mocked(extractProjectIdFromJob); +const mockExtractWorkItemId = vi.mocked(extractWorkItemId); +const mockExtractAgentType = vi.mocked(extractAgentType); + +describe('spec 015/1: dispatch-failure compensation (module-integration)', () => { + beforeEach(() => { + clearAllAgentTypeLocks(); + clearAllWorkItemLocks(); + MockWorker.mockClear(); + MockWorker.mockImplementation( + (_queueName, _processFn, _opts) => + ({ + on: vi.fn(), + }) as never, + ); + mockExtractProjectIdFromJob.mockReset(); + mockExtractWorkItemId.mockReset(); + mockExtractAgentType.mockReset(); + }); + + afterEach(() => { + vi.restoreAllMocks(); + }); + + it('releases work-item + agent-type + recently-dispatched locks when BullMQ declares the job failed', async () => { + // Webhook → enqueue path simulated: marks the locks like webhook-processor.ts does. + markWorkItemEnqueued('ucho', 'MNG-350', 'implementation'); + markAgentTypeEnqueued('ucho', 'implementation'); + markRecentlyDispatched('ucho', 'implementation', 'MNG-350'); + + // Sanity — locks are held. + expect((await isWorkItemLocked('ucho', 'MNG-350', 'implementation')).locked).toBe(true); + expect(wasRecentlyDispatched('ucho', 'implementation', 'MNG-350')).toBe(true); + + // Construct a real worker via the factory; capture its registered handlers. 
+ const worker = createQueueWorker({ + queueName: 'cascade-jobs', + label: 'Job', + connection: { host: 'localhost', port: 6379 }, + concurrency: 1, + lockDuration: 60_000, + processFn: vi.fn().mockResolvedValue(undefined), + }); + const failedHandler = vi.mocked(worker.on).mock.calls.find((c) => c[0] === 'failed')?.[1] as ( + job: { id: string; data: unknown } | undefined, + err: Error, + ) => void; + expect(failedHandler).toBeDefined(); + + // Drive the extractors so the compensator resolves to the same trio. + mockExtractProjectIdFromJob.mockResolvedValue('ucho'); + mockExtractWorkItemId.mockReturnValue('MNG-350'); + mockExtractAgentType.mockReturnValue('implementation'); + + // Synthetic failed event — the compensator runs as a side-effect. + failedHandler( + { id: 'linear-1777217350854-2qvhjo', data: { type: 'linear' } }, + new Error('No worker slots available'), + ); + + // Compensator is async; let microtasks drain. + await new Promise((r) => setImmediate(r)); + + // Locks released — a fresh webhook for the same trio is NOT blocked. + expect((await isWorkItemLocked('ucho', 'MNG-350', 'implementation')).locked).toBe(false); + expect(wasRecentlyDispatched('ucho', 'implementation', 'MNG-350')).toBe(false); + }); + + it('does NOT release locks for a job whose extractors return null projectId (foreign provider)', async () => { + markWorkItemEnqueued('ucho', 'MNG-350', 'implementation'); + expect((await isWorkItemLocked('ucho', 'MNG-350', 'implementation')).locked).toBe(true); + + const worker = createQueueWorker({ + queueName: 'cascade-jobs', + label: 'Job', + connection: { host: 'localhost', port: 6379 }, + concurrency: 1, + lockDuration: 60_000, + processFn: vi.fn().mockResolvedValue(undefined), + }); + const failedHandler = vi.mocked(worker.on).mock.calls.find((c) => c[0] === 'failed')?.[1] as ( + job: { id: string; data: unknown } | undefined, + err: Error, + ) => void; + + mockExtractProjectIdFromJob.mockResolvedValue(null); + mockExtractWorkItemId.mockReturnValue('MNG-350'); + mockExtractAgentType.mockReturnValue('implementation'); + + failedHandler({ id: 'foreign-job', data: { type: 'something-else' } }, new Error('boom')); + await new Promise((r) => setImmediate(r)); + + // Lock for ucho/MNG-350 stays — foreign-provider failures never touch + // the trio we care about. (extractors returned null projectId.) 
+ expect((await isWorkItemLocked('ucho', 'MNG-350', 'implementation')).locked).toBe(true); + }); + + it('manual-run job with full trio gets its locks released too (dashboard queue parity)', async () => { + markWorkItemEnqueued('ucho', 'MNG-350', 'implementation'); + markAgentTypeEnqueued('ucho', 'implementation'); + markRecentlyDispatched('ucho', 'implementation', 'MNG-350'); + + const worker = createQueueWorker({ + queueName: 'cascade-dashboard-jobs', + label: 'Dashboard job', + connection: { host: 'localhost', port: 6379 }, + concurrency: 1, + lockDuration: 60_000, + processFn: vi.fn().mockResolvedValue(undefined), + }); + const failedHandler = vi.mocked(worker.on).mock.calls.find((c) => c[0] === 'failed')?.[1] as ( + job: { id: string; data: unknown } | undefined, + err: Error, + ) => void; + + mockExtractProjectIdFromJob.mockResolvedValue('ucho'); + mockExtractWorkItemId.mockReturnValue('MNG-350'); + mockExtractAgentType.mockReturnValue('implementation'); + + failedHandler( + { + id: 'manual-run-1777219028558-jvvxni', + data: { + type: 'manual-run', + projectId: 'ucho', + workItemId: 'MNG-350', + agentType: 'implementation', + }, + }, + new Error('boom'), + ); + await new Promise((r) => setImmediate(r)); + + expect((await isWorkItemLocked('ucho', 'MNG-350', 'implementation')).locked).toBe(false); + expect(wasRecentlyDispatched('ucho', 'implementation', 'MNG-350')).toBe(false); + }); +}); diff --git a/tests/unit/router/active-workers.test.ts b/tests/unit/router/active-workers.test.ts index 8c396e23..33293b4e 100644 --- a/tests/unit/router/active-workers.test.ts +++ b/tests/unit/router/active-workers.test.ts @@ -115,7 +115,41 @@ describe('active-workers', () => { activeWorkers.set('job-1', makeActiveWorker({ jobId: 'job-1', startedAt })); const workers = getActiveWorkers(); expect(workers).toHaveLength(1); - expect(workers[0]).toEqual({ jobId: 'job-1', startedAt }); + // Allow extra (projectId/workItemId/agentType) fields — they're added + // in spec 015/1 so the lock-state classifier can correlate locks with + // active dispatch state. Pin only the load-bearing fields here. 
+ expect(workers[0]).toMatchObject({ jobId: 'job-1', startedAt }); + }); + + it('returns projectId, workItemId, agentType for each tracked worker (spec 015/1)', () => { + const startedAt = new Date(); + activeWorkers.set( + 'job-7', + makeActiveWorker({ + jobId: 'job-7', + startedAt, + projectId: 'ucho', + workItemId: 'MNG-350', + agentType: 'implementation', + }), + ); + const workers = getActiveWorkers(); + expect(workers).toHaveLength(1); + expect(workers[0]).toMatchObject({ + jobId: 'job-7', + startedAt, + projectId: 'ucho', + workItemId: 'MNG-350', + agentType: 'implementation', + }); + }); + + it('omitted projectId/workItemId/agentType remain undefined (no synthetic defaults)', () => { + activeWorkers.set('job-bare', makeActiveWorker({ jobId: 'job-bare' })); + const workers = getActiveWorkers(); + expect(workers[0]?.projectId).toBeUndefined(); + expect(workers[0]?.workItemId).toBeUndefined(); + expect(workers[0]?.agentType).toBeUndefined(); }); }); diff --git a/tests/unit/router/agent-type-lock.test.ts b/tests/unit/router/agent-type-lock.test.ts index c296c091..744514c0 100644 --- a/tests/unit/router/agent-type-lock.test.ts +++ b/tests/unit/router/agent-type-lock.test.ts @@ -19,6 +19,7 @@ import { checkAgentTypeConcurrency, clearAgentTypeEnqueued, clearAllAgentTypeLocks, + clearRecentlyDispatched, isAgentTypeLocked, markAgentTypeEnqueued, markRecentlyDispatched, @@ -194,6 +195,28 @@ describe('agent-type-lock', () => { expect(wasRecentlyDispatched('proj1', 'planning', 'TF-10')).toBe(false); expect(wasRecentlyDispatched('proj1', 'planning')).toBe(false); }); + + it('clearRecentlyDispatched removes the dedup entry for a (projectId, agentType, dedupScope) trio', () => { + markRecentlyDispatched('proj1', 'implementation', 'w1'); + expect(wasRecentlyDispatched('proj1', 'implementation', 'w1')).toBe(true); + clearRecentlyDispatched('proj1', 'implementation', 'w1'); + expect(wasRecentlyDispatched('proj1', 'implementation', 'w1')).toBe(false); + }); + + it('clearRecentlyDispatched is a no-op when the key was not previously marked', () => { + expect(() => clearRecentlyDispatched('proj1', 'implementation', 'w1')).not.toThrow(); + expect(wasRecentlyDispatched('proj1', 'implementation', 'w1')).toBe(false); + }); + + it('clearRecentlyDispatched leaves entries for other (agentType, scope) keys untouched', () => { + markRecentlyDispatched('proj1', 'implementation', 'w1'); + markRecentlyDispatched('proj1', 'review', 'w1'); + markRecentlyDispatched('proj1', 'implementation', 'w2'); + clearRecentlyDispatched('proj1', 'implementation', 'w1'); + expect(wasRecentlyDispatched('proj1', 'implementation', 'w1')).toBe(false); + expect(wasRecentlyDispatched('proj1', 'review', 'w1')).toBe(true); + expect(wasRecentlyDispatched('proj1', 'implementation', 'w2')).toBe(true); + }); }); // ======================================================================== diff --git a/tests/unit/router/bullmq-workers.test.ts b/tests/unit/router/bullmq-workers.test.ts index c2f9a78f..bc6ddc41 100644 --- a/tests/unit/router/bullmq-workers.test.ts +++ b/tests/unit/router/bullmq-workers.test.ts @@ -14,6 +14,10 @@ vi.mock('../../../src/sentry.js', () => ({ captureException: vi.fn(), })); +vi.mock('../../../src/router/dispatch-compensator.js', () => ({ + releaseLocksForFailedJob: vi.fn().mockResolvedValue(undefined), +})); + // Mock logger vi.mock('../../../src/utils/logging.js', () => ({ logger: { @@ -30,16 +34,20 @@ vi.mock('../../../src/utils/logging.js', () => ({ import { Worker } from 'bullmq'; import { createQueueWorker, 
parseRedisUrl } from '../../../src/router/bullmq-workers.js'; +import { releaseLocksForFailedJob } from '../../../src/router/dispatch-compensator.js'; import { captureException } from '../../../src/sentry.js'; import { logger } from '../../../src/utils/logging.js'; const MockWorker = vi.mocked(Worker); const mockCaptureException = vi.mocked(captureException); const mockLogger = vi.mocked(logger); +const mockReleaseLocksForFailedJob = vi.mocked(releaseLocksForFailedJob); beforeEach(() => { MockWorker.mockClear(); mockCaptureException.mockClear(); + mockReleaseLocksForFailedJob.mockClear(); + mockReleaseLocksForFailedJob.mockResolvedValue(undefined); // Re-establish default mock so each test gets a fresh mock worker MockWorker.mockImplementation( (_queueName, _processFn, _opts) => @@ -192,4 +200,71 @@ describe('createQueueWorker', () => { }), ); }); + + it("worker.on('failed') invokes releaseLocksForFailedJob with job.data", () => { + const worker = createQueueWorker(baseConfig); + const mockOn = vi.mocked(worker.on); + + const failedCall = mockOn.mock.calls.find((call) => call[0] === 'failed'); + const handler = failedCall?.[1] as ( + job: { id: string; data: unknown } | undefined, + err: Error, + ) => void; + const jobData = { type: 'linear', payload: 'foo' }; + handler({ id: 'job-99', data: jobData }, new Error('boom')); + + expect(mockReleaseLocksForFailedJob).toHaveBeenCalledTimes(1); + expect(mockReleaseLocksForFailedJob).toHaveBeenCalledWith(jobData); + }); + + it("worker.on('failed') still logs and Sentries on top of compensating", () => { + mockLogger.error.mockReset(); + const worker = createQueueWorker(baseConfig); + + const handler = vi.mocked(worker.on).mock.calls.find((c) => c[0] === 'failed')?.[1] as ( + job: { id: string; data: unknown } | undefined, + err: Error, + ) => void; + handler({ id: 'job-100', data: { type: 'github' } }, new Error('nope')); + + expect(mockLogger.error).toHaveBeenCalled(); + expect(mockCaptureException).toHaveBeenCalled(); + expect(mockReleaseLocksForFailedJob).toHaveBeenCalled(); + }); + + it("worker.on('failed') swallows compensator throws", async () => { + mockReleaseLocksForFailedJob.mockRejectedValueOnce(new Error('compensator boom')); + const worker = createQueueWorker(baseConfig); + const mockOn = vi.mocked(worker.on); + + const handler = mockOn.mock.calls.find((c) => c[0] === 'failed')?.[1] as ( + job: { id: string; data: unknown } | undefined, + err: Error, + ) => void; + + // Calling the handler must not propagate the compensator rejection. + // We invoke it and let the microtask queue drain — there must be no + // unhandled rejection in test logs. + expect(() => + handler({ id: 'job-101', data: { type: 'github' } }, new Error('x')), + ).not.toThrow(); + // Drain the rejection by giving microtasks a turn. + await new Promise((r) => setImmediate(r)); + // Test passes if we got here without an unhandled rejection killing vitest. 
+ }); + + it("worker.on('failed') does not call compensator when job is undefined", () => { + const worker = createQueueWorker(baseConfig); + const mockOn = vi.mocked(worker.on); + + const handler = mockOn.mock.calls.find((c) => c[0] === 'failed')?.[1] as ( + job: { id: string; data: unknown } | undefined, + err: Error, + ) => void; + handler(undefined, new Error('orphan')); + + expect(mockReleaseLocksForFailedJob).not.toHaveBeenCalled(); + // Existing log + Sentry behavior preserved + expect(mockLogger.error).toHaveBeenCalled(); + }); }); diff --git a/tests/unit/router/dispatch-compensator.test.ts b/tests/unit/router/dispatch-compensator.test.ts new file mode 100644 index 00000000..8ec7441c --- /dev/null +++ b/tests/unit/router/dispatch-compensator.test.ts @@ -0,0 +1,148 @@ +import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest'; + +// Mock the worker-env extractors so we can drive their return values per test +// without standing up the manifest registry / DB lookups they normally consult. +vi.mock('../../../src/router/worker-env.js', () => ({ + extractProjectIdFromJob: vi.fn(), + extractWorkItemId: vi.fn(), + extractAgentType: vi.fn(), +})); + +vi.mock('../../../src/router/work-item-lock.js', () => ({ + clearWorkItemEnqueued: vi.fn(), +})); + +vi.mock('../../../src/router/agent-type-lock.js', () => ({ + clearAgentTypeEnqueued: vi.fn(), + clearRecentlyDispatched: vi.fn(), +})); + +vi.mock('../../../src/sentry.js', () => ({ + captureException: vi.fn(), +})); + +vi.mock('../../../src/utils/logging.js', () => ({ + logger: { info: vi.fn(), warn: vi.fn(), error: vi.fn(), debug: vi.fn() }, +})); + +import { + clearAgentTypeEnqueued, + clearRecentlyDispatched, +} from '../../../src/router/agent-type-lock.js'; +import { releaseLocksForFailedJob } from '../../../src/router/dispatch-compensator.js'; +import { clearWorkItemEnqueued } from '../../../src/router/work-item-lock.js'; +import { + extractAgentType, + extractProjectIdFromJob, + extractWorkItemId, +} from '../../../src/router/worker-env.js'; +import { captureException } from '../../../src/sentry.js'; + +const mockExtractProjectIdFromJob = vi.mocked(extractProjectIdFromJob); +const mockExtractWorkItemId = vi.mocked(extractWorkItemId); +const mockExtractAgentType = vi.mocked(extractAgentType); +const mockClearWorkItemEnqueued = vi.mocked(clearWorkItemEnqueued); +const mockClearAgentTypeEnqueued = vi.mocked(clearAgentTypeEnqueued); +const mockClearRecentlyDispatched = vi.mocked(clearRecentlyDispatched); +const mockCaptureException = vi.mocked(captureException); + +describe('releaseLocksForFailedJob', () => { + beforeEach(() => { + mockExtractProjectIdFromJob.mockReset(); + mockExtractWorkItemId.mockReset(); + mockExtractAgentType.mockReset(); + mockClearWorkItemEnqueued.mockReset(); + mockClearAgentTypeEnqueued.mockReset(); + mockClearRecentlyDispatched.mockReset(); + mockCaptureException.mockReset(); + }); + + afterEach(() => { + vi.restoreAllMocks(); + }); + + it('releases work-item, agent-type, and recently-dispatched marks for a CascadeJob with all three identifiers', async () => { + mockExtractProjectIdFromJob.mockResolvedValue('p1'); + mockExtractWorkItemId.mockReturnValue('w1'); + mockExtractAgentType.mockReturnValue('implementation'); + + // biome-ignore lint/suspicious/noExplicitAny: test fixture, shape is irrelevant + await releaseLocksForFailedJob({ type: 'linear' } as any); + + expect(mockClearWorkItemEnqueued).toHaveBeenCalledTimes(1); + expect(mockClearWorkItemEnqueued).toHaveBeenCalledWith('p1', 'w1', 
'implementation'); + expect(mockClearAgentTypeEnqueued).toHaveBeenCalledTimes(1); + expect(mockClearAgentTypeEnqueued).toHaveBeenCalledWith('p1', 'implementation'); + expect(mockClearRecentlyDispatched).toHaveBeenCalledTimes(1); + expect(mockClearRecentlyDispatched).toHaveBeenCalledWith('p1', 'implementation', 'w1'); + }); + + it('no-ops cleanly when projectId is null (e.g. foreign-provider payload)', async () => { + mockExtractProjectIdFromJob.mockResolvedValue(null); + mockExtractWorkItemId.mockReturnValue('w1'); + mockExtractAgentType.mockReturnValue('implementation'); + + // biome-ignore lint/suspicious/noExplicitAny: test fixture + await releaseLocksForFailedJob({ type: 'linear' } as any); + + expect(mockClearWorkItemEnqueued).not.toHaveBeenCalled(); + expect(mockClearAgentTypeEnqueued).not.toHaveBeenCalled(); + expect(mockClearRecentlyDispatched).not.toHaveBeenCalled(); + }); + + it('releases agent-type-lock + recently-dispatched even when workItemId is undefined', async () => { + mockExtractProjectIdFromJob.mockResolvedValue('p1'); + mockExtractWorkItemId.mockReturnValue(undefined); + mockExtractAgentType.mockReturnValue('backlog-manager'); + + // biome-ignore lint/suspicious/noExplicitAny: test fixture + await releaseLocksForFailedJob({ type: 'manual-run', projectId: 'p1' } as any); + + expect(mockClearWorkItemEnqueued).not.toHaveBeenCalled(); + expect(mockClearAgentTypeEnqueued).toHaveBeenCalledWith('p1', 'backlog-manager'); + expect(mockClearRecentlyDispatched).toHaveBeenCalledWith('p1', 'backlog-manager', undefined); + }); + + it('handles a DashboardJob (manual-run) without throwing', async () => { + mockExtractProjectIdFromJob.mockResolvedValue('p1'); + mockExtractWorkItemId.mockReturnValue('MNG-350'); + mockExtractAgentType.mockReturnValue('implementation'); + + await expect( + releaseLocksForFailedJob({ + type: 'manual-run', + projectId: 'p1', + workItemId: 'MNG-350', + agentType: 'implementation', + }), + ).resolves.toBeUndefined(); + expect(mockClearWorkItemEnqueued).toHaveBeenCalledWith('p1', 'MNG-350', 'implementation'); + }); + + it('captureException when an extractor throws; never propagates', async () => { + mockExtractProjectIdFromJob.mockRejectedValue(new Error('extractor boom')); + + await expect( + // biome-ignore lint/suspicious/noExplicitAny: test fixture + releaseLocksForFailedJob({ type: 'linear' } as any), + ).resolves.toBeUndefined(); + + expect(mockCaptureException).toHaveBeenCalledTimes(1); + const [errArg, ctx] = mockCaptureException.mock.calls[0] ?? 
[]; + expect(errArg).toBeInstanceOf(Error); + expect(ctx?.tags?.source).toBe('dispatch_compensator'); + }); + + it('skips agent-type / recently-dispatched if agentType is undefined', async () => { + mockExtractProjectIdFromJob.mockResolvedValue('p1'); + mockExtractWorkItemId.mockReturnValue('w1'); + mockExtractAgentType.mockReturnValue(undefined); + + // biome-ignore lint/suspicious/noExplicitAny: test fixture + await releaseLocksForFailedJob({ type: 'github' } as any); + + expect(mockClearWorkItemEnqueued).not.toHaveBeenCalled(); + expect(mockClearAgentTypeEnqueued).not.toHaveBeenCalled(); + expect(mockClearRecentlyDispatched).not.toHaveBeenCalled(); + }); +}); diff --git a/tests/unit/router/lock-state-classifier.test.ts b/tests/unit/router/lock-state-classifier.test.ts new file mode 100644 index 00000000..ee436508 --- /dev/null +++ b/tests/unit/router/lock-state-classifier.test.ts @@ -0,0 +1,122 @@ +import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest'; + +vi.mock('../../../src/router/active-workers.js', () => ({ + getActiveWorkers: vi.fn(), +})); + +vi.mock('../../../src/router/queue.js', () => ({ + jobQueue: { + getJobs: vi.fn(), + }, +})); + +vi.mock('../../../src/router/worker-env.js', () => ({ + extractProjectIdFromJob: vi.fn(), + extractWorkItemId: vi.fn(), + extractAgentType: vi.fn(), +})); + +vi.mock('../../../src/utils/logging.js', () => ({ + logger: { info: vi.fn(), warn: vi.fn(), error: vi.fn(), debug: vi.fn() }, +})); + +import { getActiveWorkers } from '../../../src/router/active-workers.js'; +import { classifyLockState } from '../../../src/router/lock-state-classifier.js'; +import { jobQueue } from '../../../src/router/queue.js'; +import { + extractAgentType, + extractProjectIdFromJob, + extractWorkItemId, +} from '../../../src/router/worker-env.js'; + +const mockGetActiveWorkers = vi.mocked(getActiveWorkers); +const mockGetJobs = vi.mocked(jobQueue.getJobs); +const mockExtractProjectIdFromJob = vi.mocked(extractProjectIdFromJob); +const mockExtractWorkItemId = vi.mocked(extractWorkItemId); +const mockExtractAgentType = vi.mocked(extractAgentType); + +describe('classifyLockState', () => { + const trio = { projectId: 'ucho', workItemId: 'MNG-350', agentType: 'implementation' as const }; + + beforeEach(() => { + mockGetActiveWorkers.mockReset(); + mockGetJobs.mockReset(); + mockExtractProjectIdFromJob.mockReset(); + mockExtractWorkItemId.mockReset(); + mockExtractAgentType.mockReset(); + }); + + afterEach(() => { + vi.restoreAllMocks(); + }); + + it("returns 'awaiting-slot' when an active worker matches the trio", async () => { + mockGetActiveWorkers.mockReturnValue([ + { + jobId: 'job-x', + startedAt: new Date(), + projectId: 'ucho', + workItemId: 'MNG-350', + agentType: 'implementation', + }, + ]); + mockGetJobs.mockResolvedValue([]); + + const result = await classifyLockState(trio); + expect(result).toBe('awaiting-slot'); + }); + + it("returns 'awaiting-slot' when a queued job in waiting/active state matches the trio", async () => { + mockGetActiveWorkers.mockReturnValue([]); + // One job with matching extracted trio + mockGetJobs.mockResolvedValue([ + // biome-ignore lint/suspicious/noExplicitAny: BullMQ Job test fixture + { id: 'q-1', data: { type: 'linear' } } as any, + ]); + mockExtractProjectIdFromJob.mockResolvedValue('ucho'); + mockExtractWorkItemId.mockReturnValue('MNG-350'); + mockExtractAgentType.mockReturnValue('implementation'); + + const result = await classifyLockState(trio); + expect(result).toBe('awaiting-slot'); + }); + + it("returns 
'wedged' when no active worker and no queued job matches", async () => { + mockGetActiveWorkers.mockReturnValue([]); + mockGetJobs.mockResolvedValue([]); + + const result = await classifyLockState(trio); + expect(result).toBe('wedged'); + }); + + it("returns 'wedged' when active workers and queued jobs exist but for a different trio", async () => { + mockGetActiveWorkers.mockReturnValue([ + { + jobId: 'job-other', + startedAt: new Date(), + projectId: 'ucho', + workItemId: 'MNG-999', + agentType: 'implementation', + }, + ]); + mockGetJobs.mockResolvedValue([ + // biome-ignore lint/suspicious/noExplicitAny: test fixture + { id: 'q-other', data: { type: 'linear' } } as any, + ]); + mockExtractProjectIdFromJob.mockResolvedValue('ucho'); + mockExtractWorkItemId.mockReturnValue('MNG-998'); + mockExtractAgentType.mockReturnValue('implementation'); + + const result = await classifyLockState(trio); + expect(result).toBe('wedged'); + }); + + it("returns 'awaiting-slot' (safe fallback) when the queue lookup throws", async () => { + mockGetActiveWorkers.mockReturnValue([]); + mockGetJobs.mockRejectedValue(new Error('redis hiccup')); + + const result = await classifyLockState(trio); + // Safe fallback: do NOT mis-emit the wedged-lock canary on classifier error. + expect(result).toBe('awaiting-slot'); + }); +}); diff --git a/tests/unit/router/webhook-processor.test.ts b/tests/unit/router/webhook-processor.test.ts index 49373f70..3c7e4701 100644 --- a/tests/unit/router/webhook-processor.test.ts +++ b/tests/unit/router/webhook-processor.test.ts @@ -24,15 +24,23 @@ vi.mock('../../../src/router/action-dedup.js', () => ({ isDuplicateAction: vi.fn().mockReturnValue(false), markActionProcessed: vi.fn(), })); +vi.mock('../../../src/router/lock-state-classifier.js', () => ({ + classifyLockState: vi.fn().mockResolvedValue('awaiting-slot'), +})); +vi.mock('../../../src/sentry.js', () => ({ + captureException: vi.fn(), +})); import { isDuplicateAction, markActionProcessed } from '../../../src/router/action-dedup.js'; import { checkAgentTypeConcurrency } from '../../../src/router/agent-type-lock.js'; import type { RouterProjectConfig } from '../../../src/router/config.js'; +import { classifyLockState } from '../../../src/router/lock-state-classifier.js'; import type { RouterPlatformAdapter } from '../../../src/router/platform-adapter.js'; import type { CascadeJob } from '../../../src/router/queue.js'; import { addJob } from '../../../src/router/queue.js'; import { processRouterWebhook } from '../../../src/router/webhook-processor.js'; import { isWorkItemLocked, markWorkItemEnqueued } from '../../../src/router/work-item-lock.js'; +import { captureException } from '../../../src/sentry.js'; import type { TriggerRegistry } from '../../../src/triggers/registry.js'; const mockProject: RouterProjectConfig = { @@ -348,7 +356,7 @@ describe('processRouterWebhook', () => { expect(addJob).toHaveBeenCalled(); }); - it('skips job when work item is locked', async () => { + it("emits 'Awaiting worker slot' when lock held and classifier returns 'awaiting-slot' (spec 015/1)", async () => { const triggerResult = { agentType: 'implementation', agentInput: { cardId: 'card1' }, @@ -356,8 +364,9 @@ describe('processRouterWebhook', () => { }; vi.mocked(isWorkItemLocked).mockResolvedValueOnce({ locked: true, - reason: 'db: active run exists', + reason: 'in-memory same-type: 1 enqueued (max 1 per type)', }); + vi.mocked(classifyLockState).mockResolvedValueOnce('awaiting-slot'); const adapter = makeMockAdapter({ dispatchWithCredentials: 
vi.fn().mockResolvedValue(triggerResult), }); @@ -365,11 +374,93 @@ describe('processRouterWebhook', () => { const result = await processRouterWebhook(adapter, {}, mockTriggerRegistry); expect(result.shouldProcess).toBe(true); expect(result.projectId).toBe('p1'); - expect(result.decisionReason).toBe('Work item locked: db: active run exists'); + expect(result.decisionReason).toBe( + 'Awaiting worker slot: in-memory same-type: 1 enqueued (max 1 per type)', + ); expect(addJob).not.toHaveBeenCalled(); expect(adapter.postAck).not.toHaveBeenCalled(); }); + it("emits 'Work item locked (no active dispatch)' when classifier returns 'wedged' (spec 015/1)", async () => { + const triggerResult = { + agentType: 'implementation', + agentInput: { cardId: 'card1' }, + workItemId: 'card1', + }; + vi.mocked(isWorkItemLocked).mockResolvedValueOnce({ + locked: true, + reason: 'in-memory same-type: 1 enqueued (max 1 per type)', + }); + vi.mocked(classifyLockState).mockResolvedValueOnce('wedged'); + const adapter = makeMockAdapter({ + dispatchWithCredentials: vi.fn().mockResolvedValue(triggerResult), + }); + + const result = await processRouterWebhook(adapter, {}, mockTriggerRegistry); + expect(result.shouldProcess).toBe(true); + expect(result.decisionReason).toBe( + 'Work item locked (no active dispatch): in-memory same-type: 1 enqueued (max 1 per type)', + ); + // Wedged-lock canary fires a Sentry capture so the regression invariant + // is loud — see spec 015/1 AC #6. + expect(captureException).toHaveBeenCalledWith( + expect.any(Error), + expect.objectContaining({ + tags: expect.objectContaining({ source: 'wedged_lock_canary' }), + }), + ); + }); + + it('preserves existing log fields when work item is locked', async () => { + const triggerResult = { + agentType: 'implementation', + agentInput: { cardId: 'card1' }, + workItemId: 'card1', + }; + vi.mocked(isWorkItemLocked).mockResolvedValueOnce({ + locked: true, + reason: 'db: active run exists', + }); + vi.mocked(classifyLockState).mockResolvedValueOnce('awaiting-slot'); + const { logger } = await import('../../../src/utils/logging.js'); + vi.mocked(logger.info).mockClear(); + const adapter = makeMockAdapter({ + dispatchWithCredentials: vi.fn().mockResolvedValue(triggerResult), + }); + + await processRouterWebhook(adapter, {}, mockTriggerRegistry); + // Find the Skipping … log call. Existing structure pins these fields. + const skipCall = vi + .mocked(logger.info) + .mock.calls.find((c) => String(c[0]).includes('work item already locked')); + expect(skipCall).toBeDefined(); + expect(skipCall?.[1]).toMatchObject({ + source: 'trello', + projectId: 'p1', + workItemId: 'card1', + blockedAgentType: 'implementation', + reason: 'db: active run exists', + }); + }); + + it('does not call classifyLockState when work item is not locked (perf invariant)', async () => { + vi.mocked(classifyLockState).mockClear(); + vi.mocked(isWorkItemLocked).mockResolvedValueOnce({ locked: false }); + const triggerResult = { + agentType: 'implementation', + agentInput: { cardId: 'card1' }, + workItemId: 'card1', + }; + vi.mocked(addJob).mockResolvedValueOnce('job-x'); + const adapter = makeMockAdapter({ + dispatchWithCredentials: vi.fn().mockResolvedValue(triggerResult), + }); + + await processRouterWebhook(adapter, {}, mockTriggerRegistry); + // Happy path must not pay the queue-lookup cost. 
+ expect(classifyLockState).not.toHaveBeenCalled(); + }); + it('calls onBlocked when work item is locked', async () => { const onBlocked = vi.fn(); const triggerResult = { From a12ad40601b107fcb06304327927841b20e42526 Mon Sep 17 00:00:00 2001 From: Zbigniew Sobiecki Date: Sun, 26 Apr 2026 16:34:11 +0000 Subject: [PATCH 4/7] chore(plan): lock 015/2 wait-for-slot-and-retry-classifier Co-Authored-By: Claude Opus 4.7 (1M context) --- ...-classifier.md => 2-wait-for-slot-and-retry-classifier.md.wip} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/plans/015-router-job-dispatch-failure-recovery/{2-wait-for-slot-and-retry-classifier.md => 2-wait-for-slot-and-retry-classifier.md.wip} (100%) diff --git a/docs/plans/015-router-job-dispatch-failure-recovery/2-wait-for-slot-and-retry-classifier.md b/docs/plans/015-router-job-dispatch-failure-recovery/2-wait-for-slot-and-retry-classifier.md.wip similarity index 100% rename from docs/plans/015-router-job-dispatch-failure-recovery/2-wait-for-slot-and-retry-classifier.md rename to docs/plans/015-router-job-dispatch-failure-recovery/2-wait-for-slot-and-retry-classifier.md.wip From cf0c59be60aec193d9b5f008f5c506ab2632b61b Mon Sep 17 00:00:00 2001 From: Zbigniew Sobiecki Date: Sun, 26 Apr 2026 16:34:36 +0000 Subject: [PATCH 5/7] chore(plan): mark 015/2 status: wip --- .../2-wait-for-slot-and-retry-classifier.md.wip | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/plans/015-router-job-dispatch-failure-recovery/2-wait-for-slot-and-retry-classifier.md.wip b/docs/plans/015-router-job-dispatch-failure-recovery/2-wait-for-slot-and-retry-classifier.md.wip index 37f3e24c..a3d3dc32 100644 --- a/docs/plans/015-router-job-dispatch-failure-recovery/2-wait-for-slot-and-retry-classifier.md.wip +++ b/docs/plans/015-router-job-dispatch-failure-recovery/2-wait-for-slot-and-retry-classifier.md.wip @@ -6,7 +6,7 @@ plan_slug: wait-for-slot-and-retry-classifier level: plan parent_spec: docs/specs/015-router-job-dispatch-failure-recovery.md depends_on: [1-failed-event-lock-compensation.md] -status: pending +status: wip --- # 015/2: Wait-for-slot capacity, retry budget, error classifier From f0ac965f86f070563cce6c61be9e8cefc85025c6 Mon Sep 17 00:00:00 2001 From: Zbigniew Sobiecki Date: Sun, 26 Apr 2026 16:47:20 +0000 Subject: [PATCH 6/7] =?UTF-8?q?feat(router):=20plan=20015/2=20done=20?= =?UTF-8?q?=E2=80=94=20wait-for-slot=20+=20retry=20budget=20+=20classifier?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes the lost-job half of spec 015's bug class. Combined with plan 015/1, the silent black-hole failure mode verified live in prod on 2026-04-26 (ucho/MNG-350) is now fully closed. What landed: - `src/router/slot-waiter.ts` (new) — semaphore-style primitive: `acquireSlot({ timeoutMs })` resolves immediately when capacity is below `routerConfig.maxWorkers`, otherwise queues a FIFO waiter with a bounded timeout that rejects with `code: 'SLOT_WAIT_TIMEOUT'`. `slotReleased()` pops the head waiter; `clearAllWaiters()` rejects every pending waiter with `code: 'SHUTDOWN'` on router stop. - `src/router/dispatch-error-classifier.ts` (new) — classifies thrown errors into `'transient'` (Docker socket Node codes, HTTP 429/409, SLOT_WAIT_TIMEOUT, anything unknown — default-to-retry) vs `'terminal'` (TypeError, ZodError, image-not-found-after-fallback). 
- `src/router/worker-manager.ts` — `guardedSpawn` rewritten: `await acquireSlot(...)` replaces the synchronous capacity throw; on spawn error, terminal errors are wrapped in BullMQ's `UnrecoverableError` so retries skip; transient errors propagate unchanged so BullMQ retries via attempts/backoff. - `src/router/active-workers.ts` — `cleanupWorker` now calls `slotReleased()` exactly once per cleanup, including on the crash path. The existing `if (worker)` guard ensures idempotence. - `src/router/config.ts` — new `slotWaitTimeoutMs` field (default 5min, configurable via `SLOT_WAIT_TIMEOUT_MS`). - `src/router/queue.ts` and `src/queue/client.ts` — both queues now default to `attempts: 4` with `backoff: { type: 'exponential', delay: 5000 }` (~75s total before exhaustion). Terminal errors bypass via `UnrecoverableError`. - `src/router/container-manager.ts` — exports the existing `isImageNotFoundError` predicate so the classifier can reuse it. Test contract change (spec AC #9): The previous `tests/unit/router/worker-manager.test.ts:179` assertion `'processFn throws when at capacity'` is REPLACED (not deleted) with `'processFn awaits a slot when at capacity, then dispatches when one frees'`. The throw-on-capacity contract is gone forever. Tests: - 7 new unit tests in `slot-waiter.test.ts` (FIFO, timeout, no-op, shutdown rejection) - 11 new unit tests in `dispatch-error-classifier.test.ts` covering every transient/terminal class - 4 new unit tests in `worker-manager.test.ts` (replaced original capacity-throw test + 3 for retry classification) - 3 new unit tests in `active-workers.test.ts` for slotReleased integration - 5 new module-integration tests in `dispatch-retry.test.ts` exercise REAL guardedSpawn + REAL slot-waiter + REAL dispatch-error-classifier against both queues, mocking only spawnWorker + BullMQ Worker constructor. Plan 1's 3 module-integration tests continue to pass alongside plan 2's 5. Full unit suite: 8539 passed / 23 skipped / 0 failed. CLAUDE.md updated with a new "Dispatch failure semantics" section documenting the unified contract (capacity wait, retry budget, classifier, three-way decision-reason taxonomy from plan 1, wedged-lock canary). File now 182 lines, under the 200-line cap. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 1 + CLAUDE.md | 13 ++ ...ait-for-slot-and-retry-classifier.md.done} | 36 +-- src/queue/client.ts | 7 +- src/router/active-workers.ts | 5 + src/router/config.ts | 7 + src/router/container-manager.ts | 5 +- src/router/dispatch-error-classifier.ts | 66 ++++++ src/router/queue.ts | 9 +- src/router/slot-waiter.ts | 88 ++++++++ src/router/worker-manager.ts | 44 +++- .../integration/router/dispatch-retry.test.ts | 211 ++++++++++++++++++ tests/unit/router/active-workers.test.ts | 26 +++ .../router/dispatch-error-classifier.test.ts | 66 ++++++ tests/unit/router/slot-waiter.test.ts | 133 +++++++++++ tests/unit/router/worker-manager.test.ts | 114 +++++++++- 16 files changed, 794 insertions(+), 37 deletions(-) rename docs/plans/015-router-job-dispatch-failure-recovery/{2-wait-for-slot-and-retry-classifier.md.wip => 2-wait-for-slot-and-retry-classifier.md.done} (97%) create mode 100644 src/router/dispatch-error-classifier.ts create mode 100644 src/router/slot-waiter.ts create mode 100644 tests/integration/router/dispatch-retry.test.ts create mode 100644 tests/unit/router/dispatch-error-classifier.test.ts create mode 100644 tests/unit/router/slot-waiter.test.ts diff --git a/CHANGELOG.md b/CHANGELOG.md index 66ea8f33..cc05c2f8 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,6 +6,7 @@ All notable user-visible changes to CASCADE are documented here. The format is l ### Changed +- **Router dispatch capacity now waits for a slot; transient Docker errors retry; terminal errors fail fast** (spec 015, plan 2 of 2). Replaces `guardedSpawn`'s synchronous "No worker slots available" throw with an in-process slot-waiter (default 5min timeout, configurable via `SLOT_WAIT_TIMEOUT_MS`). Adds a dispatch-error classifier that splits transient (`ECONNREFUSED` / `ECONNRESET` / `ENOTFOUND` / HTTP 429 / container-name 409 / `SLOT_WAIT_TIMEOUT`) from terminal (`TypeError` / `ZodError` / image-not-found-after-fallback). Both `cascade-jobs` and `cascade-dashboard-jobs` queue defaults now specify `attempts: 4` with `backoff: { type: 'exponential', delay: 5000 }` (~75s total before exhaustion). Terminal errors are wrapped in BullMQ's `UnrecoverableError` so retries skip. Combined with plan 015/1, the original silent black-hole failure mode (verified live on 2026-04-26 via ucho/MNG-350) is fully closed: no more lost jobs on transient capacity misses or Docker hiccups, no more wedged locks. CLAUDE.md updated with the new "Dispatch failure semantics" passage. See [spec 015](docs/specs/015-router-job-dispatch-failure-recovery.md). - **Router dispatch failures now release in-memory locks via the BullMQ failed event** (spec 015, plan 1 of 2). Hooks `worker.on('failed')` on both `cascade-jobs` and `cascade-dashboard-jobs` queues to call a new `releaseLocksForFailedJob` compensator that releases the work-item lock, agent-type concurrency counter, and recently-dispatched dedup mark for any job whose dispatch fails. Closes the stranded-lock half of the prod incident verified on 2026-04-26 (ucho/MNG-350): a transient capacity miss was leaving the in-memory work-item lock wedged for 30 minutes, silently rejecting subsequent webhooks for the same trio. Also splits the webhook decision-reason vocabulary into three states — `Job queued` (success), `Awaiting worker slot: …` (in-flight, healthy), `Work item locked (no active dispatch): …` (wedged-lock canary, fires a Sentry capture tagged `wedged_lock_canary` so any regression in compensation is loud). 
Plan 2 closes the lost-job half (wait-for-slot, retry budget, error classifier). See [spec 015](docs/specs/015-router-job-dispatch-failure-recovery.md). - **`cascade-tools scm create-pr-review`: `--comment` alias + `--comments-file` escape hatch** (spec 014, plan 2 of 2). The command now accepts `--comment` (singular) as an alias for `--comments` — the exact muscle-memory mistake from prod run 5d993b04 now resolves correctly. Added `--comments-file ` (and `-` for stdin) as a JSON-parsed file alternative for long payloads that don't survive shell quoting. Zero edits to shared infrastructure (cliCommandFactory, manifestGenerator, nativeToolPrompts, errorEnvelope) — the two declarative fields on `createPRReviewDef.parameters.comments.cliAliases` + `createPRReviewDef.cli.fileInputAlternatives` are everything. Proves spec 014's single-entrypoint invariant: a new or evolved gadget should never need to touch shared machinery. See [spec 014](docs/specs/014-cascade-tools-agent-ergonomics.md). - **`cascade-tools` agent ergonomics: truthful system prompt, runnable `--help`, structured error envelope** (spec 014, plan 1 of 2). The system-prompt renderer that describes every cascade-tools command to agents now tells the truth about array-shaped parameters — no more silent `s`-stripping of names, no more ` (repeatable)` claim for array-of-object flags (they correctly render as `-- ''` now, with aliases appended via `|` and a one-line runnable JSON example inlined from the tool definition's `examples` block). Every CLI failure — flag-parse, JSON-parse, missing-required, enum-mismatch, unknown-flag, auth, runtime — emits a single structured envelope on stdout (`{"success":false,"error":{type,flag?,message,got?,expected?,hint?,example?}}`) plus a short prose summary on stderr for humans, replacing the ad-hoc mix of `this.error()` prose and `{success:false,error:""}` flat shapes. Mistyped flags get a "did you mean" suggestion via Levenshtein match against declared canonical names + aliases. `--help` now renders `def.examples` as copy-pasteable shell invocations under an `EXAMPLES` section. Root-caused by prod run 5d993b04-6e05-4ae1-b7de-8c274cf3496b where a review agent wasted ~2½ min fighting the prior pre-014 surface and ultimately dropped an inline PR comment. See [spec 014](docs/specs/014-cascade-tools-agent-ergonomics.md) + authoring guide at [`src/gadgets/README.md`](src/gadgets/README.md). diff --git a/CLAUDE.md b/CLAUDE.md index a39e5e33..40ec0bd0 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -123,6 +123,19 @@ Some triggers take params (e.g. `review` + `scm:check-suite-success` accepts `{" **Worker exit diagnostics** — when a worker container exits non-zero, the router calls `container.inspect()` *before* AutoRemove reaps it and stamps the run record's `error` field with a structured, grep-stable string: `Worker crashed with exit code N · OOMKilled= · reason=""`. The `OOMKilled=true` marker is the definitive cgroup-OOM signal (per Docker's own `State.OOMKilled`); a 137 exit *without* `OOMKilled=true` means the kill came from inside the container or from a non-cgroup signal — *not* memory. The `[WorkerManager] Resolved spawn settings` log emitted at every spawn includes both `projectWatchdogTimeoutMs` and `globalWorkerTimeoutMs` so post-mortems can confirm whether the per-project override actually won. See `src/router/active-workers.ts:formatCrashReason` for the format and `tests/unit/router/container-manager-diagnostics.test.ts` for regression pins. 
+**Dispatch failure semantics** — spec 015 (verified live in prod via the ucho/MNG-350 incident on 2026-04-26):
+
+- **Capacity miss waits, never throws.** When the dispatcher pulls a job and the worker pool is at `maxWorkers`, it `await`s a slot via the in-process slot-waiter (default `slotWaitTimeoutMs` = 5min). The slot is conceptually held by the running container — `slotReleased()` is called once per cleanup from `cleanupWorker`, never from the dispatcher.
+- **Transient Docker errors retry.** `ECONNREFUSED` / `ECONNRESET` / `ENOTFOUND` on the Docker socket, registry HTTP 429, container-name 409 collisions, and the `SLOT_WAIT_TIMEOUT` itself all classify as transient and propagate unchanged so BullMQ retries via `attempts: 4` + `backoff: { type: 'exponential', delay: 5000 }` (~75s total before exhaustion). Both `cascade-jobs` and `cascade-dashboard-jobs` use the same retry config.
+- **Terminal errors fail fast.** `TypeError` / `ZodError` (validation) and image-not-found *after* fallback exhaustion are wrapped in BullMQ's `UnrecoverableError`, which skips the retry budget entirely.
+- **Failed-event compensation releases locks.** Every dispatch failure (transient retry exhaustion, terminal error, slot-wait timeout exhaustion) flows through `worker.on('failed')`, which calls `releaseLocksForFailedJob` to release the work-item lock, agent-type counter, and recently-dispatched dedup mark. Without this, the locks leak for ~30min and silently reject every follow-up webhook for the same trio.
+- **Webhook decision reasons are three-way.** The decision reason recorded for a webhook distinguishes three states:
+  - `Job queued: ...` (success — not a lock rejection)
+  - `Awaiting worker slot: ...` (lock held + dispatch in flight — healthy)
+  - `Work item locked (no active dispatch): ...` (wedged-lock canary — the lock-state classifier could correlate the lock count with neither an active worker nor a queued/waiting BullMQ job; this fires a Sentry capture tagged `wedged_lock_canary` so any regression in compensation is loud)
+
+The wedged-lock canary should never fire under normal operation. Its presence in webhook logs or Sentry is itself a regression invariant: a code path acquired a lock without registering its compensation.
+
 ## Review agent — context shape (debugging)
 
 Review agent receives a **compact per-file diff context**, not full file contents. Each changed file is a `### (, +N -M)` section with a unified diff hunk. Budget: `REVIEW_DIFF_CONTEXT_TOKEN_LIMIT` = 200k tokens, per-file cap 10%. 
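A minimal sketch of the lock-rejection branch that the taxonomy above implies. The `classifyLockState` helper, the reason-string prefixes, and the `wedged_lock_canary` Sentry tag are pinned by this patch's webhook-processor tests; the function name and surrounding control flow below are illustrative, not the verbatim `webhook-processor.ts` source:

```typescript
// Illustrative sketch only: the helper name and control flow are assumptions;
// classifyLockState, captureException, and the reason prefixes come from this patch.
import { classifyLockState } from './lock-state-classifier.js';
import { captureException } from '../sentry.js';

// Maps a held work-item lock onto the two rejection reasons of the three-way
// taxonomy ("Job queued: ..." is the success path and never reaches this branch).
async function decisionReasonForHeldLock(
  trio: Parameters<typeof classifyLockState>[0],
  lockReason: string,
): Promise<string> {
  // 'awaiting-slot' | 'wedged' per the lock-state-classifier tests.
  const state = await classifyLockState(trio);

  if (state === 'wedged') {
    // Wedged-lock canary: lock held, but no active worker and no queued job.
    // Loud by design, since plan 015/1's compensation should make this unreachable.
    captureException(new Error(`Wedged work-item lock: ${trio.projectId}/${trio.workItemId}`), {
      tags: { source: 'wedged_lock_canary' },
    });
    return `Work item locked (no active dispatch): ${lockReason}`;
  }

  return `Awaiting worker slot: ${lockReason}`;
}
```

Classifying only on the rejection branch keeps the success path free of the queue lookup, which the "does not call classifyLockState when work item is not locked" test in this patch also pins.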
diff --git a/docs/plans/015-router-job-dispatch-failure-recovery/2-wait-for-slot-and-retry-classifier.md.wip b/docs/plans/015-router-job-dispatch-failure-recovery/2-wait-for-slot-and-retry-classifier.md.done similarity index 97% rename from docs/plans/015-router-job-dispatch-failure-recovery/2-wait-for-slot-and-retry-classifier.md.wip rename to docs/plans/015-router-job-dispatch-failure-recovery/2-wait-for-slot-and-retry-classifier.md.done index a3d3dc32..a199091a 100644 --- a/docs/plans/015-router-job-dispatch-failure-recovery/2-wait-for-slot-and-retry-classifier.md.wip +++ b/docs/plans/015-router-job-dispatch-failure-recovery/2-wait-for-slot-and-retry-classifier.md.done @@ -6,7 +6,7 @@ plan_slug: wait-for-slot-and-retry-classifier level: plan parent_spec: docs/specs/015-router-job-dispatch-failure-recovery.md depends_on: [1-failed-event-lock-compensation.md] -status: wip +status: done --- # 015/2: Wait-for-slot capacity, retry budget, error classifier @@ -227,20 +227,20 @@ n/a — all ACs auto-tested. ## Progress -- [ ] AC #1 (slot-waiter primitive) -- [ ] AC #2 (guardedSpawn awaits) -- [ ] AC #3 (cleanupWorker calls slotReleased) -- [ ] AC #4 (classifier taxonomy) -- [ ] AC #5 (queue retry config) -- [ ] AC #6 (UnrecoverableError wrap) -- [ ] AC #7 (end-to-end retry behavior) -- [ ] AC #8 (capacity-throw test replaced) -- [ ] AC #9 (CLAUDE.md updated) -- [ ] AC #10 (CHANGELOG) -- [ ] AC #11 (Plan 1 regression) -- [ ] AC #12 (TDD discipline) -- [ ] AC #13 (build) -- [ ] AC #14 (unit tests) -- [ ] AC #15 (integration) -- [ ] AC #16 (lint) -- [ ] AC #17 (typecheck) +- [x] AC #1 (slot-waiter primitive) +- [x] AC #2 (guardedSpawn awaits) +- [x] AC #3 (cleanupWorker calls slotReleased) +- [x] AC #4 (classifier taxonomy) +- [x] AC #5 (queue retry config) +- [x] AC #6 (UnrecoverableError wrap) +- [x] AC #7 (end-to-end retry behavior) +- [x] AC #8 (capacity-throw test replaced) +- [x] AC #9 (CLAUDE.md updated) +- [x] AC #10 (CHANGELOG) +- [x] AC #11 (Plan 1 regression) +- [x] AC #12 (TDD discipline) +- [x] AC #13 (build) +- [x] AC #14 (unit tests) +- [x] AC #15 (integration) +- [x] AC #16 (lint) +- [x] AC #17 (typecheck) diff --git a/src/queue/client.ts b/src/queue/client.ts index 77268d35..be28d2ca 100644 --- a/src/queue/client.ts +++ b/src/queue/client.ts @@ -55,7 +55,12 @@ function getQueue(): Queue { queue = new Queue(QUEUE_NAME, { connection: parseRedisUrl(redisUrl), defaultJobOptions: { - attempts: 1, + // Spec 015/2: bounded retries on dispatch failures, parity with + // the cascade-jobs queue. Manual-run / retry-run / debug-analysis + // jobs hit the same dispatch path — should benefit from the same + // transient-failure absorption. 
+ attempts: 4, + backoff: { type: 'exponential', delay: 5_000 }, removeOnComplete: { age: 24 * 60 * 60, count: 100 }, removeOnFail: { age: 7 * 24 * 60 * 60 }, }, diff --git a/src/router/active-workers.ts b/src/router/active-workers.ts index 44ccd34a..5d13d242 100644 --- a/src/router/active-workers.ts +++ b/src/router/active-workers.ts @@ -9,6 +9,7 @@ import { failOrphanedRun, failOrphanedRunFallback } from '../db/repositories/run import { logger } from '../utils/logging.js'; import { clearAgentTypeEnqueued } from './agent-type-lock.js'; import type { CascadeJob } from './queue.js'; +import { slotReleased } from './slot-waiter.js'; import { clearWorkItemEnqueued } from './work-item-lock.js'; export interface ActiveWorker { @@ -101,6 +102,10 @@ export function cleanupWorker(jobId: string, exitCode?: number, details?: ExitDe if (worker.projectId && worker.workItemId && worker.agentType) { clearWorkItemEnqueued(worker.projectId, worker.workItemId, worker.agentType); } + // Spec 015/2: free a worker slot so any dispatcher waiting in + // `acquireSlot()` can proceed. Idempotent — the surrounding + // `if (worker)` guard ensures we call this exactly once per cleanup. + slotReleased(); if (exitCode !== undefined && exitCode !== 0 && worker.projectId) { const durationMs = Date.now() - worker.startedAt.getTime(); const reason = formatCrashReason(exitCode, details); diff --git a/src/router/config.ts b/src/router/config.ts index 2fc5dd6b..d8588723 100644 --- a/src/router/config.ts +++ b/src/router/config.ts @@ -31,6 +31,12 @@ export interface RouterConfig { workerImage: string; workerMemoryMb: number; workerTimeoutMs: number; + /** + * How long a dispatcher will wait for a worker slot to free up before + * giving up and surfacing a transient SLOT_WAIT_TIMEOUT error (which + * BullMQ then retries). Spec 015/2. + */ + slotWaitTimeoutMs: number; // Network settings dockerNetwork: string; @@ -134,6 +140,7 @@ export const routerConfig: RouterConfig = { workerImage: process.env.WORKER_IMAGE || 'ghcr.io/mongrel-intelligence/cascade-worker:latest', workerMemoryMb: Number(process.env.WORKER_MEMORY_MB) || 4096, workerTimeoutMs: Number(process.env.WORKER_TIMEOUT_MS) || 30 * 60 * 1000, // 30 minutes + slotWaitTimeoutMs: Number(process.env.SLOT_WAIT_TIMEOUT_MS) || 5 * 60 * 1000, // 5 minutes dockerNetwork: process.env.DOCKER_NETWORK || 'services_default', emailScheduleIntervalMs: Number(process.env.EMAIL_SCHEDULE_INTERVAL_MS) || 5 * 60 * 1000, webhookCallbackBaseUrl: process.env.WEBHOOK_CALLBACK_BASE_URL, diff --git a/src/router/container-manager.ts b/src/router/container-manager.ts index c8be047b..ee75e168 100644 --- a/src/router/container-manager.ts +++ b/src/router/container-manager.ts @@ -338,8 +338,11 @@ export async function resolveSpawnSettings( * Returns true when a Docker error indicates the requested image does not exist. * Uses the HTTP statusCode from dockerode's error objects as the primary signal, * with a substring check on the message as a secondary guard. + * + * Exported for the dispatch-error classifier (spec 015/2) so it can + * recognise this terminal class and skip BullMQ retries for it. 
*/ -function isImageNotFoundError(err: unknown): boolean { +export function isImageNotFoundError(err: unknown): boolean { return ( err != null && typeof err === 'object' && diff --git a/src/router/dispatch-error-classifier.ts b/src/router/dispatch-error-classifier.ts new file mode 100644 index 00000000..ceff8aaa --- /dev/null +++ b/src/router/dispatch-error-classifier.ts @@ -0,0 +1,66 @@ +/** + * Dispatch-error classifier (spec 015/2). + * + * Decide whether a thrown error from `spawnWorker` / `acquireSlot` + * should burn a BullMQ retry attempt (`'transient'`) or skip retries + * by being wrapped in `UnrecoverableError` (`'terminal'`). + * + * Default is `'transient'` — when in doubt, retry. The retry budget + * itself is bounded (4 attempts with exponential backoff), so a true + * bug can't loop forever; it will surface via attempt exhaustion + + * Sentry capture in the failed-event hook. + * + * Recognized terminal classes: + * - validation errors (TypeError, ZodError) + * - image-not-found AFTER the fallback retry has already exhausted + * + * Recognized transient classes: + * - ECONNREFUSED / ECONNRESET / ENOTFOUND on the Docker socket + * - HTTP 429 from the registry (rate limit) + * - HTTP 409 "name already in use" (container-name collision race) + * - SLOT_WAIT_TIMEOUT from the slot-waiter primitive + */ + +import { isImageNotFoundError } from './container-manager.js'; + +export type DispatchErrorKind = 'transient' | 'terminal'; + +interface ErrorWithCode { + code?: unknown; + statusCode?: unknown; + name?: unknown; + message?: unknown; +} + +const TRANSIENT_NODE_CODES = new Set(['ECONNREFUSED', 'ECONNRESET', 'ENOTFOUND', 'ETIMEDOUT']); +const TRANSIENT_SLOT_CODES = new Set(['SLOT_WAIT_TIMEOUT']); +const TRANSIENT_HTTP_STATUS = new Set([429, 409]); + +export function classifyDispatchError(err: unknown): DispatchErrorKind { + if (err == null || typeof err !== 'object') return 'transient'; + + const e = err as ErrorWithCode; + + // Terminal: validation + if (e.name === 'ZodError') return 'terminal'; + if (err instanceof TypeError) return 'terminal'; + + // Terminal: image-not-found AFTER fallback (the spawnWorker path's last + // resort already retried with the base image; if we still got here, the + // base image is genuinely missing). + if (isImageNotFoundError(err)) return 'terminal'; + + // Transient: tagged slot-wait timeout + if (typeof e.code === 'string' && TRANSIENT_SLOT_CODES.has(e.code)) return 'transient'; + + // Transient: socket-level Node errors + if (typeof e.code === 'string' && TRANSIENT_NODE_CODES.has(e.code)) return 'transient'; + + // Transient: known transient HTTP statuses + if (typeof e.statusCode === 'number' && TRANSIENT_HTTP_STATUS.has(e.statusCode)) + return 'transient'; + + // Default-to-retry: unknown shape. Better to burn a retry than to + // silently bury a real bug as terminal — the retry budget caps risk. + return 'transient'; +} diff --git a/src/router/queue.ts b/src/router/queue.ts index b69067e3..3d1ea39a 100644 --- a/src/router/queue.ts +++ b/src/router/queue.ts @@ -76,7 +76,14 @@ export type CascadeJob = TrelloJob | GitHubJob | JiraJob | SentryJob | LinearJob export const jobQueue = new Queue('cascade-jobs', { connection, defaultJobOptions: { - attempts: 1, // No retries - agents handle their own errors + // Spec 015/2: bounded retries on dispatch failures only. Terminal + // errors (validation, image-not-found-after-fallback) bypass via + // `UnrecoverableError`. 
Agents themselves still handle their own + // internal errors — these attempts apply only to the dispatch path + // (the time between BullMQ pulling the job and the worker + // container *starting*, before the agent is even running). + attempts: 4, + backoff: { type: 'exponential', delay: 5_000 }, removeOnComplete: { age: 24 * 60 * 60, // Keep completed jobs for 24 hours count: 100, // Keep last 100 completed jobs diff --git a/src/router/slot-waiter.ts b/src/router/slot-waiter.ts new file mode 100644 index 00000000..5946a9c9 --- /dev/null +++ b/src/router/slot-waiter.ts @@ -0,0 +1,88 @@ +/** + * In-process slot waiter — semaphore-style backpressure for the dispatcher. + * + * Replaces the old "throw on capacity" pattern from spec 015/2. When the + * dispatcher pulls a job and the worker pool is already at `maxWorkers`, + * the dispatcher awaits a slot up to a bounded timeout. If a slot frees + * (because a running container exits and `cleanupWorker` calls + * `slotReleased`), the waiter resolves and the job dispatches normally. + * If the timeout fires, the waiter rejects with a tagged + * `code: 'SLOT_WAIT_TIMEOUT'` error — the dispatch-error classifier + * recognises this code as transient, so BullMQ's retry budget kicks in. + * + * The slot is conceptually held by the running container, NOT by the + * dispatcher. `slotReleased()` is called once per cleanup from + * `cleanupWorker` (see spec 015/2 plan). The dispatcher does NOT call it. + */ + +import { logger } from '../utils/logging.js'; +import { getActiveWorkerCount } from './active-workers.js'; +import { routerConfig } from './config.js'; + +interface PendingWaiter { + resolve: () => void; + reject: (err: Error) => void; + timeoutHandle: NodeJS.Timeout; +} + +const pending: PendingWaiter[] = []; + +/** + * Wait until the worker pool has capacity, or the timeout fires. + * + * If `getActiveWorkerCount() < routerConfig.maxWorkers`, resolves + * immediately. Otherwise queues a waiter that the next `slotReleased()` + * call will pop. If the waiter sits longer than `timeoutMs`, it rejects + * with `code: 'SLOT_WAIT_TIMEOUT'`. + */ +export function acquireSlot(opts: { timeoutMs: number }): Promise { + if (getActiveWorkerCount() < routerConfig.maxWorkers) { + return Promise.resolve(); + } + + return new Promise((resolve, reject) => { + const entry: PendingWaiter = { + resolve, + reject, + timeoutHandle: setTimeout(() => { + const idx = pending.indexOf(entry); + if (idx >= 0) pending.splice(idx, 1); + const err = new Error( + `Slot wait timed out after ${opts.timeoutMs}ms — worker pool stuck`, + ) as Error & { code: string }; + err.code = 'SLOT_WAIT_TIMEOUT'; + logger.warn('[slot-waiter] timed out', { timeoutMs: opts.timeoutMs }); + reject(err); + }, opts.timeoutMs), + }; + pending.push(entry); + }); +} + +/** + * Pop the head waiter and resolve it. No-op if the queue is empty — + * called every time a worker container exits, regardless of whether + * any dispatcher is currently waiting. + */ +export function slotReleased(): void { + const next = pending.shift(); + if (!next) return; + clearTimeout(next.timeoutHandle); + next.resolve(); +} + +/** + * Reject every pending waiter with `code: 'SHUTDOWN'`. Called on + * router shutdown / detachAll to avoid leaking timers and to surface + * a clear error to in-flight dispatchers. 
+ */ +export function clearAllWaiters(): void { + while (pending.length > 0) { + const entry = pending.shift(); + if (!entry) break; + clearTimeout(entry.timeoutHandle); + const err = new Error('slot-waiter: shutdown') as Error & { code: string }; + err.code = 'SHUTDOWN'; + entry.reject(err); + } +} diff --git a/src/router/worker-manager.ts b/src/router/worker-manager.ts index 11709767..f031c76f 100644 --- a/src/router/worker-manager.ts +++ b/src/router/worker-manager.ts @@ -7,7 +7,7 @@ * Public API is unchanged — all consumers continue importing from this module. */ -import type { Job, Worker } from 'bullmq'; +import { type Job, UnrecoverableError, type Worker } from 'bullmq'; import { logger } from '../utils/logging.js'; import { createQueueWorker, parseRedisUrl } from './bullmq-workers.js'; import { routerConfig } from './config.js'; @@ -19,7 +19,9 @@ import { startOrphanCleanup, stopOrphanCleanup, } from './container-manager.js'; +import { classifyDispatchError } from './dispatch-error-classifier.js'; import type { CascadeJob } from './queue.js'; +import { acquireSlot, clearAllWaiters } from './slot-waiter.js'; import { startSnapshotCleanup, stopSnapshotCleanup } from './snapshot-cleanup.js'; import { syncSnapshotsFromDocker } from './snapshot-startup-sync.js'; @@ -35,18 +37,34 @@ let dashboardWorker: Worker | null = null; // Using a fixed 8-hour value prevents lock expiry for long-running containers. const BULLMQ_LOCK_DURATION_MS = 8 * 60 * 60 * 1000; -/** Guard that enforces the per-router concurrency cap before spawning. */ +/** + * Guard that backpressures the dispatcher to the per-router concurrency cap + * and classifies spawn errors for BullMQ retry policy (spec 015/2). + * + * Capacity miss: `acquireSlot` waits up to `slotWaitTimeoutMs` for a slot + * to free; on timeout it rejects with `code: 'SLOT_WAIT_TIMEOUT'`, which + * the classifier treats as transient so BullMQ retries via attempts/backoff. + * + * Spawn error: a transient error (Docker daemon unreachable, name collision + * race, registry rate-limit) propagates unchanged — BullMQ retries. A + * terminal error (validation, image-not-found-after-fallback) is wrapped in + * `UnrecoverableError` so BullMQ skips the retry budget and the failed-event + * compensator from spec 015/1 runs once at exhaustion. + * + * The slot is conceptually held by the running container, NOT by the + * dispatcher — `slotReleased()` is called from `cleanupWorker` at container + * exit, never from here. + */ async function guardedSpawn(job: Job): Promise { - // Check if we have capacity. - // This shouldn't happen with proper concurrency settings, - // but just in case, throw to retry later. - if (getActiveWorkerCount() >= routerConfig.maxWorkers) { - throw new Error('No worker slots available'); + await acquireSlot({ timeoutMs: routerConfig.slotWaitTimeoutMs }); + try { + await spawnWorker(job); + } catch (err) { + if (classifyDispatchError(err) === 'terminal') { + throw new UnrecoverableError(err instanceof Error ? err.message : String(err)); + } + throw err; } - await spawnWorker(job); - // Note: We don't wait for the container to complete here. - // The job is considered "processed" once the container starts. - // Container exit is handled asynchronously. } export function startWorkerProcessor(): void { @@ -115,5 +133,9 @@ export async function stopWorkerProcessor(): Promise { // watchdog (src/utils/lifecycle.ts) for timeout enforcement. detachAll(); + // Reject any pending slot waiters so they don't leak timers across the + // shutdown. 
Spec 015/2. + clearAllWaiters(); + logger.info('[WorkerManager] Stopped'); } diff --git a/tests/integration/router/dispatch-retry.test.ts b/tests/integration/router/dispatch-retry.test.ts new file mode 100644 index 00000000..2f446e2d --- /dev/null +++ b/tests/integration/router/dispatch-retry.test.ts @@ -0,0 +1,211 @@ +/** + * Module-integration test for spec 015/2. + * + * Validates the dispatch-path retry contract end-to-end: + * - transient errors propagate unchanged so BullMQ retries + * - terminal errors are wrapped in `UnrecoverableError` so retries skip + * - capacity miss waits for a slot rather than failing immediately + * + * Wires the REAL `guardedSpawn` body (via `createQueueWorker` → + * `processFn`) plus REAL `slot-waiter`, REAL `dispatch-error-classifier`, + * REAL `active-workers`, mocking only `spawnWorker` (so we can simulate + * Docker errors deterministically) and BullMQ's `Worker` constructor (so + * we can drive `failed`/process-fn calls synthetically without a real + * Redis). This is the load-bearing seam from spec 015/2. + * + * The full Redis-driven `attempts: 4 + backoff` retry timing is BullMQ's + * own well-tested behavior; we don't re-test it here. We only verify + * the *classification contract* on our side (transient vs terminal), + * because that's the thing this spec changed. + */ + +import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest'; + +vi.mock('bullmq', async (importOriginal) => { + const real = (await importOriginal()) as Record; + return { + ...real, + Worker: vi.fn().mockImplementation((_queueName, processFn, _opts) => ({ + on: vi.fn(), + close: vi.fn().mockResolvedValue(undefined), + __processFn: processFn, + })), + }; +}); + +vi.mock('../../../src/router/container-manager.js', () => ({ + spawnWorker: vi.fn(), + getActiveWorkerCount: vi.fn().mockReturnValue(0), + getActiveWorkers: vi.fn().mockReturnValue([]), + detachAll: vi.fn(), + startOrphanCleanup: vi.fn(), + stopOrphanCleanup: vi.fn(), + isImageNotFoundError: vi.fn().mockReturnValue(false), +})); + +// `slot-waiter` reads `getActiveWorkerCount` from `active-workers.js`, NOT +// from `container-manager.js` (despite the re-export). We mock both so the +// capacity-miss test can drive the real slot-waiter into the queued state. 
+vi.mock('../../../src/router/active-workers.js', () => ({ + getActiveWorkerCount: vi.fn().mockReturnValue(0), +})); + +vi.mock('../../../src/router/snapshot-cleanup.js', () => ({ + startSnapshotCleanup: vi.fn(), + stopSnapshotCleanup: vi.fn(), +})); + +vi.mock('../../../src/router/snapshot-startup-sync.js', () => ({ + syncSnapshotsFromDocker: vi.fn().mockResolvedValue(undefined), +})); + +vi.mock('../../../src/utils/logging.js', () => ({ + logger: { info: vi.fn(), warn: vi.fn(), error: vi.fn(), debug: vi.fn() }, +})); + +vi.mock('../../../src/sentry.js', () => ({ + captureException: vi.fn(), +})); + +import { Worker } from 'bullmq'; +import { getActiveWorkerCount } from '../../../src/router/active-workers.js'; +import { spawnWorker } from '../../../src/router/container-manager.js'; +import { syncSnapshotsFromDocker } from '../../../src/router/snapshot-startup-sync.js'; +import { startWorkerProcessor, stopWorkerProcessor } from '../../../src/router/worker-manager.js'; + +const MockWorker = vi.mocked(Worker); +const mockSpawnWorker = vi.mocked(spawnWorker); +const mockGetActiveWorkerCount = vi.mocked(getActiveWorkerCount); +const mockSyncSnapshots = vi.mocked(syncSnapshotsFromDocker); + +interface FakeWorker { + on: ReturnType; + __processFn: (job: unknown) => Promise; +} + +function getProcessFn(queueName: string): (job: unknown) => Promise { + const call = MockWorker.mock.results.find((_r, i) => { + const args = MockWorker.mock.calls[i]; + return args?.[0] === queueName; + }); + const w = call?.value as FakeWorker | undefined; + if (!w?.__processFn) throw new Error(`processFn not captured for queue ${queueName}`); + return w.__processFn; +} + +describe('spec 015/2: dispatch-path retry classification (module-integration)', () => { + beforeEach(async () => { + await stopWorkerProcessor(); + MockWorker.mockReset(); + MockWorker.mockImplementation( + (_queueName, processFn, _opts) => + ({ + on: vi.fn(), + close: vi.fn().mockResolvedValue(undefined), + __processFn: processFn, + }) as never, + ); + mockSpawnWorker.mockReset(); + mockGetActiveWorkerCount.mockReturnValue(0); + mockSyncSnapshots.mockResolvedValue(undefined); + startWorkerProcessor(); + }); + + afterEach(async () => { + await stopWorkerProcessor(); + vi.restoreAllMocks(); + }); + + it('cascade-jobs: transient spawn error propagates unchanged so BullMQ retries via attempts/backoff', async () => { + const transientErr = Object.assign(new Error('ECONNREFUSED docker.sock'), { + code: 'ECONNREFUSED', + }); + mockSpawnWorker.mockRejectedValueOnce(transientErr); + + const processFn = getProcessFn('cascade-jobs'); + await expect(processFn({ id: 'j1', data: { type: 'linear', projectId: 'p1' } })).rejects.toBe( + transientErr, + ); + }); + + it('cascade-jobs: terminal spawn error is wrapped in UnrecoverableError so BullMQ skips retries', async () => { + // TypeError is one of the terminal classes the dispatch-error + // classifier recognises. Image-not-found is also terminal but + // requires the real `isImageNotFoundError` predicate which is + // stubbed in this test. 
+ const terminalErr = new TypeError("Cannot read 'foo' of undefined"); + mockSpawnWorker.mockRejectedValueOnce(terminalErr); + + const processFn = getProcessFn('cascade-jobs'); + const rejectionSpy = vi.fn(); + await processFn({ id: 'j2', data: { type: 'linear', projectId: 'p1' } }).catch(rejectionSpy); + + expect(rejectionSpy).toHaveBeenCalledTimes(1); + const thrown = rejectionSpy.mock.calls[0][0]; + expect((thrown as Error).name).toBe('UnrecoverableError'); + }); + + it('cascade-dashboard-jobs: transient spawn error propagates unchanged (parity with main queue)', async () => { + const transientErr = Object.assign(new Error('socket hang up'), { code: 'ECONNRESET' }); + mockSpawnWorker.mockRejectedValueOnce(transientErr); + + const processFn = getProcessFn('cascade-dashboard-jobs'); + await expect( + processFn({ + id: 'manual-run-x', + data: { + type: 'manual-run', + projectId: 'p1', + workItemId: 'MNG-1', + agentType: 'implementation', + }, + }), + ).rejects.toBe(transientErr); + }); + + it('cascade-dashboard-jobs: terminal spawn error is wrapped in UnrecoverableError (parity)', async () => { + const terminalErr = new TypeError("Cannot read 'foo' of undefined"); + mockSpawnWorker.mockRejectedValueOnce(terminalErr); + + const processFn = getProcessFn('cascade-dashboard-jobs'); + const rejectionSpy = vi.fn(); + await processFn({ + id: 'manual-run-y', + data: { + type: 'manual-run', + projectId: 'p1', + workItemId: 'MNG-2', + agentType: 'review', + }, + }).catch(rejectionSpy); + + expect(rejectionSpy).toHaveBeenCalledTimes(1); + const thrown = rejectionSpy.mock.calls[0][0]; + expect((thrown as Error).name).toBe('UnrecoverableError'); + }); + + it('cascade-jobs: capacity miss waits for a slot, then dispatches when one frees', async () => { + // Force "at capacity" — use 999 to be safely above any plausible + // maxWorkers (config default 3 in the test env, real default also 3). + mockGetActiveWorkerCount.mockReturnValue(999); + mockSpawnWorker.mockResolvedValueOnce(undefined); + + const processFn = getProcessFn('cascade-jobs'); + const inflight = processFn({ id: 'j-cap', data: { type: 'linear', projectId: 'p1' } }); + + // Before the slot frees, spawnWorker must NOT have been called. + await Promise.resolve(); + await Promise.resolve(); + expect(mockSpawnWorker).not.toHaveBeenCalled(); + + // Free a slot by importing & calling the real `slotReleased`. + const { slotReleased } = await import('../../../src/router/slot-waiter.js'); + // Capacity is still 999 in the mock, but the waiter doesn't re-check + // — it simply pops the head waiter. So slotReleased() unblocks the + // inflight processFn, which proceeds to spawnWorker. 
+ slotReleased(); + + await inflight; + expect(mockSpawnWorker).toHaveBeenCalledTimes(1); + }); +}); diff --git a/tests/unit/router/active-workers.test.ts b/tests/unit/router/active-workers.test.ts index 33293b4e..fd9d5347 100644 --- a/tests/unit/router/active-workers.test.ts +++ b/tests/unit/router/active-workers.test.ts @@ -11,6 +11,7 @@ const { mockClearAllWorkItemLocks, mockClearAgentTypeEnqueued, mockClearAllAgentTypeLocks, + mockSlotReleased, } = vi.hoisted(() => ({ mockFailOrphanedRun: vi.fn().mockResolvedValue(null), mockFailOrphanedRunFallback: vi.fn().mockResolvedValue(null), @@ -18,6 +19,7 @@ const { mockClearAllWorkItemLocks: vi.fn(), mockClearAgentTypeEnqueued: vi.fn(), mockClearAllAgentTypeLocks: vi.fn(), + mockSlotReleased: vi.fn(), })); // --------------------------------------------------------------------------- @@ -39,6 +41,10 @@ vi.mock('../../../src/router/agent-type-lock.js', () => ({ clearAllAgentTypeLocks: (...args: unknown[]) => mockClearAllAgentTypeLocks(...args), })); +vi.mock('../../../src/router/slot-waiter.js', () => ({ + slotReleased: (...args: unknown[]) => mockSlotReleased(...args), +})); + // --------------------------------------------------------------------------- // Imports (after mocks) // --------------------------------------------------------------------------- @@ -86,6 +92,7 @@ describe('active-workers', () => { mockFailOrphanedRunFallback.mockResolvedValue(null); mockClearWorkItemEnqueued.mockClear(); mockClearAgentTypeEnqueued.mockClear(); + mockSlotReleased.mockClear(); }); afterEach(() => { @@ -207,6 +214,25 @@ describe('active-workers', () => { expect(mockClearAgentTypeEnqueued).toHaveBeenCalledWith('proj-1', 'review'); }); + it('calls slotReleased exactly once per cleanup (spec 015/2)', () => { + activeWorkers.set('job-slot', makeActiveWorker({ jobId: 'job-slot' })); + cleanupWorker('job-slot'); + expect(mockSlotReleased).toHaveBeenCalledTimes(1); + }); + + it('calls slotReleased on the crash path (exitCode != 0)', () => { + activeWorkers.set('job-crash', makeActiveWorker({ jobId: 'job-crash', projectId: 'p1' })); + cleanupWorker('job-crash', 137, { oomKilled: true }); + expect(mockSlotReleased).toHaveBeenCalledTimes(1); + }); + + it('does NOT double-call slotReleased on duplicate cleanup invocations', () => { + activeWorkers.set('job-dup', makeActiveWorker({ jobId: 'job-dup' })); + cleanupWorker('job-dup'); + cleanupWorker('job-dup'); // second call: worker already removed + expect(mockSlotReleased).toHaveBeenCalledTimes(1); + }); + it('calls failOrphanedRun on non-zero exit code', () => { mockFailOrphanedRun.mockResolvedValue('run-123'); activeWorkers.set( diff --git a/tests/unit/router/dispatch-error-classifier.test.ts b/tests/unit/router/dispatch-error-classifier.test.ts new file mode 100644 index 00000000..a026f39b --- /dev/null +++ b/tests/unit/router/dispatch-error-classifier.test.ts @@ -0,0 +1,66 @@ +import { describe, expect, it } from 'vitest'; + +import { classifyDispatchError } from '../../../src/router/dispatch-error-classifier.js'; + +describe('classifyDispatchError', () => { + it("Docker daemon unreachable (ECONNREFUSED) → 'transient'", () => { + const err = Object.assign(new Error('connect ECONNREFUSED /var/run/docker.sock'), { + code: 'ECONNREFUSED', + }); + expect(classifyDispatchError(err)).toBe('transient'); + }); + + it("Docker socket reset (ECONNRESET) → 'transient'", () => { + const err = Object.assign(new Error('socket hang up'), { code: 'ECONNRESET' }); + expect(classifyDispatchError(err)).toBe('transient'); + }); + 
+ it("DNS lookup failure (ENOTFOUND) → 'transient'", () => { + const err = Object.assign(new Error('getaddrinfo ENOTFOUND'), { code: 'ENOTFOUND' }); + expect(classifyDispatchError(err)).toBe('transient'); + }); + + it("registry rate-limit (HTTP 429) → 'transient'", () => { + const err = Object.assign(new Error('toomanyrequests: Rate limit'), { statusCode: 429 }); + expect(classifyDispatchError(err)).toBe('transient'); + }); + + it("container name collision (HTTP 409 'name already in use') → 'transient'", () => { + const err = Object.assign( + new Error('(HTTP code 409) The container name "/x" is already in use'), + { statusCode: 409 }, + ); + expect(classifyDispatchError(err)).toBe('transient'); + }); + + it("image not found after fallback (404 + 'no such image') → 'terminal'", () => { + const err = Object.assign(new Error('(HTTP code 404) No such image: foo:latest'), { + statusCode: 404, + }); + expect(classifyDispatchError(err)).toBe('terminal'); + }); + + it("validation error (TypeError) → 'terminal'", () => { + expect(classifyDispatchError(new TypeError("Cannot read 'foo' of undefined"))).toBe('terminal'); + }); + + it("slot-wait timeout (code: 'SLOT_WAIT_TIMEOUT') → 'transient'", () => { + const err = Object.assign(new Error('Slot wait timed out'), { code: 'SLOT_WAIT_TIMEOUT' }); + expect(classifyDispatchError(err)).toBe('transient'); + }); + + it("unknown error (no recognizable shape) → 'transient' (default-to-retry)", () => { + expect(classifyDispatchError(new Error('something weird'))).toBe('transient'); + }); + + it("ZodError-shaped (name='ZodError') → 'terminal'", () => { + const err = Object.assign(new Error('validation failed'), { name: 'ZodError' }); + expect(classifyDispatchError(err)).toBe('terminal'); + }); + + it("non-Error values → 'transient' (default-to-retry, never crash the classifier)", () => { + expect(classifyDispatchError('plain string')).toBe('transient'); + expect(classifyDispatchError(null)).toBe('transient'); + expect(classifyDispatchError(undefined)).toBe('transient'); + }); +}); diff --git a/tests/unit/router/slot-waiter.test.ts b/tests/unit/router/slot-waiter.test.ts new file mode 100644 index 00000000..c46d779f --- /dev/null +++ b/tests/unit/router/slot-waiter.test.ts @@ -0,0 +1,133 @@ +import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest'; + +vi.mock('../../../src/router/active-workers.js', () => ({ + getActiveWorkerCount: vi.fn(), +})); + +vi.mock('../../../src/router/config.js', () => ({ + routerConfig: { maxWorkers: 1 }, +})); + +vi.mock('../../../src/utils/logging.js', () => ({ + logger: { info: vi.fn(), warn: vi.fn(), error: vi.fn(), debug: vi.fn() }, +})); + +import { getActiveWorkerCount } from '../../../src/router/active-workers.js'; +import { routerConfig } from '../../../src/router/config.js'; +import { acquireSlot, clearAllWaiters, slotReleased } from '../../../src/router/slot-waiter.js'; + +const mockGetActiveWorkerCount = vi.mocked(getActiveWorkerCount); + +describe('slot-waiter', () => { + beforeEach(() => { + mockGetActiveWorkerCount.mockReset(); + clearAllWaiters(); + // Default: maxWorkers=1 + (routerConfig as { maxWorkers: number }).maxWorkers = 1; + }); + + afterEach(() => { + vi.useRealTimers(); + clearAllWaiters(); + }); + + it('resolves immediately when capacity is below max', async () => { + (routerConfig as { maxWorkers: number }).maxWorkers = 3; + mockGetActiveWorkerCount.mockReturnValue(1); + await expect(acquireSlot({ timeoutMs: 1000 })).resolves.toBeUndefined(); + }); + + it('suspends when at capacity, 
resolves when a slot frees', async () => { + mockGetActiveWorkerCount.mockReturnValue(1); + const acquired = acquireSlot({ timeoutMs: 5000 }); + + // One microtask turn — promise should still be pending + let settled = false; + void acquired.then(() => { + settled = true; + }); + await Promise.resolve(); + expect(settled).toBe(false); + + // Free up a slot — waiter resolves + slotReleased(); + await acquired; + expect(settled).toBe(true); + }); + + it('rejects with code SLOT_WAIT_TIMEOUT if no slot frees in time', async () => { + vi.useFakeTimers(); + mockGetActiveWorkerCount.mockReturnValue(1); + const acquired = acquireSlot({ timeoutMs: 50 }); + // Capture the rejection so it doesn't bubble as unhandled + const rejectionSpy = vi.fn(); + acquired.catch(rejectionSpy); + + await vi.advanceTimersByTimeAsync(60); + + expect(rejectionSpy).toHaveBeenCalledTimes(1); + const err = rejectionSpy.mock.calls[0][0]; + expect(err).toBeInstanceOf(Error); + expect((err as { code?: string }).code).toBe('SLOT_WAIT_TIMEOUT'); + }); + + it('multiple waiters resolve FIFO as slots free', async () => { + mockGetActiveWorkerCount.mockReturnValue(1); + const order: number[] = []; + const w1 = acquireSlot({ timeoutMs: 5000 }).then(() => order.push(1)); + const w2 = acquireSlot({ timeoutMs: 5000 }).then(() => order.push(2)); + const w3 = acquireSlot({ timeoutMs: 5000 }).then(() => order.push(3)); + + await Promise.resolve(); + expect(order).toEqual([]); + + slotReleased(); + await Promise.resolve(); + await Promise.resolve(); + expect(order).toEqual([1]); + + slotReleased(); + await Promise.resolve(); + await Promise.resolve(); + expect(order).toEqual([1, 2]); + + slotReleased(); + await Promise.all([w1, w2, w3]); + expect(order).toEqual([1, 2, 3]); + }); + + it('slotReleased called with no waiters is a no-op (does not throw)', () => { + expect(() => slotReleased()).not.toThrow(); + }); + + it('slotReleased does not double-release waiters when called multiple times in rapid succession', async () => { + mockGetActiveWorkerCount.mockReturnValue(1); + const acquired = acquireSlot({ timeoutMs: 5000 }); + const resolveSpy = vi.fn(); + acquired.then(resolveSpy); + + slotReleased(); + slotReleased(); // Extra release — must not double-resolve the same waiter + slotReleased(); + + await acquired; + // Microtask drain + await Promise.resolve(); + expect(resolveSpy).toHaveBeenCalledTimes(1); + }); + + it('clearAllWaiters rejects pending waiters with code SHUTDOWN', async () => { + mockGetActiveWorkerCount.mockReturnValue(1); + const acquired = acquireSlot({ timeoutMs: 5000 }); + const rejectionSpy = vi.fn(); + acquired.catch(rejectionSpy); + + clearAllWaiters(); + await Promise.resolve(); + await Promise.resolve(); + + expect(rejectionSpy).toHaveBeenCalledTimes(1); + const err = rejectionSpy.mock.calls[0][0]; + expect((err as { code?: string }).code).toBe('SHUTDOWN'); + }); +}); diff --git a/tests/unit/router/worker-manager.test.ts b/tests/unit/router/worker-manager.test.ts index f08c723f..fe604052 100644 --- a/tests/unit/router/worker-manager.test.ts +++ b/tests/unit/router/worker-manager.test.ts @@ -18,6 +18,24 @@ vi.mock('../../../src/router/container-manager.js', () => ({ stopOrphanCleanup: vi.fn(), })); +vi.mock('../../../src/router/slot-waiter.js', () => ({ + acquireSlot: vi.fn().mockResolvedValue(undefined), + clearAllWaiters: vi.fn(), +})); + +vi.mock('../../../src/router/dispatch-error-classifier.js', () => ({ + classifyDispatchError: vi.fn().mockReturnValue('transient'), +})); + +vi.mock('bullmq', () => ({ + 
UnrecoverableError: class extends Error { + constructor(message: string) { + super(message); + this.name = 'UnrecoverableError'; + } + }, +})); + vi.mock('../../../src/router/snapshot-cleanup.js', () => ({ startSnapshotCleanup: vi.fn(), stopSnapshotCleanup: vi.fn(), @@ -30,6 +48,7 @@ vi.mock('../../../src/router/config.js', () => ({ workerImage: 'test-worker:latest', workerMemoryMb: 512, workerTimeoutMs: 5000, + slotWaitTimeoutMs: 5 * 60 * 1000, dockerNetwork: 'test-network', }, })); @@ -57,6 +76,8 @@ import { startOrphanCleanup, stopOrphanCleanup, } from '../../../src/router/container-manager.js'; +import { classifyDispatchError } from '../../../src/router/dispatch-error-classifier.js'; +import { acquireSlot } from '../../../src/router/slot-waiter.js'; import { startSnapshotCleanup, stopSnapshotCleanup } from '../../../src/router/snapshot-cleanup.js'; import { startWorkerProcessor, @@ -77,6 +98,8 @@ const mockStopOrphanCleanup = vi.mocked(stopOrphanCleanup); const mockStartSnapshotCleanup = vi.mocked(startSnapshotCleanup); const mockStopSnapshotCleanup = vi.mocked(stopSnapshotCleanup); const mockLogger = vi.mocked(logger); +const mockAcquireSlot = vi.mocked(acquireSlot); +const mockClassifyDispatchError = vi.mocked(classifyDispatchError); // --------------------------------------------------------------------------- // Helpers @@ -176,19 +199,100 @@ describe('startWorkerProcessor', () => { expect(mockSpawnWorker).toHaveBeenCalledWith(fakeJob); }); - it('processFn throws when at capacity', async () => { + // REPLACED in spec 015/2: capacity miss now waits for a slot instead of + // throwing. The previous assertion `processFn throws when at capacity` + // is intentionally gone (per spec AC #9) — preserved here as an + // inverted test pinning the new contract. + it('processFn awaits a slot when at capacity, then dispatches when one frees', async () => { startWorkerProcessor(); const cascadeJobsCall = mockCreateQueueWorker.mock.calls.find( (call) => call[0].queueName === 'cascade-jobs', ); - const processFn = cascadeJobsCall?.[0].processFn; + const processFn = cascadeJobsCall?.[0].processFn as (j: unknown) => Promise; + + // `acquireSlot` resolves once a slot is available — drive that here. + let resolveAcquire: () => void = () => {}; + mockAcquireSlot.mockImplementationOnce( + () => + new Promise((res) => { + resolveAcquire = res; + }), + ); - // At capacity - mockGetActiveWorkerCount.mockReturnValue(3); // equals maxWorkers + mockSpawnWorker.mockClear(); const fakeJob = { id: 'j2', data: { type: 'trello', projectId: 'p1' } }; - await expect(processFn(fakeJob)).rejects.toThrow('No worker slots available'); + const inflight = processFn(fakeJob); + + // Before the slot frees, spawnWorker must NOT have been called. + await Promise.resolve(); expect(mockSpawnWorker).not.toHaveBeenCalled(); + + // Free the slot — processFn proceeds to spawnWorker. 
+ resolveAcquire(); + await inflight; + expect(mockSpawnWorker).toHaveBeenCalledWith(fakeJob); + }); + + it("processFn rejects with code 'SLOT_WAIT_TIMEOUT' when the wait exceeds the timeout", async () => { + startWorkerProcessor(); + + const cascadeJobsCall = mockCreateQueueWorker.mock.calls.find( + (call) => call[0].queueName === 'cascade-jobs', + ); + const processFn = cascadeJobsCall?.[0].processFn as (j: unknown) => Promise; + + const timeoutErr = Object.assign(new Error('Slot wait timed out'), { + code: 'SLOT_WAIT_TIMEOUT', + }); + mockAcquireSlot.mockRejectedValueOnce(timeoutErr); + // Slot timeout classifies as transient → propagates unchanged so + // BullMQ retries via attempts/backoff. + mockClassifyDispatchError.mockReturnValueOnce('transient'); + + mockSpawnWorker.mockClear(); + const fakeJob = { id: 'j2', data: { type: 'trello', projectId: 'p1' } }; + await expect(processFn(fakeJob)).rejects.toMatchObject({ code: 'SLOT_WAIT_TIMEOUT' }); + expect(mockSpawnWorker).not.toHaveBeenCalled(); + }); + + it('processFn propagates a transient spawn error unchanged so BullMQ retries', async () => { + startWorkerProcessor(); + + const cascadeJobsCall = mockCreateQueueWorker.mock.calls.find( + (call) => call[0].queueName === 'cascade-jobs', + ); + const processFn = cascadeJobsCall?.[0].processFn as (j: unknown) => Promise; + + const transientErr = Object.assign(new Error('ECONNREFUSED docker.sock'), { + code: 'ECONNREFUSED', + }); + mockSpawnWorker.mockRejectedValueOnce(transientErr); + mockClassifyDispatchError.mockReturnValueOnce('transient'); + + const fakeJob = { id: 'j3', data: { type: 'trello', projectId: 'p1' } }; + await expect(processFn(fakeJob)).rejects.toBe(transientErr); + }); + + it('processFn wraps a terminal spawn error in UnrecoverableError so retries are skipped', async () => { + startWorkerProcessor(); + + const cascadeJobsCall = mockCreateQueueWorker.mock.calls.find( + (call) => call[0].queueName === 'cascade-jobs', + ); + const processFn = cascadeJobsCall?.[0].processFn as (j: unknown) => Promise; + + const terminalErr = Object.assign(new TypeError("Cannot read 'foo'"), {}); + mockSpawnWorker.mockRejectedValueOnce(terminalErr); + mockClassifyDispatchError.mockReturnValueOnce('terminal'); + + const fakeJob = { id: 'j4', data: { type: 'trello', projectId: 'p1' } }; + const rejectionSpy = vi.fn(); + await processFn(fakeJob).catch(rejectionSpy); + + expect(rejectionSpy).toHaveBeenCalledTimes(1); + const thrown = rejectionSpy.mock.calls[0][0]; + expect((thrown as Error).name).toBe('UnrecoverableError'); }); }); From 29d86f15f2b9a4a34d668d9379269dd9fad9d249 Mon Sep 17 00:00:00 2001 From: Zbigniew Sobiecki Date: Sun, 26 Apr 2026 16:48:03 +0000 Subject: [PATCH 7/7] =?UTF-8?q?chore(spec):=20015=20done=20=E2=80=94=20rou?= =?UTF-8?q?ter=20job=20dispatch=20failure=20recovery,=20all=20plans=20comp?= =?UTF-8?q?lete?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes the silent black-hole bug class verified live on 2026-04-26 (ucho/MNG-350). Plan 1 added failed-event lock compensation + three-way decision-reason taxonomy; plan 2 replaced the throw-on-capacity with wait-for-slot, added bounded retry with exponential backoff, and introduced a transient/terminal error classifier. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- ...very.md => 015-router-job-dispatch-failure-recovery.md.done} | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename docs/specs/{015-router-job-dispatch-failure-recovery.md => 015-router-job-dispatch-failure-recovery.md.done} (99%) diff --git a/docs/specs/015-router-job-dispatch-failure-recovery.md b/docs/specs/015-router-job-dispatch-failure-recovery.md.done similarity index 99% rename from docs/specs/015-router-job-dispatch-failure-recovery.md rename to docs/specs/015-router-job-dispatch-failure-recovery.md.done index cbcf519c..e28996ce 100644 --- a/docs/specs/015-router-job-dispatch-failure-recovery.md +++ b/docs/specs/015-router-job-dispatch-failure-recovery.md.done @@ -4,7 +4,7 @@ slug: router-job-dispatch-failure-recovery level: spec title: Router job dispatch failure recovery created: 2026-04-26 -status: draft +status: done --- # 015: Router job dispatch failure recovery