Merge dev → main (deps + spec 015) #1208

Merged
zbigniewsobiecki merged 6 commits into main from dev
Apr 26, 2026
Conversation

@zbigniewsobiecki (Member)

Carries to prod:

Worker-spawn smoke check on prod required after deploy: dockerode 5.0 is on the worker-spawn path.

🤖 Generated with Claude Code

aaight and others added 6 commits April 26, 2026 18:59
…-to-* agents (#1201)

* fix(triggers): audit & fix PM feedback inconsistencies across respond-to-* agents

* fix(triggers): use case-insensitive JIRA status comparison in isInPlanningStatus

Match the established pattern from status-changed.ts and label-added.ts which
both use .toLowerCase() for JIRA status comparisons, since status names are
user-configurable and the API does not guarantee consistent casing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Cascade Bot <bot@cascade.dev>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…downloadAttachment (#1202)

Co-authored-by: Cascade Bot <bot@cascade.dev>
…native-binary probe (#1206)

The agent-harness SDK bump in #1197 (claude-agent-sdk 0.2.91 → 0.2.119)
broke every review run on cascade-prod with:

  ReferenceError: Claude Code native binary not found at
  /app/node_modules/@anthropic-ai/claude-agent-sdk-linux-x64-musl/claude

The new SDK probes its own platform-specific optional-dependency
subpackages for a bundled `claude` binary. Two failure modes hit at once:

  1. Cascade installs `@anthropic-ai/claude-code@2.1.119` globally at
     /usr/local/bin/claude — the SDK never looks there.
  2. The SDK probes the `-musl` variant first regardless of host libc and
     errors on ENOENT instead of falling through to the glibc variant.

Pass an explicit `pathToClaudeCodeExecutable` to short-circuit the probe.
The resolver checks (in order):

  - $CLAUDE_CODE_EXECUTABLE_PATH env override (local-dev escape hatch)
  - `which claude` in $PATH
  - /usr/local/bin/claude (Docker default from Dockerfile.worker)

Two TDD tests pin the option onto query() and prove the env override
wins. No Dockerfile change needed; the existing global install at
/usr/local/bin/claude becomes the resolver's runtime target.
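The three-step resolution order above can be sketched as follows. This is a minimal illustration, not the repo's actual helper: the function name, its error message, and the `which`-based probe are assumptions; only the check order and the `/usr/local/bin/claude` fallback come from the commit message.

```typescript
import { execSync } from "node:child_process";
import { existsSync } from "node:fs";

// Hypothetical sketch of the resolver order described above.
export function resolveClaudeExecutable(): string {
  // 1. Explicit env override (local-dev escape hatch)
  const override = process.env.CLAUDE_CODE_EXECUTABLE_PATH;
  if (override) return override;

  // 2. `which claude` in $PATH
  try {
    const found = execSync("which claude", { encoding: "utf8" }).trim();
    if (found) return found;
  } catch {
    // not on PATH; fall through to the Docker default
  }

  // 3. Docker default from Dockerfile.worker
  const dockerDefault = "/usr/local/bin/claude";
  if (existsSync(dockerDefault)) return dockerDefault;

  throw new Error("Claude Code executable not found");
}
```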

Confirmed broken on ucho PR #72 (cascade-prod review agent crash).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(spec/plans): add spec 015 + plans for router dispatch failure recovery

Spec captures the silent black-hole bug class verified live on 2026-04-26
(ucho/MNG-350): a transient capacity miss or Docker error during worker
spawn turns a webhook-driven job into a permanently failed BullMQ entry
while stranding the work-item / agent-type locks for up to 30 minutes,
silently rejecting subsequent webhooks for the same work item.

Decomposed into two plans with safety-net-first sequencing: plan 1 hooks
the BullMQ failed event to release locks on every dispatch failure path;
plan 2 replaces the throw-on-capacity with a wait-for-slot semaphore,
adds bounded retry with exponential backoff, and a transient/terminal
error classifier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): lock 015/1 failed-event-lock-compensation

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(router): plan 015/1 done — release locks on dispatch failure

Closes the stranded-lock half of spec 015's bug class verified live in
prod on 2026-04-26 (ucho/MNG-350). When a webhook-driven job's dispatch
fails — capacity throw, Docker spawn error, or any future throw site —
the work-item lock, agent-type concurrency counter, and recently-dispatched
dedup mark established by the webhook → enqueue path are now released by
a compensator hooked to BullMQ's `worker.on('failed')` event.

What landed:

- `src/router/dispatch-compensator.ts` (new) — `releaseLocksForFailedJob`
  wraps `extractProjectIdFromJob` / `extractWorkItemId` / `extractAgentType`
  and calls into `clearWorkItemEnqueued` / `clearAgentTypeEnqueued` /
  `clearRecentlyDispatched`. Never propagates errors; captures to Sentry
  with `tags: { source: 'dispatch_compensator' }`.
- `src/router/agent-type-lock.ts` — exports new `clearRecentlyDispatched`
  for the compensator. The existing `markRecentlyDispatched` semantics
  are unchanged (60s TTL, NOT cleared on completion); this helper exists
  solely so a permanently-failed dispatch doesn't keep deduping a fresh
  webhook for ~60s while the user retries.
- `src/router/bullmq-workers.ts` — extends the existing
  `worker.on('failed')` handler to invoke `releaseLocksForFailedJob`
  alongside the existing logger + Sentry calls. Wraps the call in a
  defensive `.catch` so a future regression in the compensator can't
  poison the worker.
- `src/router/lock-state-classifier.ts` (new) — `classifyLockState` returns
  `'awaiting-slot'` when an active worker or queued/waiting job matches
  the trio, `'wedged'` when neither correlation matches. Defaults to
  `'awaiting-slot'` on classifier error so a Redis blip doesn't mis-emit
  the wedged canary.
- `src/router/active-workers.ts` — `getActiveWorkers()` now exposes
  `(projectId, workItemId, agentType)` so the classifier can correlate.
  Backwards-compatible (existing callers work unchanged; new fields are
  additive optional).
- `src/router/webhook-processor.ts` — Step 8 (work-item lock check) now
  splits the decision-reason vocabulary into three states:
    * `Job queued: ...` (success path)
    * `Awaiting worker slot: ...` (lock held + dispatch in flight; healthy)
    * `Work item locked (no active dispatch): ...` (wedged-lock canary)
  The wedged branch additionally fires `captureException` with
  `tags: { source: 'wedged_lock_canary' }` so any regression in
  compensation is loud in production.
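The compensator contract described above (clear all three marks, never propagate, report to Sentry) can be sketched like this. The lock-helper and extractor names come from the commit message, but their signatures here, the dependency-injection shape, and the job-data field names are assumptions for illustration:

```typescript
// Hypothetical sketch of releaseLocksForFailedJob; not the repo's code.
type FailedJob = { id?: string; data: Record<string, unknown> };

type CompensatorDeps = {
  clearWorkItemEnqueued: (projectId: string, workItemId: string) => Promise<void>;
  clearAgentTypeEnqueued: (projectId: string, agentType: string) => Promise<void>;
  clearRecentlyDispatched: (
    projectId: string, workItemId: string, agentType: string,
  ) => Promise<void>;
  captureException: (err: unknown, ctx: object) => void;
};

export async function releaseLocksForFailedJob(
  job: FailedJob,
  deps: CompensatorDeps,
): Promise<void> {
  try {
    const projectId = String(job.data.projectId);
    const workItemId = String(job.data.workItemId);
    const agentType = String(job.data.agentType);
    await deps.clearWorkItemEnqueued(projectId, workItemId);
    await deps.clearAgentTypeEnqueued(projectId, agentType);
    await deps.clearRecentlyDispatched(projectId, workItemId, agentType);
  } catch (err) {
    // Never propagate: a compensator bug must not poison the worker.
    deps.captureException(err, { tags: { source: "dispatch_compensator" } });
  }
}
```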

What this does NOT change (intentional, all in plan 015/2):
- `guardedSpawn` still throws on capacity (BullMQ marks the job failed,
  the compensator now releases the locks, but the job itself is still
  lost). Plan 2 replaces the throw with a wait-for-slot semaphore.
- Both queues still default to `attempts: 1`. Plan 2 raises this with
  exponential backoff and adds a transient/terminal error classifier.
- CLAUDE.md is intentionally not updated by this plan — the unified
  passage describing both halves of the new contract lands in plan 015/2.
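The `classifyLockState` decision described in the list above (match the trio against active workers or queued jobs, default to `'awaiting-slot'` on error) can be sketched as follows. The probe callbacks are assumptions standing in for the real active-worker and BullMQ queue lookups:

```typescript
// Hypothetical sketch of classifyLockState; not the repo's code.
type LockState = "awaiting-slot" | "wedged";

type LockStateProbes = {
  hasActiveWorker: (p: string, w: string, a: string) => Promise<boolean>;
  hasQueuedJob: (p: string, w: string, a: string) => Promise<boolean>;
};

export async function classifyLockState(
  projectId: string,
  workItemId: string,
  agentType: string,
  probes: LockStateProbes,
): Promise<LockState> {
  try {
    if (await probes.hasActiveWorker(projectId, workItemId, agentType)) {
      return "awaiting-slot"; // dispatch in flight: healthy
    }
    if (await probes.hasQueuedJob(projectId, workItemId, agentType)) {
      return "awaiting-slot"; // still queued/waiting: healthy
    }
    return "wedged"; // lock held but nothing in flight: fire the canary
  } catch {
    // A Redis blip must not mis-emit the wedged canary.
    return "awaiting-slot";
  }
}
```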

Tests:
- 5 new unit tests in `dispatch-compensator.test.ts`
- 3 new unit tests in `agent-type-lock.test.ts` for `clearRecentlyDispatched`
- 4 new unit tests in `bullmq-workers.test.ts` for the failed-event seam
- 5 new unit tests in `lock-state-classifier.test.ts`
- 2 new unit tests in `active-workers.test.ts` for the extended shape
- 4 new unit tests in `webhook-processor.test.ts` for the three-way taxonomy
- 3 new module-integration tests in
  `tests/integration/router/dispatch-failure-compensation.test.ts` exercise
  the real lock modules + real bullmq-workers.ts failed-event handler +
  real compensator end-to-end (only BullMQ's Worker constructor + the
  worker-env extractors are mocked).

Full suite: 8515 passed / 23 skipped / 0 failed. Lint + typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): lock 015/2 wait-for-slot-and-retry-classifier

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): mark 015/2 status: wip

* feat(router): plan 015/2 done — wait-for-slot + retry budget + classifier

Closes the lost-job half of spec 015's bug class. Combined with plan 015/1,
the silent black-hole failure mode verified live in prod on 2026-04-26
(ucho/MNG-350) is now fully closed.

What landed:

- `src/router/slot-waiter.ts` (new) — semaphore-style primitive:
  `acquireSlot({ timeoutMs })` resolves immediately when capacity is
  below `routerConfig.maxWorkers`, otherwise queues a FIFO waiter with
  a bounded timeout that rejects with `code: 'SLOT_WAIT_TIMEOUT'`.
  `slotReleased()` pops the head waiter; `clearAllWaiters()` rejects
  every pending waiter with `code: 'SHUTDOWN'` on router stop.
- `src/router/dispatch-error-classifier.ts` (new) — classifies thrown
  errors into `'transient'` (Docker socket Node codes, HTTP 429/409,
  SLOT_WAIT_TIMEOUT, anything unknown — default-to-retry) vs
  `'terminal'` (TypeError, ZodError, image-not-found-after-fallback).
- `src/router/worker-manager.ts` — `guardedSpawn` rewritten:
  `await acquireSlot(...)` replaces the synchronous capacity throw;
  on spawn error, terminal errors are wrapped in BullMQ's
  `UnrecoverableError` so retries skip; transient errors propagate
  unchanged so BullMQ retries via attempts/backoff.
- `src/router/active-workers.ts` — `cleanupWorker` now calls
  `slotReleased()` exactly once per cleanup, including on the crash
  path. The existing `if (worker)` guard ensures idempotence.
- `src/router/config.ts` — new `slotWaitTimeoutMs` field (default
  5min, configurable via `SLOT_WAIT_TIMEOUT_MS`).
- `src/router/queue.ts` and `src/queue/client.ts` — both queues now
  default to `attempts: 4` with `backoff: { type: 'exponential', delay: 5000 }`
  (~75s total before exhaustion). Terminal errors bypass via
  `UnrecoverableError`.
- `src/router/container-manager.ts` — exports the existing
  `isImageNotFoundError` predicate so the classifier can reuse it.

Test contract change (spec AC #9):

The previous `tests/unit/router/worker-manager.test.ts:179` assertion
`'processFn throws when at capacity'` is REPLACED (not deleted) with
`'processFn awaits a slot when at capacity, then dispatches when one
frees'`. The throw-on-capacity contract is gone forever.
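The transient/terminal split can be sketched as below. The classification rules (Docker socket Node codes, HTTP 429/409, `SLOT_WAIT_TIMEOUT`, default-to-retry for unknowns, `TypeError` as terminal) come from the commit message; the specific Node error codes in the set and the `statusCode` field shape are assumptions:

```typescript
// Hypothetical sketch of dispatch-error-classifier.ts; not the repo's code.
// Assumed set of Docker-socket Node error codes treated as transient.
const TRANSIENT_NODE_CODES = new Set(["ECONNRESET", "EPIPE", "ETIMEDOUT", "ENOENT"]);

export function classifyDispatchError(err: unknown): "transient" | "terminal" {
  // Programming errors never succeed on retry.
  if (err instanceof TypeError) return "terminal";
  const e = err as { code?: unknown; statusCode?: unknown } | null;
  if (e?.code === "SLOT_WAIT_TIMEOUT") return "transient";
  if (typeof e?.code === "string" && TRANSIENT_NODE_CODES.has(e.code)) return "transient";
  if (e?.statusCode === 429 || e?.statusCode === 409) return "transient";
  // Anything unknown defaults to retry, per the commit message.
  return "transient";
}
```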

Tests:

- 7 new unit tests in `slot-waiter.test.ts` (FIFO, timeout, no-op,
  shutdown rejection)
- 11 new unit tests in `dispatch-error-classifier.test.ts` covering
  every transient/terminal class
- 4 new unit tests in `worker-manager.test.ts` (replaced original
  capacity-throw test + 3 for retry classification)
- 3 new unit tests in `active-workers.test.ts` for slotReleased
  integration
- 5 new module-integration tests in `dispatch-retry.test.ts` exercise
  REAL guardedSpawn + REAL slot-waiter + REAL dispatch-error-classifier
  against both queues, mocking only spawnWorker + BullMQ Worker
  constructor.

Plan 1's 3 module-integration tests continue to pass alongside plan 2's 5.
Full unit suite: 8539 passed / 23 skipped / 0 failed.

CLAUDE.md updated with a new "Dispatch failure semantics" section
documenting the unified contract (capacity wait, retry budget,
classifier, three-way decision-reason taxonomy from plan 1, wedged-lock
canary). File now 182 lines, under the 200-line cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(spec): 015 done — router job dispatch failure recovery, all plans complete

Closes the silent black-hole bug class verified live on 2026-04-26
(ucho/MNG-350). Plan 1 added failed-event lock compensation +
three-way decision-reason taxonomy; plan 2 replaced the throw-on-capacity
with wait-for-slot, added bounded retry with exponential backoff, and
introduced a transient/terminal error classifier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps [postcss](https://github.com/postcss/postcss) from 8.5.8 to 8.5.12.
- [Release notes](https://github.com/postcss/postcss/releases)
- [Changelog](https://github.com/postcss/postcss/blob/main/CHANGELOG.md)
- [Commits](postcss/postcss@8.5.8...8.5.12)

---
updated-dependencies:
- dependency-name: postcss
  dependency-version: 8.5.12
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Removes [uuid](https://github.com/uuidjs/uuid). It's no longer used after updating ancestor dependencies [uuid](https://github.com/uuidjs/uuid), [bullmq](https://github.com/taskforcesh/bullmq) and [dockerode](https://github.com/apocas/dockerode). These dependencies need to be updated together.


Removes `uuid`

Updates `bullmq` from 5.72.0 to 5.76.2
- [Release notes](https://github.com/taskforcesh/bullmq/releases)
- [Commits](taskforcesh/bullmq@v5.72.0...v5.76.2)

Updates `dockerode` from 4.0.10 to 5.0.0
- [Release notes](https://github.com/apocas/dockerode/releases)
- [Commits](apocas/dockerode@v4.0.10...v5.0.0)

---
updated-dependencies:
- dependency-name: uuid
  dependency-version: 
  dependency-type: indirect
- dependency-name: bullmq
  dependency-version: 5.76.2
  dependency-type: direct:production
- dependency-name: dockerode
  dependency-version: 5.0.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@zbigniewsobiecki zbigniewsobiecki merged commit 5352887 into main Apr 26, 2026
16 of 17 checks passed