Merge dev → main (deps + spec 015) #1208

Merged
zbigniewsobiecki merged 6 commits into main from dev
Apr 26, 2026
Conversation

@zbigniewsobiecki (Member)

Carries to prod:

Worker-spawn smoke check on prod required after deploy: dockerode 5.0 is on the worker-spawn path.

🤖 Generated with Claude Code

aaight and others added 6 commits April 26, 2026 18:59
…-to-* agents (#1201)

* fix(triggers): audit & fix PM feedback inconsistencies across respond-to-* agents

* fix(triggers): use case-insensitive JIRA status comparison in isInPlanningStatus

Match the established pattern from status-changed.ts and label-added.ts which
both use .toLowerCase() for JIRA status comparisons, since status names are
user-configurable and the API does not guarantee consistent casing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Cascade Bot <bot@cascade.dev>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…downloadAttachment (#1202)

Co-authored-by: Cascade Bot <bot@cascade.dev>
…native-binary probe (#1206)

The agent-harness SDK bump in #1197 (claude-agent-sdk 0.2.91 → 0.2.119)
broke every review run on cascade-prod with:

  ReferenceError: Claude Code native binary not found at
  /app/node_modules/@anthropic-ai/claude-agent-sdk-linux-x64-musl/claude

The new SDK probes its own platform-specific optional-dependency
subpackages for a bundled `claude` binary. Two failure modes hit at once:

  1. Cascade installs `@anthropic-ai/claude-code@2.1.119` globally at
     /usr/local/bin/claude — the SDK never looks there.
  2. The SDK probes the `-musl` variant first regardless of host libc and
     errors on ENOENT instead of falling through to the glibc variant.

Pass an explicit `pathToClaudeCodeExecutable` to short-circuit the probe.
The resolver checks (in order):

  - $CLAUDE_CODE_EXECUTABLE_PATH env override (local-dev escape hatch)
  - `which claude` in $PATH
  - /usr/local/bin/claude (Docker default from Dockerfile.worker)

Two TDD tests pin the option onto query() and prove the env override
wins. No Dockerfile change needed; the existing global install at
/usr/local/bin/claude becomes the resolver's runtime target.
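The three-step resolution order above can be sketched as follows. This is a minimal illustration, not the repo's actual helper: the function name, its error message, and the `which`-based probe are assumptions; only the check order and the `/usr/local/bin/claude` fallback come from the commit message.

```typescript
import { execSync } from "node:child_process";
import { existsSync } from "node:fs";

// Hypothetical sketch of the resolver order described above.
export function resolveClaudeExecutable(): string {
  // 1. Explicit env override (local-dev escape hatch)
  const override = process.env.CLAUDE_CODE_EXECUTABLE_PATH;
  if (override) return override;

  // 2. `which claude` in $PATH
  try {
    const found = execSync("which claude", { encoding: "utf8" }).trim();
    if (found) return found;
  } catch {
    // not on PATH; fall through to the Docker default
  }

  // 3. Docker default from Dockerfile.worker
  const dockerDefault = "/usr/local/bin/claude";
  if (existsSync(dockerDefault)) return dockerDefault;

  throw new Error("Claude Code executable not found");
}
```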

Confirmed broken on ucho PR #72 (cascade-prod review agent crash).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(spec/plans): add spec 015 + plans for router dispatch failure recovery

Spec captures the silent black-hole bug class verified live on 2026-04-26
(ucho/MNG-350): a transient capacity miss or Docker error during worker
spawn turns a webhook-driven job into a permanently failed BullMQ entry
while stranding the work-item / agent-type locks for up to 30 minutes,
silently rejecting subsequent webhooks for the same work item.

Decomposed into two plans with safety-net-first sequencing: plan 1 hooks
the BullMQ failed event to release locks on every dispatch failure path;
plan 2 replaces the throw-on-capacity with a wait-for-slot semaphore,
adds bounded retry with exponential backoff, and a transient/terminal
error classifier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): lock 015/1 failed-event-lock-compensation

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(router): plan 015/1 done — release locks on dispatch failure

Closes the stranded-lock half of spec 015's bug class verified live in
prod on 2026-04-26 (ucho/MNG-350). When a webhook-driven job's dispatch
fails — capacity throw, Docker spawn error, or any future throw site —
the work-item lock, agent-type concurrency counter, and recently-dispatched
dedup mark established by the webhook → enqueue path are now released by
a compensator hooked to BullMQ's `worker.on('failed')` event.

What landed:

- `src/router/dispatch-compensator.ts` (new) — `releaseLocksForFailedJob`
  wraps `extractProjectIdFromJob` / `extractWorkItemId` / `extractAgentType`
  and calls into `clearWorkItemEnqueued` / `clearAgentTypeEnqueued` /
  `clearRecentlyDispatched`. Never propagates errors; captures to Sentry
  with `tags: { source: 'dispatch_compensator' }`.
- `src/router/agent-type-lock.ts` — exports new `clearRecentlyDispatched`
  for the compensator. The existing `markRecentlyDispatched` semantics
  are unchanged (60s TTL, NOT cleared on completion); this helper exists
  solely so a permanently-failed dispatch doesn't keep deduping a fresh
  webhook for ~60s while the user retries.
- `src/router/bullmq-workers.ts` — extends the existing
  `worker.on('failed')` handler to invoke `releaseLocksForFailedJob`
  alongside the existing logger + Sentry calls. Wraps the call in a
  defensive `.catch` so a future regression in the compensator can't
  poison the worker.
- `src/router/lock-state-classifier.ts` (new) — `classifyLockState` returns
  `'awaiting-slot'` when an active worker or queued/waiting job matches
  the trio, `'wedged'` when neither correlation matches. Defaults to
  `'awaiting-slot'` on classifier error so a Redis blip doesn't mis-emit
  the wedged canary.
- `src/router/active-workers.ts` — `getActiveWorkers()` now exposes
  `(projectId, workItemId, agentType)` so the classifier can correlate.
  Backwards-compatible (existing callers work unchanged; new fields are
  additive optional).
- `src/router/webhook-processor.ts` — Step 8 (work-item lock check) now
  splits the decision-reason vocabulary into three states:
    * `Job queued: ...` (success path)
    * `Awaiting worker slot: ...` (lock held + dispatch in flight; healthy)
    * `Work item locked (no active dispatch): ...` (wedged-lock canary)
  The wedged branch additionally fires `captureException` with
  `tags: { source: 'wedged_lock_canary' }` so any regression in
  compensation is loud in production.
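The compensator contract described above (clear all three marks, never propagate, report to Sentry) can be sketched like this. The lock-helper and extractor names come from the commit message, but their signatures here, the dependency-injection shape, and the job-data field names are assumptions for illustration:

```typescript
// Hypothetical sketch of releaseLocksForFailedJob; not the repo's code.
type FailedJob = { id?: string; data: Record<string, unknown> };

type CompensatorDeps = {
  clearWorkItemEnqueued: (projectId: string, workItemId: string) => Promise<void>;
  clearAgentTypeEnqueued: (projectId: string, agentType: string) => Promise<void>;
  clearRecentlyDispatched: (
    projectId: string, workItemId: string, agentType: string,
  ) => Promise<void>;
  captureException: (err: unknown, ctx: object) => void;
};

export async function releaseLocksForFailedJob(
  job: FailedJob,
  deps: CompensatorDeps,
): Promise<void> {
  try {
    const projectId = String(job.data.projectId);
    const workItemId = String(job.data.workItemId);
    const agentType = String(job.data.agentType);
    await deps.clearWorkItemEnqueued(projectId, workItemId);
    await deps.clearAgentTypeEnqueued(projectId, agentType);
    await deps.clearRecentlyDispatched(projectId, workItemId, agentType);
  } catch (err) {
    // Never propagate: a compensator bug must not poison the worker.
    deps.captureException(err, { tags: { source: "dispatch_compensator" } });
  }
}
```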

What this does NOT change (intentional, all in plan 015/2):
- `guardedSpawn` still throws on capacity (BullMQ marks the job failed,
  the compensator now releases the locks, but the job itself is still
  lost). Plan 2 replaces the throw with a wait-for-slot semaphore.
- Both queues still default to `attempts: 1`. Plan 2 raises this with
  exponential backoff and adds a transient/terminal error classifier.
- CLAUDE.md is intentionally not updated by this plan — the unified
  passage describing both halves of the new contract lands in plan 015/2.
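The `classifyLockState` decision described in the list above (match the trio against active workers or queued jobs, default to `'awaiting-slot'` on error) can be sketched as follows. The probe callbacks are assumptions standing in for the real active-worker and BullMQ queue lookups:

```typescript
// Hypothetical sketch of classifyLockState; not the repo's code.
type LockState = "awaiting-slot" | "wedged";

type LockStateProbes = {
  hasActiveWorker: (p: string, w: string, a: string) => Promise<boolean>;
  hasQueuedJob: (p: string, w: string, a: string) => Promise<boolean>;
};

export async function classifyLockState(
  projectId: string,
  workItemId: string,
  agentType: string,
  probes: LockStateProbes,
): Promise<LockState> {
  try {
    if (await probes.hasActiveWorker(projectId, workItemId, agentType)) {
      return "awaiting-slot"; // dispatch in flight: healthy
    }
    if (await probes.hasQueuedJob(projectId, workItemId, agentType)) {
      return "awaiting-slot"; // still queued/waiting: healthy
    }
    return "wedged"; // lock held but nothing in flight: fire the canary
  } catch {
    // A Redis blip must not mis-emit the wedged canary.
    return "awaiting-slot";
  }
}
```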

Tests:
- 5 new unit tests in `dispatch-compensator.test.ts`
- 3 new unit tests in `agent-type-lock.test.ts` for `clearRecentlyDispatched`
- 4 new unit tests in `bullmq-workers.test.ts` for the failed-event seam
- 5 new unit tests in `lock-state-classifier.test.ts`
- 2 new unit tests in `active-workers.test.ts` for the extended shape
- 4 new unit tests in `webhook-processor.test.ts` for the three-way taxonomy
- 3 new module-integration tests in
  `tests/integration/router/dispatch-failure-compensation.test.ts` exercise
  the real lock modules + real bullmq-workers.ts failed-event handler +
  real compensator end-to-end (only BullMQ's Worker constructor + the
  worker-env extractors are mocked).

Full suite: 8515 passed / 23 skipped / 0 failed. Lint + typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): lock 015/2 wait-for-slot-and-retry-classifier

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): mark 015/2 status: wip

* feat(router): plan 015/2 done — wait-for-slot + retry budget + classifier

Closes the lost-job half of spec 015's bug class. Combined with plan 015/1,
the silent black-hole failure mode verified live in prod on 2026-04-26
(ucho/MNG-350) is now fully closed.

What landed:

- `src/router/slot-waiter.ts` (new) — semaphore-style primitive:
  `acquireSlot({ timeoutMs })` resolves immediately when capacity is
  below `routerConfig.maxWorkers`, otherwise queues a FIFO waiter with
  a bounded timeout that rejects with `code: 'SLOT_WAIT_TIMEOUT'`.
  `slotReleased()` pops the head waiter; `clearAllWaiters()` rejects
  every pending waiter with `code: 'SHUTDOWN'` on router stop.
- `src/router/dispatch-error-classifier.ts` (new) — classifies thrown
  errors into `'transient'` (Docker socket Node codes, HTTP 429/409,
  SLOT_WAIT_TIMEOUT, anything unknown — default-to-retry) vs
  `'terminal'` (TypeError, ZodError, image-not-found-after-fallback).
- `src/router/worker-manager.ts` — `guardedSpawn` rewritten:
  `await acquireSlot(...)` replaces the synchronous capacity throw;
  on spawn error, terminal errors are wrapped in BullMQ's
  `UnrecoverableError` so retries skip; transient errors propagate
  unchanged so BullMQ retries via attempts/backoff.
- `src/router/active-workers.ts` — `cleanupWorker` now calls
  `slotReleased()` exactly once per cleanup, including on the crash
  path. The existing `if (worker)` guard ensures idempotence.
- `src/router/config.ts` — new `slotWaitTimeoutMs` field (default
  5min, configurable via `SLOT_WAIT_TIMEOUT_MS`).
- `src/router/queue.ts` and `src/queue/client.ts` — both queues now
  default to `attempts: 4` with `backoff: { type: 'exponential', delay: 5000 }`
  (~75s total before exhaustion). Terminal errors bypass via
  `UnrecoverableError`.
- `src/router/container-manager.ts` — exports the existing
  `isImageNotFoundError` predicate so the classifier can reuse it.

Test contract change (spec AC #9):

The previous `tests/unit/router/worker-manager.test.ts:179` assertion
`'processFn throws when at capacity'` is REPLACED (not deleted) with
`'processFn awaits a slot when at capacity, then dispatches when one
frees'`. The throw-on-capacity contract is gone forever.
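The transient/terminal split can be sketched as below. The classification rules (Docker socket Node codes, HTTP 429/409, `SLOT_WAIT_TIMEOUT`, default-to-retry for unknowns, `TypeError` as terminal) come from the commit message; the specific Node error codes in the set and the `statusCode` field shape are assumptions:

```typescript
// Hypothetical sketch of dispatch-error-classifier.ts; not the repo's code.
// Assumed set of Docker-socket Node error codes treated as transient.
const TRANSIENT_NODE_CODES = new Set(["ECONNRESET", "EPIPE", "ETIMEDOUT", "ENOENT"]);

export function classifyDispatchError(err: unknown): "transient" | "terminal" {
  // Programming errors never succeed on retry.
  if (err instanceof TypeError) return "terminal";
  const e = err as { code?: unknown; statusCode?: unknown } | null;
  if (e?.code === "SLOT_WAIT_TIMEOUT") return "transient";
  if (typeof e?.code === "string" && TRANSIENT_NODE_CODES.has(e.code)) return "transient";
  if (e?.statusCode === 429 || e?.statusCode === 409) return "transient";
  // Anything unknown defaults to retry, per the commit message.
  return "transient";
}
```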

Tests:

- 7 new unit tests in `slot-waiter.test.ts` (FIFO, timeout, no-op,
  shutdown rejection)
- 11 new unit tests in `dispatch-error-classifier.test.ts` covering
  every transient/terminal class
- 4 new unit tests in `worker-manager.test.ts` (replaced original
  capacity-throw test + 3 for retry classification)
- 3 new unit tests in `active-workers.test.ts` for slotReleased
  integration
- 5 new module-integration tests in `dispatch-retry.test.ts` exercise
  REAL guardedSpawn + REAL slot-waiter + REAL dispatch-error-classifier
  against both queues, mocking only spawnWorker + BullMQ Worker
  constructor.

Plan 1's 3 module-integration tests continue to pass alongside plan 2's 5.
Full unit suite: 8539 passed / 23 skipped / 0 failed.

CLAUDE.md updated with a new "Dispatch failure semantics" section
documenting the unified contract (capacity wait, retry budget,
classifier, three-way decision-reason taxonomy from plan 1, wedged-lock
canary). File now 182 lines, under the 200-line cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(spec): 015 done — router job dispatch failure recovery, all plans complete

Closes the silent black-hole bug class verified live on 2026-04-26
(ucho/MNG-350). Plan 1 added failed-event lock compensation +
three-way decision-reason taxonomy; plan 2 replaced the throw-on-capacity
with wait-for-slot, added bounded retry with exponential backoff, and
introduced a transient/terminal error classifier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps [postcss](https://github.com/postcss/postcss) from 8.5.8 to 8.5.12.
- [Release notes](https://github.com/postcss/postcss/releases)
- [Changelog](https://github.com/postcss/postcss/blob/main/CHANGELOG.md)
- [Commits](postcss/postcss@8.5.8...8.5.12)

---
updated-dependencies:
- dependency-name: postcss
  dependency-version: 8.5.12
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Removes [uuid](https://github.com/uuidjs/uuid). It's no longer used after updating ancestor dependencies [uuid](https://github.com/uuidjs/uuid), [bullmq](https://github.com/taskforcesh/bullmq) and [dockerode](https://github.com/apocas/dockerode). These dependencies need to be updated together.


Removes `uuid`

Updates `bullmq` from 5.72.0 to 5.76.2
- [Release notes](https://github.com/taskforcesh/bullmq/releases)
- [Commits](taskforcesh/bullmq@v5.72.0...v5.76.2)

Updates `dockerode` from 4.0.10 to 5.0.0
- [Release notes](https://github.com/apocas/dockerode/releases)
- [Commits](apocas/dockerode@v4.0.10...v5.0.0)

---
updated-dependencies:
- dependency-name: uuid
  dependency-version: 
  dependency-type: indirect
- dependency-name: bullmq
  dependency-version: 5.76.2
  dependency-type: direct:production
- dependency-name: dockerode
  dependency-version: 5.0.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@zbigniewsobiecki zbigniewsobiecki merged commit 5352887 into main Apr 26, 2026
16 of 17 checks passed