CHANGELOG.md — 2 additions, 0 deletions

@@ -6,6 +6,8 @@ All notable user-visible changes to CASCADE are documented here. The format is l

### Changed

- **Router dispatch capacity now waits for a slot; transient Docker errors retry; terminal errors fail fast** (spec 015, plan 2 of 2). Replaces `guardedSpawn`'s synchronous "No worker slots available" throw with an in-process slot-waiter (default 5min timeout, configurable via `SLOT_WAIT_TIMEOUT_MS`). Adds a dispatch-error classifier that splits transient (`ECONNREFUSED` / `ECONNRESET` / `ENOTFOUND` / HTTP 429 / container-name 409 / `SLOT_WAIT_TIMEOUT`) from terminal (`TypeError` / `ZodError` / image-not-found-after-fallback). Both `cascade-jobs` and `cascade-dashboard-jobs` queue defaults now specify `attempts: 4` with `backoff: { type: 'exponential', delay: 5000 }` (~75s total before exhaustion). Terminal errors are wrapped in BullMQ's `UnrecoverableError` so retries are skipped. Combined with plan 015/1, the original silent black-hole failure mode (verified live on 2026-04-26 via ucho/MNG-350) is fully closed: no more lost jobs on transient capacity misses or Docker hiccups, no more wedged locks. CLAUDE.md updated with the new "Dispatch failure semantics" passage. See [spec 015](docs/specs/015-router-job-dispatch-failure-recovery.md).
- **Router dispatch failures now release in-memory locks via the BullMQ failed event** (spec 015, plan 1 of 2). Hooks `worker.on('failed')` on both `cascade-jobs` and `cascade-dashboard-jobs` queues to call a new `releaseLocksForFailedJob` compensator that releases the work-item lock, agent-type concurrency counter, and recently-dispatched dedup mark for any job whose dispatch fails. Closes the stranded-lock half of the prod incident verified on 2026-04-26 (ucho/MNG-350): a transient capacity miss was leaving the in-memory work-item lock wedged for 30 minutes, silently rejecting subsequent webhooks for the same trio. Also splits the webhook decision-reason vocabulary into three states — `Job queued` (success), `Awaiting worker slot: …` (in-flight, healthy), `Work item locked (no active dispatch): …` (wedged-lock canary, fires a Sentry capture tagged `wedged_lock_canary` so any regression in compensation is loud). Plan 2 closes the lost-job half (wait-for-slot, retry budget, error classifier). See [spec 015](docs/specs/015-router-job-dispatch-failure-recovery.md).
- **`cascade-tools scm create-pr-review`: `--comment` alias + `--comments-file` escape hatch** (spec 014, plan 2 of 2). The command now accepts `--comment` (singular) as an alias for `--comments` — the exact muscle-memory mistake from prod run 5d993b04 now resolves correctly. Added `--comments-file <path>` (and `-` for stdin) as a JSON-parsed file alternative for long payloads that don't survive shell quoting. Zero edits to shared infrastructure (cliCommandFactory, manifestGenerator, nativeToolPrompts, errorEnvelope) — the two declarative fields on `createPRReviewDef.parameters.comments.cliAliases` + `createPRReviewDef.cli.fileInputAlternatives` are everything. Proves spec 014's single-entrypoint invariant: a new or evolved gadget should never need to touch shared machinery. See [spec 014](docs/specs/014-cascade-tools-agent-ergonomics.md).
- **`cascade-tools` agent ergonomics: truthful system prompt, runnable `--help`, structured error envelope** (spec 014, plan 1 of 2). The system-prompt renderer that describes every cascade-tools command to agents now tells the truth about array-shaped parameters — no more silent `s`-stripping of names, no more `<string> (repeatable)` claim for array-of-object flags (they correctly render as `--<flag> '<json>'` now, with aliases appended via `|` and a one-line runnable JSON example inlined from the tool definition's `examples` block). Every CLI failure — flag-parse, JSON-parse, missing-required, enum-mismatch, unknown-flag, auth, runtime — emits a single structured envelope on stdout (`{"success":false,"error":{type,flag?,message,got?,expected?,hint?,example?}}`) plus a short prose summary on stderr for humans, replacing the ad-hoc mix of `this.error()` prose and `{success:false,error:"<string>"}` flat shapes. Mistyped flags get a "did you mean" suggestion via Levenshtein match against declared canonical names + aliases. `--help` now renders `def.examples` as copy-pasteable shell invocations under an `EXAMPLES` section. Root-caused by prod run 5d993b04-6e05-4ae1-b7de-8c274cf3496b where a review agent wasted ~2½ min fighting the prior pre-014 surface and ultimately dropped an inline PR comment. See [spec 014](docs/specs/014-cascade-tools-agent-ergonomics.md) + authoring guide at [`src/gadgets/README.md`](src/gadgets/README.md).
- **`cascade-tools` now streams subprocess output live** (spec 013). The shared subprocess helper (on top of `execa` + `tree-kill`) forwards child stdout/stderr to the parent's stderr line-by-line as it arrives, emits a heartbeat line on stderr every 30 seconds of child silence (configurable), enforces both an idle-silence timeout (default 120s) and a wall-clock timeout (default 600s) with SIGTERM→SIGKILL escalation, and kills the full process tree on timeout. `git push` and `git commit` invoked by `scm create-pr` pass tighter per-caller timeouts and now return captured hook output in the result on success (previously discarded). Result shape is backward-compatible — `{ stdout, stderr, exitCode }` preserved; new optional `reason: 'idle-timeout' | 'wall-timeout'` surfaces when the helper killed the child. Motivation: LLM-driven CASCADE agents watching an output file could not distinguish a slow pre-push hook (~60s of silence) from a hung process, leading to retry loops that burned 5–10+ minutes of run budget. See [spec 013](docs/specs/013-subprocess-output-streaming.md).
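
A minimal sketch of the idle/wall-timeout behavior the spec 013 entry describes, built on `execa` + `tree-kill`. The `runStreaming` name and its option names are illustrative, not the helper's real exports; defaults mirror the entry (idle 120s, wall 600s, heartbeat 30s):

```typescript
import { execa } from 'execa';
import treeKill from 'tree-kill';

async function runStreaming(
  cmd: string,
  args: string[],
  { idleTimeoutMs = 120_000, wallTimeoutMs = 600_000, heartbeatMs = 30_000 } = {},
) {
  const child = execa(cmd, args, { reject: false });
  const started = Date.now();
  let lastOutput = started;
  let lastBeat = started;
  let reason: 'idle-timeout' | 'wall-timeout' | undefined;

  // Forward child output to our stderr as it arrives (the real helper splits
  // on newlines) and note the time so the idle timer resets.
  for (const stream of [child.stdout, child.stderr]) {
    stream?.on('data', (chunk: Buffer) => {
      lastOutput = Date.now();
      process.stderr.write(chunk);
    });
  }

  // SIGTERM first, then SIGKILL escalation across the whole process tree.
  const kill = (why: 'idle-timeout' | 'wall-timeout') => {
    if (reason) return;
    reason = why;
    treeKill(child.pid!, 'SIGTERM');
    setTimeout(() => treeKill(child.pid!, 'SIGKILL'), 5_000).unref();
  };

  const ticker = setInterval(() => {
    const now = Date.now();
    if (now - lastOutput >= heartbeatMs && now - lastBeat >= heartbeatMs) {
      process.stderr.write(`[heartbeat] child silent for ${Math.round((now - lastOutput) / 1000)}s\n`);
      lastBeat = now;
    }
    if (now - lastOutput >= idleTimeoutMs) kill('idle-timeout');
    else if (now - started >= wallTimeoutMs) kill('wall-timeout');
  }, 1_000);

  const result = await child;
  clearInterval(ticker);
  // Backward-compatible result shape; `reason` is only set when we killed it.
  return { stdout: result.stdout, stderr: result.stderr, exitCode: result.exitCode, reason };
}
```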
CLAUDE.md — 13 additions, 0 deletions

@@ -123,6 +123,19 @@ Some triggers take params (e.g. `review` + `scm:check-suite-success` accepts `{"

**Worker exit diagnostics** — when a worker container exits non-zero, the router calls `container.inspect()` *before* AutoRemove reaps it and stamps the run record's `error` field with a structured, grep-stable string: `Worker crashed with exit code N · OOMKilled=<true|false> · reason="<State.Error>"`. The `OOMKilled=true` marker is the definitive cgroup-OOM signal (per Docker's own `State.OOMKilled`); a 137 exit *without* `OOMKilled=true` means the kill came from inside the container or from a non-cgroup signal — *not* memory. The `[WorkerManager] Resolved spawn settings` log emitted at every spawn includes both `projectWatchdogTimeoutMs` and `globalWorkerTimeoutMs` so post-mortems can confirm whether the per-project override actually won. See `src/router/active-workers.ts:formatCrashReason` for the format and `tests/unit/router/container-manager-diagnostics.test.ts` for regression pins.
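
A sketch of that format, assuming the `.State` field names Docker's `container.inspect()` returns; the real formatter lives at `src/router/active-workers.ts:formatCrashReason`:

```typescript
// Fields Docker reports under inspect().State.
interface WorkerExitState {
  ExitCode: number;
  OOMKilled: boolean;
  Error: string;
}

// Grep-stable crash string, per the format documented above.
function formatCrashReason(state: WorkerExitState): string {
  return (
    `Worker crashed with exit code ${state.ExitCode}` +
    ` · OOMKilled=${state.OOMKilled}` +
    ` · reason="${state.Error}"`
  );
}

// 137 without OOMKilled=true: the kill came from inside the container or a
// non-cgroup signal, not from the kernel OOM killer.
// formatCrashReason({ ExitCode: 137, OOMKilled: false, Error: '' })
//   → 'Worker crashed with exit code 137 · OOMKilled=false · reason=""'
```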

**Dispatch failure semantics** — spec 015 (verified live in prod via the ucho/MNG-350 incident on 2026-04-26):

- **Capacity miss waits, never throws.** When the dispatcher pulls a job and the worker pool is at `maxWorkers`, it `await`s a slot via the in-process slot-waiter (default `slotWaitTimeoutMs` = 5min). The slot is conceptually held by the running container — `slotReleased()` is called once per cleanup from `cleanupWorker`, never from the dispatcher. A minimal slot-waiter sketch follows this list.
- **Transient Docker errors retry.** `ECONNREFUSED` / `ECONNRESET` / `ENOTFOUND` on the Docker socket, registry HTTP 429, container-name 409 collisions, and the `SLOT_WAIT_TIMEOUT` itself all classify as transient and propagate unchanged so BullMQ retries via `attempts: 4` + `backoff: { type: 'exponential', delay: 5000 }` (~75s total before exhaustion). Both `cascade-jobs` and `cascade-dashboard-jobs` use the same retry config. A classifier sketch follows after the canary note below.
- **Terminal errors fail fast.** `TypeError` / `ZodError` (validation) and image-not-found *after* fallback exhaustion are wrapped in BullMQ's `UnrecoverableError`, which skips the retry budget entirely.
- **Failed-event compensation releases locks.** Every dispatch failure (transient retry exhaustion, terminal error, slot-wait timeout exhaustion) flows through `worker.on('failed')`, which calls `releaseLocksForFailedJob` to release the work-item lock, agent-type counter, and recently-dispatched dedup mark. Without this, the locks leak for ~30min and silently reject every follow-up webhook for the same trio.
- **Webhook decision reasons are three-way.** When the work-item lock check rejects a webhook, the message distinguishes:
- `Job queued: ...` (success — not a lock rejection)
- `Awaiting worker slot: ...` (lock held + dispatch in flight — healthy)
- `Work item locked (no active dispatch): ...` (wedged-lock canary — the lock-state classifier could correlate the lock count with neither an active worker nor a queued/waiting BullMQ job; this fires a Sentry capture tagged `wedged_lock_canary` so any regression in compensation is loud)
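
The slot-waiter sketch referenced in the first bullet: a minimal in-process implementation. Only `slotWaitTimeoutMs`, `slotReleased()`, and the `SLOT_WAIT_TIMEOUT` error name come from the text above; `SlotPool` and its queue-of-resolvers shape are assumptions for illustration:

```typescript
class SlotPool {
  private inUse = 0;
  private waiters: Array<() => void> = [];

  constructor(
    private readonly maxWorkers: number,
    private readonly slotWaitTimeoutMs = 5 * 60_000,
  ) {}

  /** Resolves when a slot frees up; rejects with SLOT_WAIT_TIMEOUT (which
   *  classifies as transient, so BullMQ retries) if none does in time. */
  async acquireSlot(): Promise<void> {
    if (this.inUse < this.maxWorkers) {
      this.inUse++;
      return;
    }
    await new Promise<void>((resolve, reject) => {
      const timer = setTimeout(() => {
        this.waiters = this.waiters.filter((w) => w !== wake);
        reject(new Error('SLOT_WAIT_TIMEOUT'));
      }, this.slotWaitTimeoutMs);
      const wake = () => {
        clearTimeout(timer);
        this.inUse++;
        resolve();
      };
      this.waiters.push(wake);
    });
  }

  /** Called exactly once per container cleanup (cleanupWorker), never by the
   *  dispatcher: the running container conceptually holds the slot. */
  slotReleased(): void {
    this.inUse--;
    this.waiters.shift()?.();
  }
}
```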

The wedged-lock canary should never fire under normal operation; its absence is a regression invariant. If it shows up in webhook logs or Sentry, some code path acquired a lock without registering its compensation.
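
A sketch of the transient/terminal split and shared retry budget described above. The helper names (`isTransientDispatchError`, `classifyAndRethrow`) and message-text matching are assumptions; only the trigger lists, `UnrecoverableError`, and the queue options come from the text:

```typescript
import { UnrecoverableError } from 'bullmq';
import { ZodError } from 'zod';

// Transient triggers, per the bullets above: Docker-socket connection
// failures, registry 429s, container-name 409 collisions, and the
// slot-waiter's own timeout.
function isTransientDispatchError(err: Error): boolean {
  return (
    /ECONNREFUSED|ECONNRESET|ENOTFOUND/.test(err.message) ||
    /\b429\b/.test(err.message) ||
    /\b409\b/.test(err.message) ||
    err.message.includes('SLOT_WAIT_TIMEOUT')
  );
}

function classifyAndRethrow(err: unknown): never {
  if (err instanceof TypeError || err instanceof ZodError) {
    throw new UnrecoverableError(err.message); // terminal: skip all retries
  }
  if (err instanceof Error && isTransientDispatchError(err)) {
    throw err; // transient: BullMQ retries per the options below
  }
  // The real classifier also treats image-not-found-after-fallback as
  // terminal; how other unlisted errors default is not pinned down here.
  throw err;
}

// Shared retry budget on both queues, from the text above.
const jobOptions = {
  attempts: 4,
  backoff: { type: 'exponential' as const, delay: 5000 },
};
```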

## Review agent — context shape (debugging)

Review agent receives a **compact per-file diff context**, not full file contents. Each changed file is a `### <file> (<status>, +N -M)` section with a unified diff hunk. Budget: `REVIEW_DIFF_CONTEXT_TOKEN_LIMIT` = 200k tokens, per-file cap 10%.
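
An illustrative per-file section in that shape (file path, counts, and hunk contents are invented for the example):

```
### src/example/file.ts (modified, +2 -1)

@@ -10,2 +10,3 @@
 unchanged context line
-removed line
+added line
+added line
```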