5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,11 @@ All notable user-visible changes to CASCADE are documented here. The format is l

### Changed

- **PM image delivery: Linear GraphQL fixture + extraction-coverage regression test** (spec 016, plan 3 of 3). Captures a reconstructed Linear `Issue` GraphQL payload at `tests/fixtures/linear-issue-with-screenshot.json` containing extension-less and extensioned inline-pasted images (description + comment bodies) plus formal Attachment records (Slack/GitHub/Sentry link previews) that must NOT be mistaken for inline images. The unit test at `tests/unit/pm/linear/extraction-coverage.test.ts` pins the contract and fails loudly with a specific URL-missing message if Linear ever changes its payload shape in a way that loses inline images. Documents the conclusion in `src/integrations/README.md`: `Issue.description` markdown is canonical for Linear inline images; `Issue.attachments` is the wrong surface (formal Attachment records, not pastes). No production code change — this plan ships the regression net for the contract Plans 1+2 established. See [spec 016](docs/specs/016-pm-image-delivery-reliability.md).
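  A minimal sketch of the contract that coverage test pins. `extractInlineImageUrls` and the fixture content here are illustrative assumptions, not the real extractor; the actual test reads `tests/fixtures/linear-issue-with-screenshot.json`:

  ```typescript
  // Illustrative only: the real extractor lives in the Linear integration.
  // Inline pastes in Linear markdown render as ![alt](url), with or without
  // a file extension on the uploads.linear.app path.
  function extractInlineImageUrls(markdown: string): string[] {
    return [...markdown.matchAll(/!\[[^\]]*\]\(([^)\s]+)\)/g)].map((m) => m[1]);
  }

  const description = [
    'Repro steps below.',
    '![screenshot](https://uploads.linear.app/3f2a9c)', // extension-less paste
    '![trace](https://uploads.linear.app/3f2a9c/trace.png)', // extensioned paste
  ].join('\n');

  const found = extractInlineImageUrls(description);
  if (found.length !== 2) {
    // Fail loudly, mirroring the test's specific URL-missing message
    throw new Error(`inline image lost from Issue.description: found ${found.length} of 2`);
  }
  ```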
- **PM image delivery: runtime `cascade-tools pm read-work-item` gadget now delivers images on disk** (spec 016, plan 2 of 3). The runtime gadget that agents call mid-run for a work item used to return text only — its "Pre-fetched Images" section listed URL refs but no local file paths, so an agent that needed to re-read a work item (e.g. after a teammate added a screenshot) had no way to actually see the new image. After this plan, the gadget downloads any image media present and writes it to `.cascade/context/images/work-item-<id>-img-<index>.<ext>` (extension derived from the resolved Content-Type MIME), then returns text whose new "Local Image Files" section lists actual file paths the agent's file-read tool can consume. Failed downloads are surfaced in a "Failed Image Downloads" subsection so they're never silently dropped. Same diagnostic log line as the boot path (`[image-pipeline] work-item-fetch summary`) — operators see consistent shape across boot and runtime fetches. Closes the mid-run pickup gap. See [spec 016](docs/specs/016-pm-image-delivery-reliability.md).
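  A sketch of the on-disk naming scheme. The path template is quoted from this entry; `localImagePath`, the MIME map, and the `bin` fallback are illustrative assumptions:

  ```typescript
  // Illustrative sketch: map the resolved Content-Type to a file extension
  // and build the local path the gadget reports under "Local Image Files".
  const EXT_BY_MIME: Record<string, string> = {
    'image/png': 'png',
    'image/jpeg': 'jpg',
    'image/gif': 'gif',
    'image/webp': 'webp',
  };

  function localImagePath(workItemId: string, index: number, contentType: string): string {
    const mime = contentType.split(';')[0].trim().toLowerCase(); // drop charset params
    const ext = EXT_BY_MIME[mime] ?? 'bin'; // fallback is an assumption of this sketch
    return `.cascade/context/images/work-item-${workItemId}-img-${index}.${ext}`;
  }

  console.log(localImagePath('MNG-357', 0, 'image/png'));
  // → .cascade/context/images/work-item-MNG-357-img-0.png
  ```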
- **PM image delivery: extension-less Linear pasted-image URLs are no longer dropped at the pre-download MIME filter** (spec 016, plan 1 of 3). Linear's `https://uploads.linear.app/<uuid>` URLs (with no file extension in the pathname) used to fall through `mimeTypeFromUrl` to `application/octet-stream` and were silently filtered out by `filterImageMedia` before the download loop ran. The fix introduces an `image/*` wildcard sentinel for trusted PM-provider upload hosts (allowlisted by hostname); `isImageMimeType` now accepts the wildcard, and the download response's `Content-Type` header resolves it to a concrete MIME (`image/png`, etc.) before any image is written. The shared `downloadAndPrepareImages` helper consolidates the per-provider download dispatch (jira/linear/trello) so both the boot-path and the runtime gadget (spec 016 plan 2) share one code path. Adds AC#5's grep-stable diagnostic line — `[image-pipeline] work-item-fetch summary` — emitted once per work-item-fetch with stable fields (`provider`, `workItemId`, `urlsDetected`, `urlsAfterFilter`, `urlsDownloaded`, `urlsFailed`, `urlsByMimeType`). Closes the silent screenshot-drop bug class verified live on 2026-04-26 (ucho/MNG-357). See [spec 016](docs/specs/016-pm-image-delivery-reliability.md).
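  A sketch of the wildcard-sentinel idea. The function names mirror this entry (`mimeTypeFromUrl`, `isImageMimeType`), but the bodies and the allowlist contents are illustrative assumptions, not the shipped code:

  ```typescript
  // Illustrative sketch of the image/* sentinel for trusted PM upload hosts.
  const TRUSTED_UPLOAD_HOSTS = new Set(['uploads.linear.app']);
  const IMAGE_WILDCARD = 'image/*';

  function mimeTypeFromUrl(url: string): string {
    const { hostname, pathname } = new URL(url);
    const ext = pathname.match(/\.(\w+)$/)?.[1]?.toLowerCase();
    if (ext === 'png') return 'image/png';
    if (ext === 'jpg' || ext === 'jpeg') return 'image/jpeg';
    if (ext === 'gif') return 'image/gif';
    // Extension-less URL on an allowlisted host: assume image, and let the
    // download response's Content-Type resolve the concrete MIME later.
    if (!ext && TRUSTED_UPLOAD_HOSTS.has(hostname)) return IMAGE_WILDCARD;
    return 'application/octet-stream'; // previously this silently dropped the URL
  }

  function isImageMimeType(mime: string): boolean {
    return mime === IMAGE_WILDCARD || mime.startsWith('image/');
  }

  console.log(mimeTypeFromUrl('https://uploads.linear.app/3f2a9c')); // → image/*
  ```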
- **Router dispatch capacity now waits for a slot; transient Docker errors retry; terminal errors fail fast** (spec 015, plan 2 of 2). Replaces `guardedSpawn`'s synchronous "No worker slots available" throw with an in-process slot-waiter (default 5min timeout, configurable via `SLOT_WAIT_TIMEOUT_MS`). Adds a dispatch-error classifier that splits transient (`ECONNREFUSED` / `ECONNRESET` / `ENOTFOUND` / HTTP 429 / container-name 409 / `SLOT_WAIT_TIMEOUT`) from terminal (`TypeError` / `ZodError` / image-not-found-after-fallback). Both `cascade-jobs` and `cascade-dashboard-jobs` queue defaults now specify `attempts: 4` with `backoff: { type: 'exponential', delay: 5000 }` (~75s total before exhaustion). Terminal errors are wrapped in BullMQ's `UnrecoverableError` so retries are skipped. Combined with plan 015/1, the original silent black-hole failure mode (verified live on 2026-04-26 via ucho/MNG-350) is fully closed: no more lost jobs on transient capacity misses or Docker hiccups, no more wedged locks. CLAUDE.md updated with the new "Dispatch failure semantics" passage. See [spec 015](docs/specs/015-router-job-dispatch-failure-recovery.md).
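  A sketch of the classifier split. The transient and terminal lists are from this entry; the function shape and input type are illustrative:

  ```typescript
  // Transient errors are rethrown unchanged so BullMQ retries them
  // (attempts: 4, exponential backoff from 5s); terminal errors fail fast.
  const TRANSIENT_CODES = new Set(['ECONNREFUSED', 'ECONNRESET', 'ENOTFOUND', 'SLOT_WAIT_TIMEOUT']);
  const TRANSIENT_HTTP = new Set([429, 409]); // registry throttle, container-name collision

  function isTransientDispatchError(err: { name?: string; code?: string; statusCode?: number }): boolean {
    if (err.name === 'TypeError' || err.name === 'ZodError') return false; // terminal: fail fast
    if (err.code !== undefined && TRANSIENT_CODES.has(err.code)) return true;
    if (err.statusCode !== undefined && TRANSIENT_HTTP.has(err.statusCode)) return true;
    return false;
  }
  ```

  In the real dispatcher, a terminal classification would be wrapped in BullMQ's `UnrecoverableError` before rethrowing so the retry budget never runs.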
- **Router dispatch failures now release in-memory locks via the BullMQ failed event** (spec 015, plan 1 of 2). Hooks `worker.on('failed')` on both `cascade-jobs` and `cascade-dashboard-jobs` queues to call a new `releaseLocksForFailedJob` compensator that releases the work-item lock, agent-type concurrency counter, and recently-dispatched dedup mark for any job whose dispatch fails. Closes the stranded-lock half of the prod incident verified on 2026-04-26 (ucho/MNG-350): a transient capacity miss was leaving the in-memory work-item lock wedged for 30 minutes, silently rejecting subsequent webhooks for the same trio. Also splits the webhook decision-reason vocabulary into three states — `Job queued` (success), `Awaiting worker slot: …` (in-flight, healthy), `Work item locked (no active dispatch): …` (wedged-lock canary, fires a Sentry capture tagged `wedged_lock_canary` so any regression in compensation is loud). Plan 2 closes the lost-job half (wait-for-slot, retry budget, error classifier). See [spec 015](docs/specs/015-router-job-dispatch-failure-recovery.md).
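  A sketch of the compensator's shape. Only the name `releaseLocksForFailedJob` and the three released resources are from this entry; the lock-store representation is an illustrative assumption:

  ```typescript
  // Illustrative in-memory lock store; the real router's state is richer.
  interface LockStore {
    workItemLocks: Set<string>;
    agentTypeCounts: Map<string, number>;
    recentlyDispatched: Set<string>;
  }

  function releaseLocksForFailedJob(store: LockStore, job: { workItemId: string; agentType: string }): void {
    store.workItemLocks.delete(job.workItemId); // work-item lock
    const n = store.agentTypeCounts.get(job.agentType) ?? 0;
    if (n > 0) store.agentTypeCounts.set(job.agentType, n - 1); // concurrency counter
    store.recentlyDispatched.delete(`${job.agentType}:${job.workItemId}`); // dedup mark
  }

  // Wired on both queues so no dispatch failure path can strand a lock:
  //   worker.on('failed', (job) => releaseLocksForFailedJob(store, job.data));
  ```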
- **`cascade-tools scm create-pr-review`: `--comment` alias + `--comments-file` escape hatch** (spec 014, plan 2 of 2). The command now accepts `--comment` (singular) as an alias for `--comments` — the exact muscle-memory mistake from prod run 5d993b04 now resolves correctly. Added `--comments-file <path>` (and `-` for stdin) as a JSON-parsed file alternative for long payloads that don't survive shell quoting. Zero edits to shared infrastructure (cliCommandFactory, manifestGenerator, nativeToolPrompts, errorEnvelope) — the two declarative fields on `createPRReviewDef.parameters.comments.cliAliases` + `createPRReviewDef.cli.fileInputAlternatives` are everything. Proves spec 014's single-entrypoint invariant: a new or evolved gadget should never need to touch shared machinery. See [spec 014](docs/specs/014-cascade-tools-agent-ergonomics.md).
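  A sketch of what "two declarative fields are everything" might look like. The field names come from this entry; the surrounding shape is illustrative, not the real tool definition:

  ```typescript
  // Illustrative shape only: the real definition carries many more fields.
  // The point is that these two declarative entries are the entire change.
  const createPRReviewDef = {
    parameters: {
      comments: {
        type: 'array',
        cliAliases: ['comment'], // --comment (singular) now resolves to --comments
      },
    },
    cli: {
      // --comments-file <path> reads a JSON array from a file; '-' reads stdin
      fileInputAlternatives: { comments: 'comments-file' },
    },
  };
  ```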
- **`cascade-tools` agent ergonomics: truthful system prompt, runnable `--help`, structured error envelope** (spec 014, plan 1 of 2). The system-prompt renderer that describes every cascade-tools command to agents now tells the truth about array-shaped parameters — no more silent `s`-stripping of names, no more `<string> (repeatable)` claim for array-of-object flags (they correctly render as `--<flag> '<json>'` now, with aliases appended via `|` and a one-line runnable JSON example inlined from the tool definition's `examples` block). Every CLI failure — flag-parse, JSON-parse, missing-required, enum-mismatch, unknown-flag, auth, runtime — emits a single structured envelope on stdout (`{"success":false,"error":{type,flag?,message,got?,expected?,hint?,example?}}`) plus a short prose summary on stderr for humans, replacing the ad-hoc mix of `this.error()` prose and `{success:false,error:"<string>"}` flat shapes. Mistyped flags get a "did you mean" suggestion via Levenshtein match against declared canonical names + aliases. `--help` now renders `def.examples` as copy-pasteable shell invocations under an `EXAMPLES` section. Root-caused by prod run 5d993b04-6e05-4ae1-b7de-8c274cf3496b where a review agent wasted ~2½ min fighting the prior pre-014 surface and ultimately dropped an inline PR comment. See [spec 014](docs/specs/014-cascade-tools-agent-ergonomics.md) + authoring guide at [`src/gadgets/README.md`](src/gadgets/README.md).
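  A sketch of the envelope plus the "did you mean" path. The envelope field names match this entry; `unknownFlagEnvelope`, the distance cutoff, and the Levenshtein implementation are illustrative assumptions:

  ```typescript
  // Standard dynamic-programming edit distance, used to suggest the nearest
  // declared flag name when an agent mistypes one.
  function levenshtein(a: string, b: string): number {
    const d = Array.from({ length: a.length + 1 }, (_, i) =>
      Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
    );
    for (let i = 1; i <= a.length; i++)
      for (let j = 1; j <= b.length; j++)
        d[i][j] = Math.min(
          d[i - 1][j] + 1, // deletion
          d[i][j - 1] + 1, // insertion
          d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
        );
    return d[a.length][b.length];
  }

  function unknownFlagEnvelope(flag: string, known: string[]) {
    const nearest = known.reduce((best, k) =>
      levenshtein(flag, k) < levenshtein(flag, best) ? k : best,
    );
    return {
      success: false,
      error: {
        type: 'unknown-flag',
        flag,
        message: `Unknown flag --${flag}`,
        hint: levenshtein(flag, nearest) <= 2 ? `Did you mean --${nearest}?` : undefined,
      },
    };
  }
  ```

  The envelope goes to stdout as one JSON object; a short prose summary of the same failure goes to stderr for humans.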
- **`cascade-tools` now streams subprocess output live** (spec 013). The shared subprocess helper (on top of `execa` + `tree-kill`) forwards child stdout/stderr to the parent's stderr line-by-line as it arrives, emits a heartbeat line on stderr every 30 seconds of child silence (configurable), enforces both an idle-silence timeout (default 120s) and a wall-clock timeout (default 600s) with SIGTERM→SIGKILL escalation, and kills the full process tree on timeout. `git push` and `git commit` invoked by `scm create-pr` pass tighter per-caller timeouts and now return captured hook output in the result on success (previously discarded). Result shape is backward-compatible — `{ stdout, stderr, exitCode }` preserved; new optional `reason: 'idle-timeout' | 'wall-timeout'` surfaces when the helper killed the child. Motivation: LLM-driven CASCADE agents watching an output file could not distinguish a slow pre-push hook (~60s of silence) from a hung process, leading to retry loops that burned 5–10+ minutes of run budget. See [spec 013](docs/specs/013-subprocess-output-streaming.md).
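  A sketch of the idle-silence accounting only, assuming a hypothetical `makeIdleWatchdog`; the real helper also runs the wall-clock timer, heartbeat lines, and SIGTERM→SIGKILL escalation, none of which are shown here:

  ```typescript
  // Illustrative sketch: every forwarded child stdout/stderr line "touches"
  // the watchdog; idleMs of silence with no touch fires onIdle.
  function makeIdleWatchdog(idleMs: number, onIdle: () => void) {
    let timer: ReturnType<typeof setTimeout> = setTimeout(onIdle, idleMs);
    return {
      touch(): void {
        clearTimeout(timer); // activity resets the idle clock
        timer = setTimeout(onIdle, idleMs);
      },
      stop(): void {
        clearTimeout(timer); // child exited; nothing should fire
      },
    };
  }
  ```

  This is why a slow pre-push hook no longer looks like a hang: its output lines keep touching the watchdog, and the heartbeat covers genuinely silent stretches.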
13 changes: 13 additions & 0 deletions CLAUDE.md
@@ -123,6 +123,19 @@ Some triggers take params (e.g. `review` + `scm:check-suite-success` accepts `{"

**Worker exit diagnostics** — when a worker container exits non-zero, the router calls `container.inspect()` *before* AutoRemove reaps it and stamps the run record's `error` field with a structured, grep-stable string: `Worker crashed with exit code N · OOMKilled=<true|false> · reason="<State.Error>"`. The `OOMKilled=true` marker is the definitive cgroup-OOM signal (per Docker's own `State.OOMKilled`); a 137 exit *without* `OOMKilled=true` means the kill came from inside the container or from a non-cgroup signal — *not* memory. The `[WorkerManager] Resolved spawn settings` log emitted at every spawn includes both `projectWatchdogTimeoutMs` and `globalWorkerTimeoutMs` so post-mortems can confirm whether the per-project override actually won. See `src/router/active-workers.ts:formatCrashReason` for the format and `tests/unit/router/container-manager-diagnostics.test.ts` for regression pins.
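A sketch of the grep-stable crash string documented above; the format is quoted from this passage, but the real implementation is `src/router/active-workers.ts:formatCrashReason`, and this function body is illustrative:

```typescript
// Illustrative: stamp the run record's error field with a stable,
// greppable shape built from container.inspect() output.
function formatCrashReason(exitCode: number, oomKilled: boolean, stateError: string): string {
  return `Worker crashed with exit code ${exitCode} · OOMKilled=${oomKilled} · reason="${stateError}"`;
}

// Exit 137 with OOMKilled=false: the kill was NOT a cgroup OOM
console.log(formatCrashReason(137, false, ''));
```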

**Dispatch failure semantics** — spec 015 (verified live in prod via the ucho/MNG-350 incident on 2026-04-26):

- **Capacity miss waits, never throws.** When the dispatcher pulls a job and the worker pool is at `maxWorkers`, it `await`s a slot via the in-process slot-waiter (default `slotWaitTimeoutMs` = 5min). The slot is conceptually held by the running container — `slotReleased()` is called once per cleanup from `cleanupWorker`, never from the dispatcher.
- **Transient Docker errors retry.** `ECONNREFUSED` / `ECONNRESET` / `ENOTFOUND` on the Docker socket, registry HTTP 429, container-name 409 collisions, and the `SLOT_WAIT_TIMEOUT` itself all classify as transient and propagate unchanged so BullMQ retries via `attempts: 4` + `backoff: { type: 'exponential', delay: 5000 }` (~75s total before exhaustion). Both `cascade-jobs` and `cascade-dashboard-jobs` use the same retry config.
- **Terminal errors fail fast.** `TypeError` / `ZodError` (validation) and image-not-found *after* fallback exhaustion are wrapped in BullMQ's `UnrecoverableError`, which skips the retry budget entirely.
- **Failed-event compensation releases locks.** Every dispatch failure (transient retry exhaustion, terminal error, slot-wait timeout exhaustion) flows through `worker.on('failed')`, which calls `releaseLocksForFailedJob` to release the work-item lock, agent-type counter, and recently-dispatched dedup mark. Without this, the locks leak for ~30min and every follow-up webhook for the same trio is silently rejected.
- **Webhook decision reasons are three-way.** When the work-item lock check rejects a webhook, the message distinguishes:
- `Job queued: ...` (success — not a lock rejection)
- `Awaiting worker slot: ...` (lock held + dispatch in flight — healthy)
- `Work item locked (no active dispatch): ...` (wedged-lock canary — the lock-state classifier could correlate the lock count with neither an active worker nor a queued/waiting BullMQ job; this fires a Sentry capture tagged `wedged_lock_canary` so any regression in compensation is loud)

The wedged-lock canary should never fire under normal operation. Its presence in webhook logs or Sentry is itself a regression invariant: a code path acquired a lock without registering its compensation.
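The three-way vocabulary above can be sketched as a classifier over lock state. The inputs here are illustrative; the real classifier correlates the lock count with live worker state and queued/waiting BullMQ jobs:

```typescript
// Illustrative lock-state inputs; message prefixes match the decision
// vocabulary documented above.
type LockState = { lockHeld: boolean; activeWorker: boolean; queuedJob: boolean };

function decisionReasonPrefix(s: LockState): string {
  if (!s.lockHeld) return 'Job queued'; // success, not a lock rejection
  if (s.activeWorker || s.queuedJob) return 'Awaiting worker slot'; // healthy in-flight
  // Lock held but nothing explains it: wedged-lock canary, Sentry capture
  return 'Work item locked (no active dispatch)';
}
```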

## Review agent — context shape (debugging)

Review agent receives a **compact per-file diff context**, not full file contents. Each changed file is a `### <file> (<status>, +N -M)` section with a unified diff hunk. Budget: `REVIEW_DIFF_CONTEXT_TOKEN_LIMIT` = 200k tokens, per-file cap 10%.