spec 016: PM image delivery reliability by zbigniewsobiecki · Pull Request #1209 · mongrel-intelligence/cascade

zbigniewsobiecki · 2026-04-26T19:45:05Z

Summary

Closes the silent screenshot-drop bug class verified live in prod on 2026-04-26 (ucho/MNG-357). A user attached a screenshot to Linear card MNG-357; the planning agent ran on Codex and reported back that "the image asset was not present under .cascade/context/images/ in this workspace" — completely bypassing the user-attached visual context.

Root cause (3 layers stacked):

1. Linear's user-pasted-image URLs are extension-less (https://uploads.linear.app/<uuid>), so mimeTypeFromUrl returned application/octet-stream and filterImageMedia dropped them as non-images before the download loop ran.
2. The runtime cascade-tools pm read-work-item gadget was text-only: even if upstream had delivered the images, an agent re-reading the work item mid-run had no recovery path.
3. The instrumentation only logged post-download outcomes. The upstream extract/filter step was invisible, which made this exact incident class undiagnosable without source-diving.
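The first layer can be reproduced in a few lines. This is a minimal sketch of the pre-fix behaviour, not the code in src/pm/media.ts: the extension table, the `MediaRef` shape, the `Legacy` suffixes and the sample UUID are assumptions made for illustration. Only the names mimeTypeFromUrl and filterImageMedia and the application/octet-stream fallback come from the description above.

```typescript
// Hypothetical reconstruction of the pre-fix filter; not the real implementation.
interface MediaRef {
  url: string;
  mimeType: string;
}

// Assumed extension table, for illustration only.
const EXTENSION_TO_MIME: Record<string, string> = {
  png: "image/png",
  jpg: "image/jpeg",
  jpeg: "image/jpeg",
  gif: "image/gif",
  webp: "image/webp",
};

// Guess the MIME type from the URL's file extension alone.
function mimeTypeFromUrlLegacy(url: string): string {
  const lastSegment = new URL(url).pathname.split("/").pop() ?? "";
  const dot = lastSegment.lastIndexOf(".");
  if (dot === -1) return "application/octet-stream"; // no extension: type unknown
  const ext = lastSegment.slice(dot + 1).toLowerCase();
  return EXTENSION_TO_MIME[ext] ?? "application/octet-stream";
}

// Keep only entries whose guessed MIME type is an image.
function filterImageMediaLegacy(media: MediaRef[]): MediaRef[] {
  return media.filter((m) => m.mimeType.startsWith("image/"));
}

const detected: MediaRef[] = [
  "https://uploads.linear.app/00000000-0000-4000-8000-000000000000", // made-up UUID
  "https://example.com/diagram.png",
].map((url) => ({ url, mimeType: mimeTypeFromUrlLegacy(url) }));

// The filter runs before any download, so the Linear URL never gets fetched.
const survivors = filterImageMediaLegacy(detected);
console.log(survivors.map((m) => m.url));
```

Under these assumptions only the `.png` URL survives; the pasted screenshot is discarded before any request is made, and nothing downstream can log it.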

This was a black-hole pattern just like spec 015's wedged-lock incident: the agent silently failed in a way that looked broken to the user but produced no actionable diagnostics. Three plans landed under spec 016 with safety-net-first sequencing:

Plan 1 (b71b0b95) — boot-path MIME fix + diagnostic log line. Defers MIME authority to the download response's Content-Type header via an image/* wildcard sentinel for trusted PM upload hosts (currently uploads.linear.app); isImageMimeType accepts the wildcard. New shared downloadAndPrepareImages helper consolidates per-provider download dispatch (Plan 2 imports it). Adds the AC#5 grep-stable diagnostic line [image-pipeline] work-item-fetch summary with stable fields. Independently fixes MNG-357.
Plan 2 (f96b99cb) — runtime gadget image delivery. The cascade-tools pm read-work-item gadget now downloads any image media and writes it to .cascade/context/images/work-item-<id>-img-<index>.<ext>; the gadget's text response includes a new "Local Image Files" section listing actual paths. Closes the mid-run pickup gap. Same diagnostic log line as boot path.
Plan 3 (1de2a150) — Linear GraphQL fixture + extraction-coverage regression test. Reconstructed payload at tests/fixtures/linear-issue-with-screenshot.json covering extension-less + extensioned + external + comment + Attachment-records-not-mistaken-for-images. Test fails LOUDLY with a specific URL-missing message if Linear ever changes its payload shape. Documents the conclusion in src/integrations/README.md: Issue.description markdown is canonical; Issue.attachments is the wrong surface for inline images.

PR #948's Claude-Code initial-input ImageBlockParam path is untouched — existing regression test (claude-code.test.ts:939) confirms.

Doc impact: CHANGELOG.md gets one entry per plan. src/integrations/README.md gains a new "Image delivery contract" top-level section (Plan 1) plus a "Linear: GraphQL surface for inline images" subsection (Plan 3). CLAUDE.md untouched — already covered by spec 015's broader silent-failure → diagnostic-line pattern.

Test plan

CI green: `npm run lint`, `npm run typecheck`, `npm test`, `npm run test:integration` all pass
Spot-check the new tests: `npx vitest run --project unit-core tests/unit/pm/media.test.ts tests/unit/pm/download-and-prepare.test.ts tests/unit/gadgets/pm/core/writeRuntimeImages.test.ts tests/unit/gadgets/pm/core/readWorkItem.test.ts tests/unit/pm/linear/extraction-coverage.test.ts tests/unit/agents/definitions/contextSteps.test.ts` (locally: 50+ new tests across these files all pass)
Spot-check integration: `npx vitest run --project integration tests/integration/pm/image-pipeline.test.ts tests/integration/gadgets/runtime-image-delivery.test.ts` (7 module-integration scenarios, all green locally)
Inspect the new "Image delivery contract" section in `src/integrations/README.md` for accuracy
After merge: drag a screenshot to a fresh Linear test issue, trigger a planning run, confirm `.cascade/context/images/work-item-<id>-img-0.png` appears in the worker AND that the cascade run log shows `hasOffloadedContext: true` plus the new `[image-pipeline] work-item-fetch summary` INFO line with non-zero `urlsDownloaded`
After merge: monitor production webhook logs for the new diagnostic line (`grep '\[image-pipeline\]'`) — operators triaging "no image delivered" reports should now have one-line answers

Files changed

New modules (3): `src/pm/download-and-prepare.ts`, `src/gadgets/pm/core/writeRuntimeImages.ts`, plus the `IMAGE_HOST_ALLOWLIST` extension to `src/pm/media.ts`
Modified (4): `src/pm/media.ts`, `src/agents/definitions/contextSteps.ts`, `src/gadgets/pm/core/readWorkItem.ts`, `src/integrations/README.md`
New tests (6): unit tests per new module + 2 module-integration tests + Linear extraction-coverage test
Extended tests (2): `tests/unit/pm/media.test.ts` (wildcard + Linear URL coverage), `tests/unit/agents/definitions/contextSteps.test.ts` (diagnostic-log assertions)
New fixture: `tests/fixtures/linear-issue-with-screenshot.json`
Docs: `CHANGELOG.md`, `src/integrations/README.md`

🤖 Generated with Claude Code

Closes the silent screenshot-drop bug class verified live on 2026-04-26 (ucho/MNG-357): Linear's user-pasted-image URLs (uploads.linear.app/<uuid> with no file extension) were dropped at the pre-download MIME filter because mimeTypeFromUrl returned 'application/octet-stream' and filterImageMedia excluded them. This affected all engines on the disk-write path, regardless of PR #948's Claude-Code SDK delivery fix.

Three plans, safety-net-first sequencing matching spec 015:

- Plan 1 (boot-path-mime-fix-and-diagnostic-log): defers MIME authority to download response Content-Type via image/* wildcard sentinel; adds the grep-stable diagnostic log line at extract time. Independently fixes MNG-357.
- Plan 2 (runtime-gadget-image-delivery): makes the runtime cascade-tools pm read-work-item gadget actually download + write images to disk with file paths returned in text. Closes the mid-run pickup gap. Depends on Plan 1's shared download-and-prepare helper.
- Plan 3 (linear-fixture-and-extraction-coverage): captures a Linear GraphQL Issue payload fixture for an issue with a pasted screenshot; pins extraction with a regression test that fails loudly if Linear ever changes the payload shape. Mostly tests + docs.

9 ACs, 0 manual-only. CLAUDE.md not updated (already covered by spec 015's silent-failure → diagnostic-line pattern).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes the silent screenshot-drop bug class verified live on 2026-04-26 (ucho/MNG-357). Linear's extension-less pasted-image URLs (uploads.linear.app/<uuid>) now survive the pre-download MIME filter via an image/* wildcard sentinel. The download response's Content-Type header is the authoritative MIME; the wildcard is resolved before bytes are written.

What landed:

- src/pm/media.ts: new IMAGE_HOST_ALLOWLIST (currently 'uploads.linear.app'); mimeTypeFromUrl returns 'image/*' for extension-less URLs from allowlisted hosts; isImageMimeType accepts the wildcard.
- src/pm/download-and-prepare.ts (new): shared helper for the per-provider download dispatch loop (jira/linear/trello). Returns { images, failures }. Spec 016/2's runtime gadget will import this.
- src/agents/definitions/contextSteps.ts: fetchWorkItemStep refactored to use the shared helper; emits the new grep-stable diagnostic line '[image-pipeline] work-item-fetch summary' with stable fields: { provider, workItemId, urlsDetected, urlsAfterFilter, urlsDownloaded, urlsFailed, urlsByMimeType }.

Tests:

- 6 new unit tests in tests/unit/pm/media.test.ts (wildcard sentinel, Linear extension-less, regression for extensioned + non-PM URLs)
- 7 new unit tests in tests/unit/pm/download-and-prepare.test.ts
- 3 new diagnostic-log tests in contextSteps.test.ts; existing log message expectations updated to the new helper-prefix
- 3 module-integration tests in tests/integration/pm/image-pipeline.test.ts pinning the MNG-357 reproduction end-to-end with real mimeTypeFromUrl + filterImageMedia + extractMarkdownImages

PR #948's Claude-Code initial-input ImageBlockParam path is unchanged; existing regression test (claude-code.test.ts:939 'logs image injection and strips images before buildTaskPrompt') confirms.

Docs:

- CHANGELOG.md entry under Unreleased.
- src/integrations/README.md gains a new 'Image delivery contract' section documenting the shared resolution path, allowlist semantics, diagnostic log line schema, and the rule that providers shouldn't write their own MIME-detection.

Full unit suite: 8521 passed / 23 skipped / 0 failed. Lint + typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
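The diagnostic line's field set lends itself to a small builder. The sketch below assumes that urlsByMimeType is a count per resolved MIME type and that the payload is serialized as JSON after the message; neither detail is stated in the commit message, and `buildSummary`, `DownloadedImage` and `DownloadFailure` are hypothetical names.

```typescript
// Field names come from the commit message; everything else is assumed.
interface DownloadedImage { url: string; mimeType: string; }
interface DownloadFailure { url: string; reason: string; }

interface ImagePipelineSummary {
  provider: string;
  workItemId: string;
  urlsDetected: number;
  urlsAfterFilter: number;
  urlsDownloaded: number;
  urlsFailed: number;
  urlsByMimeType: Record<string, number>;
}

function buildSummary(args: {
  provider: string;
  workItemId: string;
  urlsDetected: number;
  urlsAfterFilter: number;
  images: DownloadedImage[];
  failures: DownloadFailure[];
}): ImagePipelineSummary {
  // Count successful downloads per resolved MIME type.
  const urlsByMimeType: Record<string, number> = {};
  for (const image of args.images) {
    urlsByMimeType[image.mimeType] = (urlsByMimeType[image.mimeType] ?? 0) + 1;
  }
  return {
    provider: args.provider,
    workItemId: args.workItemId,
    urlsDetected: args.urlsDetected,
    urlsAfterFilter: args.urlsAfterFilter,
    urlsDownloaded: args.images.length,
    urlsFailed: args.failures.length,
    urlsByMimeType,
  };
}

const summary = buildSummary({
  provider: "linear",
  workItemId: "MNG-357",
  urlsDetected: 3,
  urlsAfterFilter: 3,
  images: [
    { url: "https://uploads.linear.app/a", mimeType: "image/png" },
    { url: "https://uploads.linear.app/b", mimeType: "image/jpeg" },
  ],
  failures: [{ url: "https://uploads.linear.app/c", reason: "HTTP 403" }],
});

// A fixed message prefix keeps the line greppable regardless of payload.
const summaryLine = `[image-pipeline] work-item-fetch summary ${JSON.stringify(summary)}`;
console.log(summaryLine);
```

Building the payload in one place is what lets the boot path and the runtime gadget emit the same field set.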

Closes the mid-run image pickup gap from spec 016. The runtime gadget `cascade-tools pm read-work-item` now downloads any image media and writes it to .cascade/context/images/work-item-<id>-img-<index>.<ext>, returning text whose new "Local Image Files" section lists actual file paths the agent's file-read tool can consume.

What landed:

- src/gadgets/pm/core/writeRuntimeImages.ts (new): writes ContextImage arrays to .cascade/context/images/ with stable naming convention (work-item-<id>-img-<i>.<ext>); extension derived from resolved MIME; falls back to .bin + warn log for unresolved image/* sentinel.
- src/gadgets/pm/core/readWorkItem.ts: readWorkItem now calls downloadAndPrepareImages (Plan 1's helper) + writeRuntimeImages (this plan), then mutates the returned text to include the local file paths via formatRuntimeImagePaths. Same diagnostic log line '[image-pipeline] work-item-fetch summary' as the boot path. Failed downloads surface in a "Failed Image Downloads" subsection.

Tests:

- 8 new unit tests in tests/unit/gadgets/pm/core/writeRuntimeImages.test.ts
- 5 new unit tests in tests/unit/gadgets/pm/core/readWorkItem.test.ts (spec 016/2 sub-describe)
- 4 new module-integration tests in tests/integration/gadgets/runtime-image-delivery.test.ts pinning the mid-run pickup contract end-to-end.

CHANGELOG.md entry added.

Full unit suite (single-fork): 8534 passed / 23 skipped / 0 failed. Lint + typecheck clean. Three PM manifest test suites occasionally time out under parallel load on this machine; verified to pass in isolation, not a code regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
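The naming convention and the .bin fallback described above can be sketched without touching the file system. `runtimeImagePath` and `formatLocalImageFiles` are hypothetical names, the MIME-to-extension table is assumed, and the exact layout of the "Local Image Files" section is a guess; only the path pattern, the section title and the fallback rule come from the commit message.

```typescript
// Sketch of the path naming only; the real writeRuntimeImages also writes bytes to disk.
const IMAGES_DIR = ".cascade/context/images";

// Assumed table of resolved MIME types to file extensions.
const MIME_TO_EXTENSION: Record<string, string> = {
  "image/png": "png",
  "image/jpeg": "jpg",
  "image/gif": "gif",
  "image/webp": "webp",
  "image/svg+xml": "svg",
};

function runtimeImagePath(workItemId: string, index: number, mimeType: string): string {
  // An unresolved "image/*" sentinel has no table entry and falls back to .bin.
  const ext = MIME_TO_EXTENSION[mimeType] ?? "bin";
  return `${IMAGES_DIR}/work-item-${workItemId}-img-${index}.${ext}`;
}

// Assumed text layout for the section appended to the gadget's response.
function formatLocalImageFiles(paths: string[]): string {
  if (paths.length === 0) return "";
  return ["", "Local Image Files", ...paths.map((p) => `- ${p}`)].join("\n");
}

const resolvedPath = runtimeImagePath("MNG-357", 0, "image/png");
const fallbackPath = runtimeImagePath("MNG-357", 1, "image/*");
console.log(formatLocalImageFiles([resolvedPath, fallbackPath]));
```

Because the path depends only on the work-item id, the index and the resolved MIME type, re-reading the same work item yields the same file names.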

…ession

Closes spec 016 with the regression net for the contract Plans 1+2 established. If Linear ever changes its Issue payload shape in a way that loses inline images, the extraction-coverage test fails loudly with a specific URL-missing message.

What landed:

- tests/fixtures/linear-issue-with-screenshot.json (new): reconstructed Linear GraphQL Issue payload covering: extension-less uploads.linear.app URL in description, extensioned Linear URL with alt text, external URL with image/svg+xml MIME, non-image markdown link (must NOT be picked up), one comment with a pasted screenshot, one comment without, and three formal Attachment records (Slack/GitHub/Sentry link previews).
- tests/unit/pm/linear/extraction-coverage.test.ts (new): 9 tests: description coverage with explicit expected-URL list, image/* sentinel for extension-less, concrete MIME for extensioned, image/svg+xml for external SVG, non-image link exclusion, comment coverage, comment source field, attachment-NOT-leaked rule, meta-test of regression net.
- src/integrations/README.md: new "Linear: GraphQL surface for inline images" subsection documenting the conclusion: Issue.description markdown is canonical for inline-pasted screenshots; Issue.attachments is for formal Attachment records (link previews) and is the wrong surface for inline images. Links to the fixture and the test.

No production code change: Plan 1's mimeTypeFromUrl + extractMarkdownImages already cover the cases. This plan ships the regression armor.

CHANGELOG.md entry added. Lint + typecheck clean. 9/9 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
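The extraction-coverage idea (an explicit expected-URL list, plain links excluded, a failure message that names the missing URL) can be illustrated with a simplified extractor. The regular expression, `extractMarkdownImageUrls` and `assertCoverage` are assumptions for this sketch; the repository's extractMarkdownImages and its vitest-based test may work differently.

```typescript
// Simplified stand-in for markdown image extraction: matches ![alt](url "title")
// and ignores plain [text](url) links because they lack the leading "!".
function extractMarkdownImageUrls(markdown: string): string[] {
  const pattern = /!\[[^\]]*\]\(\s*([^)\s]+)[^)]*\)/g;
  const urls: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(markdown)) !== null) {
    const url = match[1];
    if (url) urls.push(url);
  }
  return urls;
}

// Fail with the specific missing URLs instead of a bare count mismatch.
function assertCoverage(expected: string[], actual: string[]): void {
  const missing = expected.filter((url) => !actual.includes(url));
  if (missing.length > 0) {
    throw new Error(`extraction lost URL(s): ${missing.join(", ")}`);
  }
}

// Made-up description body with one extension-less image, one extensioned image,
// and one ordinary link that must not be picked up.
const description = [
  "Repro steps below.",
  "![](https://uploads.linear.app/00000000-0000-4000-8000-000000000000)",
  "![login screen](https://uploads.linear.app/11111111-1111-4111-8111-111111111111/login.png)",
  "See [the docs](https://example.com/docs) for details.",
].join("\n");

const extracted = extractMarkdownImageUrls(description);
assertCoverage(
  [
    "https://uploads.linear.app/00000000-0000-4000-8000-000000000000",
    "https://uploads.linear.app/11111111-1111-4111-8111-111111111111/login.png",
  ],
  extracted,
);
console.log(extracted);
```

Listing the expected URLs explicitly means a payload change surfaces as a named missing URL, which is the "fails loudly" property the commit message describes.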

…lete

Closes the silent screenshot-drop bug class verified live on 2026-04-26 (ucho/MNG-357). Plan 1 added the Linear-extension-less MIME wildcard sentinel + diagnostic log line; plan 2 made the runtime cascade-tools pm read-work-item gadget actually deliver images on disk; plan 3 captured a Linear GraphQL fixture and pinned extraction coverage with a regression test.

CLAUDE.md untouched by this spec: already covered by spec 015's broader silent-failure → diagnostic-line pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-04-26T19:51:43Z

Codecov Report

❌ Patch coverage is 92.10526% with 18 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/gadgets/pm/core/readWorkItem.ts | 86.36% | 9 Missing ⚠️ |
| src/gadgets/pm/core/writeRuntimeImages.ts | 88.23% | 8 Missing ⚠️ |
| src/pm/download-and-prepare.ts | 98.21% | 1 Missing ⚠️ |


nhopeatall

Summary

Well-architected fix for a genuine production bug (MNG-357: Linear screenshots silently dropped). The three-plan decomposition is clean — MIME wildcard sentinel for trusted hosts, shared download helper, runtime gadget delivery, fixture-based regression net. Test coverage is thorough (50+ new tests across 6 test files + 2 integration suites). Documentation updates to integrations README and CHANGELOG are comprehensive.

Should Fix

1. `noopLogWriter` swallows per-failure WARN details in the runtime path

In readWorkItem.ts:212-216, downloadAndPrepareImages is called with a noop log writer. This discards the per-URL failure WARN logs that the boot path (contextSteps.ts) preserves via params.logWriter. The diagnostic summary line only has counts — an operator debugging why a specific URL failed at runtime has no per-URL detail in the run log (they'd need to read the agent's text output). Consider forwarding to logger.warn instead of swallowing.

2. Write failures silently disappear from agent-visible text

When writeRuntimeImages fails to write an image to disk (e.g. disk full), the failure is counted in urlsFailed in the diagnostic log (line 239), but NOT surfaced in the agent's text output — writeFailures is not passed to formatRuntimeImagePaths (line 248 only passes downloadFailures). The image disappears from both the "Local Image Files" and "Failed Image Downloads" sections, violating the spec's principle that failures should never be silently dropped from the agent's perspective.

3. `urlsDetected` field semantics mismatch with documentation

The README documents urlsDetected as "pre-filter count" and urlsAfterFilter as "post-filterImageMedia count." In practice, both code paths receive media that is already post-filter (returned by readWorkItemWithMedia which calls filterImageMedia internally). In the runtime path, both fields are always identical (media.length). In the boot path, they differ only by the MAX cap. An operator triaging a filter-drop scenario would see urlsDetected: 0 and wrongly conclude no images existed in the markdown, when actually images were present but MIME-filtered. Consider renaming the fields to match their actual semantics, or threading the true pre-filter count through from the extraction step.

🕵️ claude-code · claude-opus-4-6 · run details

nhopeatall · 2026-04-26T19:53:33Z

+			// downloadAndPrepareImages takes a LogWriter; runtime gadget
+			// already emits its own diagnostic line below, so per-failure
+			// detail is captured there.
+		};


[SHOULD_FIX] The noopLogWriter discards per-failure WARN logs that the boot path preserves (each failed download URL + error reason). The runtime diagnostic summary line (below) only has counts. Consider forwarding to logger.warn so operators can trace per-URL failures in the runtime path too, consistent with the boot path.
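One way to act on this comment is a thin adapter from the helper's LogWriter to the process logger. The `LogWriter` signature, the `reportFailures` stand-in and the in-memory `logger` below are all assumptions, since the PR excerpt does not show those types; the sketch only contrasts a swallowing writer with a forwarding one.

```typescript
// Assumed shape of the LogWriter callback; the real type may differ.
type LogLevel = "INFO" | "WARN" | "ERROR";
type LogWriter = (level: LogLevel, message: string, context?: Record<string, unknown>) => void;

interface DownloadFailure { url: string; reason: string; }

// Stand-in for the part of the shared helper that reports each failed URL.
function reportFailures(failures: DownloadFailure[], logWriter: LogWriter): void {
  for (const failure of failures) {
    logWriter("WARN", "image download failed", { url: failure.url, reason: failure.reason });
  }
}

// Minimal in-memory logger so the sketch runs without dependencies.
const warnings: string[] = [];
const logger = {
  warn(message: string, context?: Record<string, unknown>): void {
    warnings.push(`${message} ${JSON.stringify(context ?? {})}`);
  },
};

// Current behaviour per the review: per-URL detail is discarded.
const noopLogWriter: LogWriter = () => {};

// Suggested behaviour: forward WARN and ERROR entries to the logger.
const forwardingLogWriter: LogWriter = (level, message, context) => {
  if (level === "WARN" || level === "ERROR") logger.warn(message, context);
};

const failures: DownloadFailure[] = [{ url: "https://uploads.linear.app/a", reason: "HTTP 403" }];
reportFailures(failures, noopLogWriter);
const warningsAfterNoop = warnings.length;
reportFailures(failures, forwardingLogWriter);
console.log(warningsAfterNoop, warnings);
```

With the forwarding writer, the failed URL and its reason reach the run log alongside the summary counts.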

nhopeatall · 2026-04-26T19:53:33Z

+			url: f.url,
+			reason: f.reason,
+		}));
+		const augmented = text + formatRuntimeImagePaths(writePaths, downloadFailures);


[SHOULD_FIX] writeFailures (from writeRuntimeImages) are counted in the diagnostic log urlsFailed (line 239) but not surfaced in the agent text. If a disk-write fails, the image vanishes from both "Local Image Files" and "Failed Image Downloads" — the agent silently loses it. Consider surfacing write failures in the text output as well, or at minimum logging them individually.
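A minimal sketch of the suggested fix is to merge both failure lists before formatting. The `ImageFailure` shape (in particular that write failures carry a `url`) and the `formatFailedImages` helper are assumptions; the real formatRuntimeImagePaths takes the written paths as well and may lay out its text differently.

```typescript
// Assumed common shape for download and disk-write failures.
interface ImageFailure { url: string; reason: string; }

// Hypothetical formatter for the "Failed Image Downloads" subsection.
function formatFailedImages(failed: ImageFailure[]): string {
  if (failed.length === 0) return "";
  const lines = failed.map((f) => `- ${f.url}: ${f.reason}`);
  return ["", "Failed Image Downloads", ...lines].join("\n");
}

const downloadFailures: ImageFailure[] = [
  { url: "https://uploads.linear.app/a", reason: "HTTP 403" },
];
const writeFailures: ImageFailure[] = [
  { url: "https://uploads.linear.app/b", reason: "ENOSPC: no space left on device" },
];

// Behaviour flagged in the review: only download failures reach the agent.
const downloadOnlyText = formatFailedImages(downloadFailures);

// Suggested behaviour: write failures are surfaced in the same subsection.
const combinedText = formatFailedImages([...downloadFailures, ...writeFailures]);
console.log(combinedText);
```

In the combined output, every detected image ends up either under the local paths or under the failures, so nothing vanishes from the agent's view.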

nhopeatall · 2026-04-26T19:53:33Z

+			provider: provider?.type ?? 'unknown',
+			workItemId,
+			urlsDetected: media.length,
+			urlsAfterFilter: media.length, // already filtered by readWorkItemWithMedia


[SHOULD_FIX] urlsDetected and urlsAfterFilter are always identical here (both media.length). The README documents these as "pre-filter count" and "post-filterImageMedia count" respectively, but media is already filtered by readWorkItemWithMedia. In the boot path (contextSteps.ts), they differ only by the MAX cap, not by MIME filtering. The field names are misleading — an operator reading them would wrongly infer that MIME filtering happened between these two values.
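Threading the true pre-filter count through from extraction could look like the sketch below. `extractAndFilter`, `ExtractionResult` and the `MediaRef` shape are hypothetical; in the PR the filtering happens inside readWorkItemWithMedia, whose return type is not shown here.

```typescript
interface MediaRef { url: string; mimeType: string; }

// Hypothetical result shape: the filtered media plus the count before filtering.
interface ExtractionResult {
  media: MediaRef[];
  detectedCount: number;
}

function extractAndFilter(allMedia: MediaRef[]): ExtractionResult {
  return {
    media: allMedia.filter((m) => m.mimeType.startsWith("image/")),
    detectedCount: allMedia.length, // recorded before the MIME filter runs
  };
}

const extraction = extractAndFilter([
  { url: "https://uploads.linear.app/a", mimeType: "application/octet-stream" },
  { url: "https://example.com/diagram.png", mimeType: "image/png" },
]);

// The two fields now describe different stages, matching the README wording.
const counts = {
  urlsDetected: extraction.detectedCount,
  urlsAfterFilter: extraction.media.length,
};
console.log(counts);
```

With this shape, a filter-drop incident shows up as urlsDetected greater than urlsAfterFilter instead of both fields reading the same value.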

nhopeatall · 2026-04-26T19:53:34Z

+		// Repo-relative path is what we return to the caller for inclusion in
+		// the agent's text response — the agent's Read tool consumes paths
+		// relative to its workspace root.
+		const relativePath = repoDir


[NITPICK] Dead ternary — both branches produce the identical string ${DEFAULT_CONTEXT_IMAGES_RELATIVE}/${filename}. The repoDir check has no effect on relativePath.
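The nitpick is easy to see in isolation. The constant's value and both function names below are assumptions (the diff excerpt stops before the ternary's branches); the sketch only shows that a conditional with identical branches can be replaced by its single result.

```typescript
// Assumed value, based on the image directory named elsewhere in this PR.
const DEFAULT_CONTEXT_IMAGES_RELATIVE = ".cascade/context/images";

// Shape described in the review: both branches build the same string.
function relativePathWithTernary(repoDir: string | undefined, filename: string): string {
  return repoDir
    ? `${DEFAULT_CONTEXT_IMAGES_RELATIVE}/${filename}`
    : `${DEFAULT_CONTEXT_IMAGES_RELATIVE}/${filename}`;
}

// Equivalent simplification: repoDir no longer participates.
function relativePathSimplified(filename: string): string {
  return `${DEFAULT_CONTEXT_IMAGES_RELATIVE}/${filename}`;
}

const filename = "work-item-MNG-357-img-0.png";
console.log(
  relativePathWithTernary("/workspace/repo", filename) === relativePathSimplified(filename),
  relativePathWithTernary(undefined, filename) === relativePathSimplified(filename),
);
```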

nhopeatall · 2026-04-26T19:53:34Z

+		// `downloadAndPrepareImages` helper so the runtime gadget (spec 016/2)
+		// uses the same code path.
+		const { downloadAndPrepareImages } = await import('../../pm/download-and-prepare.js');
+		const limited = media.slice(0, MAX_IMAGES_PER_WORK_ITEM);


[NITPICK] media is sliced to MAX_IMAGES_PER_WORK_ITEM here, then downloadAndPrepareImages slices again internally (line 46 of download-and-prepare.ts). The double-slice is harmless but redundant — consider removing one.
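Slicing to the same cap twice is idempotent, which is why the redundancy is harmless. In the sketch below the cap value of 3 and the `downloadAndPrepareSketch` stand-in are assumptions; the real MAX_IMAGES_PER_WORK_ITEM value is not given in this PR excerpt.

```typescript
const MAX_IMAGES_PER_WORK_ITEM = 3; // assumed value for illustration

// Stand-in for the helper's internal cap; the real helper also downloads.
function downloadAndPrepareSketch<T>(items: T[]): T[] {
  return items.slice(0, MAX_IMAGES_PER_WORK_ITEM);
}

const candidateUrls = ["a", "b", "c", "d", "e"];

// Caller slices, then the helper slices again (the pattern in the diff).
const slicedTwice = downloadAndPrepareSketch(candidateUrls.slice(0, MAX_IMAGES_PER_WORK_ITEM));

// Caller passes everything and lets the helper own the cap.
const slicedOnce = downloadAndPrepareSketch(candidateUrls);

console.log(slicedTwice, slicedOnce);
```

Letting the helper own the cap keeps the limit in one place. If the caller still needs the capped list for its own counts (for example urlsAfterFilter), keeping the outer slice and dropping the inner one is the other option.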

zbigniewsobiecki · 2026-04-26T22:03:33Z

@aaight address code review concerns

The diagnostic-line assertion expected urlsDetected on the log payload, but the mocked readWorkItemWithMedia return values omitted it, so the field arrived as undefined and the toHaveBeenCalledWith match failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(triggers): audit & fix PM feedback inconsistencies across respond-to-* agents (#1201) * fix(triggers): audit & fix PM feedback inconsistencies across respond-to-* agents * fix(triggers): use case-insensitive JIRA status comparison in isInPlanningStatus Match the established pattern from status-changed.ts and label-added.ts which both use .toLowerCase() for JIRA status comparisons, since status names are user-configurable and the API does not guarantee consistent casing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Cascade Bot <bot@cascade.dev> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(linear): populate inlineMedia from descriptions/comments and add downloadAttachment (#1202) Co-authored-by: Cascade Bot <bot@cascade.dev> * fix(claude-code): pin pathToClaudeCodeExecutable so SDK skips broken native-binary probe (#1206) The agent-harness SDK bump in #1197 (claude-agent-sdk 0.2.91 → 0.2.119) broke every review run on cascade-prod with: ReferenceError: Claude Code native binary not found at /app/node_modules/@anthropic-ai/claude-agent-sdk-linux-x64-musl/claude The new SDK probes its own platform-specific optional-dependency subpackages for a bundled `claude` binary. Two failure modes hit at once: 1. Cascade installs `@anthropic-ai/claude-code@2.1.119` globally at /usr/local/bin/claude — the SDK never looks there. 2. The SDK probes the `-musl` variant first regardless of host libc and errors on ENOENT instead of falling through to the glibc variant. Pass an explicit `pathToClaudeCodeExecutable` to short-circuit the probe. The resolver checks (in order): - $CLAUDE_CODE_EXECUTABLE_PATH env override (local-dev escape hatch) - `which claude` in $PATH - /usr/local/bin/claude (Docker default from Dockerfile.worker) Two TDD tests pin the option onto query() and prove the env override wins. No Dockerfile change needed; the existing global install at /usr/local/bin/claude becomes the resolver's runtime target. 
Confirmed broken on ucho PR #72 (cascade-prod review agent crash). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * spec 015: router job dispatch failure recovery (#1203) * docs(spec/plans): add spec 015 + plans for router dispatch failure recovery Spec captures the silent black-hole bug class verified live on 2026-04-26 (ucho/MNG-350): a transient capacity miss or Docker error during worker spawn turns a webhook-driven job into a permanently failed BullMQ entry while stranding the work-item / agent-type locks for up to 30 minutes, silently rejecting subsequent webhooks for the same work item. Decomposed into two plans with safety-net-first sequencing: plan 1 hooks the BullMQ failed event to release locks on every dispatch failure path; plan 2 replaces the throw-on-capacity with a wait-for-slot semaphore, adds bounded retry with exponential backoff, and a transient/terminal error classifier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(plan): lock 015/1 failed-event-lock-compensation Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(router): plan 015/1 done — release locks on dispatch failure Closes the stranded-lock half of spec 015's bug class verified live in prod on 2026-04-26 (ucho/MNG-350). When a webhook-driven job's dispatch fails — capacity throw, Docker spawn error, or any future throw site — the work-item lock, agent-type concurrency counter, and recently-dispatched dedup mark established by the webhook → enqueue path are now released by a compensator hooked to BullMQ's `worker.on('failed')` event. What landed: - `src/router/dispatch-compensator.ts` (new) — `releaseLocksForFailedJob` wraps `extractProjectIdFromJob` / `extractWorkItemId` / `extractAgentType` and calls into `clearWorkItemEnqueued` / `clearAgentTypeEnqueued` / `clearRecentlyDispatched`. Never propagates errors; captures to Sentry with `tags: { source: 'dispatch_compensator' }`. 
- `src/router/agent-type-lock.ts` — exports new `clearRecentlyDispatched` for the compensator. The existing `markRecentlyDispatched` semantics are unchanged (60s TTL, NOT cleared on completion); this helper exists solely so a permanently-failed dispatch doesn't keep deduping a fresh webhook for ~60s while the user retries. - `src/router/bullmq-workers.ts` — extends the existing `worker.on('failed')` handler to invoke `releaseLocksForFailedJob` alongside the existing logger + Sentry calls. Wraps the call in a defensive `.catch` so a future regression in the compensator can't poison the worker. - `src/router/lock-state-classifier.ts` (new) — `classifyLockState` returns `'awaiting-slot'` when an active worker or queued/waiting job matches the trio, `'wedged'` when neither correlation matches. Defaults to `'awaiting-slot'` on classifier error so a Redis blip doesn't mis-emit the wedged canary. - `src/router/active-workers.ts` — `getActiveWorkers()` now exposes `(projectId, workItemId, agentType)` so the classifier can correlate. Backwards-compatible (existing callers work unchanged; new fields are additive optional). - `src/router/webhook-processor.ts` — Step 8 (work-item lock check) now splits the decision-reason vocabulary into three states: * `Job queued: ...` (success path) * `Awaiting worker slot: ...` (lock held + dispatch in flight; healthy) * `Work item locked (no active dispatch): ...` (wedged-lock canary) The wedged branch additionally fires `captureException` with `tags: { source: 'wedged_lock_canary' }` so any regression in compensation is loud in production. What this does NOT change (intentional, all in plan 015/2): - `guardedSpawn` still throws on capacity (BullMQ marks the job failed, the compensator now releases the locks, but the job itself is still lost). Plan 2 replaces the throw with a wait-for-slot semaphore. - Both queues still default to `attempts: 1`. Plan 2 raises this with exponential backoff and adds a transient/terminal error classifier. 
- CLAUDE.md is intentionally not updated by this plan — the unified passage describing both halves of the new contract lands in plan 015/2. Tests: - 5 new unit tests in `dispatch-compensator.test.ts` - 3 new unit tests in `agent-type-lock.test.ts` for `clearRecentlyDispatched` - 4 new unit tests in `bullmq-workers.test.ts` for the failed-event seam - 5 new unit tests in `lock-state-classifier.test.ts` - 2 new unit tests in `active-workers.test.ts` for the extended shape - 4 new unit tests in `webhook-processor.test.ts` for the three-way taxonomy - 3 new module-integration tests in `tests/integration/router/dispatch-failure-compensation.test.ts` exercise the real lock modules + real bullmq-workers.ts failed-event handler + real compensator end-to-end (only BullMQ's Worker constructor + the worker-env extractors are mocked). Full suite: 8515 passed / 23 skipped / 0 failed. Lint + typecheck clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(plan): lock 015/2 wait-for-slot-and-retry-classifier Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(plan): mark 015/2 status: wip * feat(router): plan 015/2 done — wait-for-slot + retry budget + classifier Closes the lost-job half of spec 015's bug class. Combined with plan 015/1, the silent black-hole failure mode verified live in prod on 2026-04-26 (ucho/MNG-350) is now fully closed. What landed: - `src/router/slot-waiter.ts` (new) — semaphore-style primitive: `acquireSlot({ timeoutMs })` resolves immediately when capacity is below `routerConfig.maxWorkers`, otherwise queues a FIFO waiter with a bounded timeout that rejects with `code: 'SLOT_WAIT_TIMEOUT'`. `slotReleased()` pops the head waiter; `clearAllWaiters()` rejects every pending waiter with `code: 'SHUTDOWN'` on router stop. 
- `src/router/dispatch-error-classifier.ts` (new) — classifies thrown errors into `'transient'` (Docker socket Node codes, HTTP 429/409, SLOT_WAIT_TIMEOUT, anything unknown — default-to-retry) vs `'terminal'` (TypeError, ZodError, image-not-found-after-fallback). - `src/router/worker-manager.ts` — `guardedSpawn` rewritten: `await acquireSlot(...)` replaces the synchronous capacity throw; on spawn error, terminal errors are wrapped in BullMQ's `UnrecoverableError` so retries skip; transient errors propagate unchanged so BullMQ retries via attempts/backoff. - `src/router/active-workers.ts` — `cleanupWorker` now calls `slotReleased()` exactly once per cleanup, including on the crash path. The existing `if (worker)` guard ensures idempotence. - `src/router/config.ts` — new `slotWaitTimeoutMs` field (default 5min, configurable via `SLOT_WAIT_TIMEOUT_MS`). - `src/router/queue.ts` and `src/queue/client.ts` — both queues now default to `attempts: 4` with `backoff: { type: 'exponential', delay: 5000 }` (~75s total before exhaustion). Terminal errors bypass via `UnrecoverableError`. - `src/router/container-manager.ts` — exports the existing `isImageNotFoundError` predicate so the classifier can reuse it. Test contract change (spec AC #9): The previous `tests/unit/router/worker-manager.test.ts:179` assertion `'processFn throws when at capacity'` is REPLACED (not deleted) with `'processFn awaits a slot when at capacity, then dispatches when one frees'`. The throw-on-capacity contract is gone forever. 
Tests: - 7 new unit tests in `slot-waiter.test.ts` (FIFO, timeout, no-op, shutdown rejection) - 11 new unit tests in `dispatch-error-classifier.test.ts` covering every transient/terminal class - 4 new unit tests in `worker-manager.test.ts` (replaced original capacity-throw test + 3 for retry classification) - 3 new unit tests in `active-workers.test.ts` for slotReleased integration - 5 new module-integration tests in `dispatch-retry.test.ts` exercise REAL guardedSpawn + REAL slot-waiter + REAL dispatch-error-classifier against both queues, mocking only spawnWorker + BullMQ Worker constructor. Plan 1's 3 module-integration tests continue to pass alongside plan 2's 5. Full unit suite: 8539 passed / 23 skipped / 0 failed. CLAUDE.md updated with a new "Dispatch failure semantics" section documenting the unified contract (capacity wait, retry budget, classifier, three-way decision-reason taxonomy from plan 1, wedged-lock canary). File now 182 lines, under the 200-line cap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(spec): 015 done — router job dispatch failure recovery, all plans complete Closes the silent black-hole bug class verified live on 2026-04-26 (ucho/MNG-350). Plan 1 added failed-event lock compensation + three-way decision-reason taxonomy; plan 2 replaced the throw-on-capacity with wait-for-slot, added bounded retry with exponential backoff, and introduced a transient/terminal error classifier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(deps): bump postcss from 8.5.8 to 8.5.12 (#1204) Bumps [postcss](https://github.com/postcss/postcss) from 8.5.8 to 8.5.12. 
- [Release notes](https://github.com/postcss/postcss/releases) - [Changelog](https://github.com/postcss/postcss/blob/main/CHANGELOG.md) - [Commits](postcss/postcss@8.5.8...8.5.12) --- updated-dependencies: - dependency-name: postcss dependency-version: 8.5.12 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * chore(deps): bump uuid, bullmq and dockerode (#1192) Removes [uuid](https://github.com/uuidjs/uuid). It's no longer used after updating ancestor dependencies [uuid](https://github.com/uuidjs/uuid), [bullmq](https://github.com/taskforcesh/bullmq) and [dockerode](https://github.com/apocas/dockerode). These dependencies need to be updated together. Removes `uuid` Updates `bullmq` from 5.72.0 to 5.76.2 - [Release notes](https://github.com/taskforcesh/bullmq/releases) - [Commits](taskforcesh/bullmq@v5.72.0...v5.76.2) Updates `dockerode` from 4.0.10 to 5.0.0 - [Release notes](https://github.com/apocas/dockerode/releases) - [Commits](apocas/dockerode@v4.0.10...v5.0.0) --- updated-dependencies: - dependency-name: uuid dependency-version: dependency-type: indirect - dependency-name: bullmq dependency-version: 5.76.2 dependency-type: direct:production - dependency-name: dockerode dependency-version: 5.0.0 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * spec 016: PM image delivery reliability (#1209) * docs(spec/plans): add spec 016 + plans for PM image delivery reliability Closes the silent screenshot-drop bug class verified live on 2026-04-26 (ucho/MNG-357): Linear's user-pasted-image URLs (uploads.linear.app/<uuid> with no file extension) were dropped at the pre-download MIME filter because mimeTypeFromUrl returned 'application/octet-stream' and filterImageMedia excluded them. 
This affected all engines on the disk-write path, regardless of PR #948's Claude-Code SDK delivery fix. Three plans, safety-net-first sequencing matching spec 015: - Plan 1 (boot-path-mime-fix-and-diagnostic-log): defers MIME authority to download response Content-Type via image/* wildcard sentinel; adds the grep-stable diagnostic log line at extract time. Independently fixes MNG-357. - Plan 2 (runtime-gadget-image-delivery): makes the runtime cascade-tools pm read-work-item gadget actually download + write images to disk with file paths returned in text. Closes the mid-run pickup gap. Depends on Plan 1's shared download-and-prepare helper. - Plan 3 (linear-fixture-and-extraction-coverage): captures a Linear GraphQL Issue payload fixture for an issue with a pasted screenshot; pins extraction with a regression test that fails loudly if Linear ever changes the payload shape. Mostly tests + docs. 9 ACs, 0 manual-only. CLAUDE.md not updated (already covered by spec 015's silent-failure → diagnostic-line pattern). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(plan): lock 016/1 boot-path-mime-fix-and-diagnostic-log * feat(pm): plan 016/1 done — boot-path image MIME fix + diagnostic log Closes the silent screenshot-drop bug class verified live on 2026-04-26 (ucho/MNG-357). Linear's extension-less pasted-image URLs (uploads.linear.app/<uuid>) now survive the pre-download MIME filter via an image/* wildcard sentinel. The download response's Content-Type header is the authoritative MIME — wildcard is resolved before bytes are written. What landed: - src/pm/media.ts — new IMAGE_HOST_ALLOWLIST (currently 'uploads.linear.app'); mimeTypeFromUrl returns 'image/*' for extension-less URLs from allowlisted hosts; isImageMimeType accepts the wildcard. - src/pm/download-and-prepare.ts (new) — shared helper for the per-provider download dispatch loop (jira/linear/trello). Returns { images, failures }. Spec 016/2's runtime gadget will import this. 
- src/agents/definitions/contextSteps.ts: fetchWorkItemStep refactored to use the shared helper; emits the new grep-stable diagnostic line '[image-pipeline] work-item-fetch summary' with stable fields: { provider, workItemId, urlsDetected, urlsAfterFilter, urlsDownloaded, urlsFailed, urlsByMimeType }.

Tests:

- 6 new unit tests in tests/unit/pm/media.test.ts (wildcard sentinel, Linear extension-less, regression for extensioned + non-PM URLs)
- 7 new unit tests in tests/unit/pm/download-and-prepare.test.ts
- 3 new diagnostic-log tests in contextSteps.test.ts; existing log message expectations updated to the new helper-prefix
- 3 module-integration tests in tests/integration/pm/image-pipeline.test.ts pinning the MNG-357 reproduction end-to-end with real mimeTypeFromUrl + filterImageMedia + extractMarkdownImages

PR #948's Claude-Code initial-input ImageBlockParam path is unchanged; existing regression test (claude-code.test.ts:939 'logs image injection and strips images before buildTaskPrompt') confirms.

Docs:

- CHANGELOG.md entry under Unreleased.
- src/integrations/README.md gains a new 'Image delivery contract' section documenting the shared resolution path, allowlist semantics, diagnostic log line schema, and the rule that providers shouldn't write their own MIME-detection.

Full unit suite: 8521 passed / 23 skipped / 0 failed. Lint + typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): lock 016/2 runtime-gadget-image-delivery

* feat(pm): plan 016/2 done - runtime gadget delivers images on disk

Closes the mid-run image pickup gap from spec 016. The runtime gadget `cascade-tools pm read-work-item` now downloads any image media and writes it to .cascade/context/images/work-item-<id>-img-<index>.<ext>, returning text whose new "Local Image Files" section lists actual file paths the agent's file-read tool can consume.
What landed:

- src/gadgets/pm/core/writeRuntimeImages.ts (new): writes ContextImage arrays to .cascade/context/images/ with stable naming convention (work-item-<id>-img-<i>.<ext>); extension derived from resolved MIME; falls back to .bin + warn log for unresolved image/* sentinel.
- src/gadgets/pm/core/readWorkItem.ts: readWorkItem now calls downloadAndPrepareImages (Plan 1's helper) + writeRuntimeImages (this plan), then mutates the returned text to include the local file paths via formatRuntimeImagePaths. Same diagnostic log line '[image-pipeline] work-item-fetch summary' as the boot path. Failed downloads surface in a "Failed Image Downloads" subsection.

Tests:

- 8 new unit tests in tests/unit/gadgets/pm/core/writeRuntimeImages.test.ts
- 5 new unit tests in tests/unit/gadgets/pm/core/readWorkItem.test.ts (spec 016/2 sub-describe)
- 4 new module-integration tests in tests/integration/gadgets/runtime-image-delivery.test.ts pinning the mid-run pickup contract end-to-end.

CHANGELOG.md entry added. Full unit suite (single-fork): 8534 passed / 23 skipped / 0 failed. Lint + typecheck clean. Three PM manifest test suites occasionally time out under parallel load on this machine (verified to pass in isolation; not a code regression).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): lock 016/3 linear-fixture-and-extraction-coverage

* test(pm): plan 016/3 done - Linear fixture + extraction-coverage regression

Closes spec 016 with the regression net for the contract Plans 1+2 established. If Linear ever changes its Issue payload shape in a way that loses inline images, the extraction-coverage test fails loudly with a specific URL-missing message.
What landed:

- tests/fixtures/linear-issue-with-screenshot.json (new): reconstructed Linear GraphQL Issue payload covering: extension-less uploads.linear.app URL in description, extensioned Linear URL with alt text, external URL with image/svg+xml MIME, non-image markdown link (must NOT be picked up), one comment with a pasted screenshot, one comment without, and three formal Attachment records (Slack/GitHub/Sentry link previews).
- tests/unit/pm/linear/extraction-coverage.test.ts (new): 9 tests: description coverage with explicit expected-URL list, image/* sentinel for extension-less, concrete MIME for extensioned, image/svg+xml for external SVG, non-image link exclusion, comment coverage, comment source field, attachment-NOT-leaked rule, meta-test of regression net.
- src/integrations/README.md: new "Linear: GraphQL surface for inline images" subsection documenting the conclusion: Issue.description markdown is canonical for inline-pasted screenshots; Issue.attachments is for formal Attachment records (link previews) and is the wrong surface for inline images. Links to the fixture and the test.

No production code change: Plan 1's mimeTypeFromUrl + extractMarkdownImages already cover the cases. This plan ships the regression armor.

CHANGELOG.md entry added. Lint + typecheck clean. 9/9 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(spec): 016 done - pm-image-delivery-reliability, all plans complete

Closes the silent screenshot-drop bug class verified live on 2026-04-26 (ucho/MNG-357). Plan 1 added the Linear-extension-less MIME wildcard sentinel + diagnostic log line; plan 2 made the runtime cascade-tools pm read-work-item gadget actually deliver images on disk; plan 3 captured a Linear GraphQL fixture and pinned extraction coverage with a regression test.

CLAUDE.md untouched by this spec (already covered by spec 015's broader silent-failure → diagnostic-line pattern).
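The wildcard-sentinel flow described in these commit messages could look roughly like the sketch below. The names `mimeTypeFromUrl`, `isImageMimeType`, and `IMAGE_HOST_ALLOWLIST` come from the commit text, but their bodies here are assumptions, and `resolveMime`, `runtimeImageFileName`, and both lookup tables are hypothetical helpers introduced only for illustration; the real code in src/pm/media.ts and writeRuntimeImages.ts may differ.

```typescript
// Sketch only: function bodies are assumptions, not the repository's code.
const IMAGE_HOST_ALLOWLIST = new Set<string>(["uploads.linear.app"]);
const IMAGE_WILDCARD = "image/*";
const FALLBACK_MIME = "application/octet-stream";

// Hypothetical extension table; the real mapping is not shown in the PR text.
const EXTENSION_TO_MIME: Record<string, string> = {
  png: "image/png",
  jpg: "image/jpeg",
  jpeg: "image/jpeg",
  gif: "image/gif",
  webp: "image/webp",
  svg: "image/svg+xml",
};

// Hypothetical reverse table used when naming files on disk.
const MIME_TO_EXTENSION: Record<string, string> = {
  "image/png": "png",
  "image/jpeg": "jpg",
  "image/gif": "gif",
  "image/webp": "webp",
  "image/svg+xml": "svg",
};

// Guess a MIME type from the URL alone. Extension-less URLs on an allowlisted
// host get the wildcard sentinel instead of application/octet-stream.
function mimeTypeFromUrl(rawUrl: string): string {
  const url = new URL(rawUrl);
  const lastSegment = url.pathname.split("/").pop() ?? "";
  const dot = lastSegment.lastIndexOf(".");
  if (dot > 0) {
    const ext = lastSegment.slice(dot + 1).toLowerCase();
    return EXTENSION_TO_MIME[ext] ?? FALLBACK_MIME;
  }
  return IMAGE_HOST_ALLOWLIST.has(url.hostname) ? IMAGE_WILDCARD : FALLBACK_MIME;
}

// The pre-download filter must let the sentinel through.
function isImageMimeType(mime: string): boolean {
  return mime.startsWith("image/"); // also true for "image/*"
}

// Hypothetical: after download, the response Content-Type header wins over the
// URL-derived hint whenever it names a concrete image type.
function resolveMime(hinted: string, contentType: string | null): string {
  const fromHeader = contentType?.split(";")[0].trim().toLowerCase();
  if (fromHeader && fromHeader !== IMAGE_WILDCARD && fromHeader.startsWith("image/")) {
    return fromHeader;
  }
  return hinted;
}

// Hypothetical: build work-item-<id>-img-<index>.<ext>, using .bin when the
// MIME is still the unresolved sentinel (the real helper also logs a warning).
function runtimeImageFileName(workItemId: string, index: number, mime: string): string {
  const ext = MIME_TO_EXTENSION[mime] ?? "bin";
  return `work-item-${workItemId}-img-${index}.${ext}`;
}

const hinted = mimeTypeFromUrl("https://uploads.linear.app/3f2b"); // "image/*"
const resolved = resolveMime(hinted, "image/png; charset=binary"); // "image/png"
console.log(isImageMimeType(hinted), runtimeImageFileName("MNG-357", 0, resolved));
```

The point of the sentinel in this sketch is ordering: the URL-based guess only has to be good enough to pass the filter, and the concrete type is decided once the response header is available.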
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address code review concerns

* test(image-pipeline): supply urlsDetected in readWorkItemWithMedia mocks

The diagnostic-line assertion expected urlsDetected on the log payload, but the mocked readWorkItemWithMedia return values omitted it, so the field arrived as undefined and the toHaveBeenCalledWith match failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cascade Bot <bot@cascade.dev>

* fix(linear): drop comment-mention planning-state gate that prod payload never satisfies (#1210)

PR #1201 added a `currentStateId !== planningStateId` gate to the Linear comment @mention trigger that read `data.issue.stateId` from the webhook payload. Linear's Comment webhook does not ship `stateId` on the nested issue (verified across four prod payloads on 2026-04-26: 8cd0108a, b93e4925, 6548cd14, 3d95b210). The gate therefore always evaluated to true and silently dropped every legitimate bot @mention, including the one on MNG-346 that motivated this fix.

The agent (respond-to-planning-comment) is now responsible for any planning-only behavior; the trigger no longer gates on state and avoids an extra Linear GraphQL round-trip per comment.

Also corrects `LinearWebhookCommentTriggerData.issue` to match what Linear actually ships (six keys, no `stateId`, optional `team`); the old type lied and PR #1201 trusted it. Tests pin a real prod-shape Comment payload as a regression.

JIRA's equivalent gate is unaffected (its `comment_created` payload does ship `issue.fields.status.name`).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: aaight <aaight42@gmail.com>
Co-authored-by: Cascade Bot <bot@cascade.dev>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
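To see why the `currentStateId !== planningStateId` gate described in the #1210 commit message dropped every comment, the following minimal sketch may help. The `CommentIssueSketch` type and `oldGateDropsComment` function are hypothetical and heavily trimmed; they are not the trigger's real code, and the only property carried over from the commit text is that the nested issue arrives without `stateId`.

```typescript
// Hypothetical, trimmed stand-in for the nested issue on a Linear Comment
// webhook. The optional stateId mirrors the old (incorrect) type assumption.
interface CommentIssueSketch {
  id: string;
  stateId?: string;
}

// Hypothetical reconstruction of the old gate: returns true when the trigger
// would drop the comment.
function oldGateDropsComment(issue: CommentIssueSketch, planningStateId: string): boolean {
  const currentStateId = issue.stateId; // undefined on real Comment payloads
  return currentStateId !== planningStateId; // undefined !== "..." is always true
}

// A prod-shaped payload has no stateId, so the comment is dropped even when
// the issue really is in the planning state.
console.log(oldGateDropsComment({ id: "issue-1" }, "planning-state-id")); // true
```

Because a missing field compares unequal to any configured state ID, the gate could never pass on real payloads, which matches the commit's decision to remove it from the trigger.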

zbigniewsobiecki and others added 8 commits April 26, 2026 19:45

chore(plan): lock 016/1 boot-path-mime-fix-and-diagnostic-log

1a97d88

chore(plan): lock 016/2 runtime-gadget-image-delivery

cee6e7a

chore(plan): lock 016/3 linear-fixture-and-extraction-coverage

9ba8b4c

zbigniewsobiecki requested a review from nhopeatall April 26, 2026 19:47

zbigniewsobiecki force-pushed the feat/spec-016-pm-image-delivery branch from f66d17c to 03d4901 Compare April 26, 2026 19:47

nhopeatall approved these changes Apr 26, 2026

Cascade Bot and others added 2 commits April 26, 2026 22:13

fix: address code review concerns

7a913b2

zbigniewsobiecki merged commit 1ced61b into dev Apr 26, 2026
9 checks passed

zbigniewsobiecki merged 10 commits into dev from feat/spec-016-pm-image-delivery

nhopeatall left a comment

Should Fix

1. `noopLogWriter` swallows per-failure WARN details in the runtime path

2. Write failures silently disappear from agent-visible text

3. `urlsDetected` field semantics mismatch with documentation
