fix: improve PostgreSQL startup reliability and agent documentation by zbigniewsobiecki · Pull Request #9 · mongrel-intelligence/cascade

zbigniewsobiecki · 2026-01-01T20:19:09Z

Summary

This PR improves PostgreSQL handling in the agent execution environment:

- Fail-fast startup: Agent sessions will not start if PostgreSQL fails to start, preventing wasted iterations troubleshooting database connectivity
- Proper authentication: PostgreSQL configured with password authentication (postgres:postgres)
- Agent documentation: Environment prompt now includes complete PostgreSQL management commands

Changes

`src/agents/utils/setup.ts`

Added robust error handling to startPostgres():
- Checks if PostgreSQL is already running before starting
- Uses -w flag for synchronous startup (waits until ready)
- Throws descriptive error if startup fails
- Verifies PostgreSQL is actually running after start
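
The four steps above can be sketched as follows. This is a minimal illustration, not the code in `src/agents/utils/setup.ts`: the injectable `Runner` type, the `defaultRunner` helper, and the fallback data directory are assumptions made so the logic can be exercised without a real PostgreSQL install. It relies on `pg_ctl status` exiting with 0 when the server is running and on `-w` making `pg_ctl start` wait for readiness.

```typescript
import { spawnSync } from "node:child_process";

// Runs a command and returns its exit code. Injectable (an assumption of this
// sketch) so the control flow can be tested without PostgreSQL installed.
type Runner = (cmd: string, args: string[]) => number;

const defaultRunner: Runner = (cmd, args) =>
  spawnSync(cmd, args, { stdio: "ignore" }).status ?? 1;

// Hypothetical fallback; the real data directory is defined by the Dockerfile.
const PGDATA = process.env.PGDATA ?? "/var/lib/postgresql/data";

function startPostgres(run: Runner = defaultRunner): void {
  // `pg_ctl status` exits with 0 only when the server is running.
  const isRunning = (): boolean => run("pg_ctl", ["status", "-D", PGDATA]) === 0;

  // 1. Already running: nothing to do.
  if (isRunning()) return;

  // 2. Start synchronously; -w waits until the server is ready or times out.
  const code = run("pg_ctl", ["start", "-D", PGDATA, "-w"]);

  // 3. Fail fast with a descriptive error.
  if (code !== 0) {
    throw new Error(`PostgreSQL failed to start (pg_ctl exit code ${code})`);
  }

  // 4. Verify the server is actually up after the start command returned.
  if (!isRunning()) {
    throw new Error("pg_ctl start returned success but PostgreSQL is not running");
  }
}
```

Throwing here, instead of logging and continuing, is what lets the caller abort the agent session before any iterations are spent on a broken database.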

`Dockerfile`

Configured pg_hba.conf for proper authentication:
- Local socket connections use trust (for admin tasks)
- TCP connections require md5 password authentication
- Sets the password for user postgres to postgres
Connection string: postgresql://postgres:postgres@localhost:5432/postgres
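
Because TCP connections go through the md5 rule, a client connecting via `localhost` has to supply the password carried in the connection string. As a small illustration (the `connection` object shape is an assumption, not something defined by this PR), the string can be split into its parts with the standard `URL` parser:

```typescript
// Connection string from the PR description.
const DATABASE_URL = "postgresql://postgres:postgres@localhost:5432/postgres";

// The WHATWG URL parser handles the postgresql:// scheme like any other
// scheme with an authority section.
const parsed = new URL(DATABASE_URL);

// Illustrative shape only; a real client library defines its own options.
const connection = {
  user: parsed.username,
  password: parsed.password,
  host: parsed.hostname,
  port: Number(parsed.port),
  database: parsed.pathname.slice(1), // drop the leading "/"
};

console.log(connection);
```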

`src/agents/prompts/templates/partials/environment.eta`

Added complete PostgreSQL documentation for agents:
- Connection string
- CLI access command
- Start/stop/status commands using pg_ctl
- Database creation example

Context

Analysis of a failed agent session revealed the agent wasted 11+ iterations trying to start PostgreSQL using incorrect commands (systemctl, brew services) that don't work in the container environment. This PR ensures:

- Agents know the correct commands to manage PostgreSQL
- Sessions fail immediately if PostgreSQL can't start

Test Plan

- Local tests pass
- CI tests pass
- Docker build succeeds with new PostgreSQL configuration

🤖 Generated with Claude Code

- Add robust error handling to startPostgres() that fails fast if PostgreSQL cannot start, preventing agent sessions from proceeding with a broken database
- Configure PostgreSQL in Dockerfile with password authentication:
  - User: postgres, Password: postgres
  - Connection string: postgresql://postgres:postgres@localhost:5432/postgres
  - Local socket uses trust, TCP connections require md5 password
- Update agent environment prompt with complete PostgreSQL documentation:
  - Connection string and CLI access
  - Start/stop/status commands using pg_ctl
  - Database creation example

This addresses issues where agents would waste iterations trying to troubleshoot PostgreSQL connectivity using incorrect commands (systemctl, brew services) that don't work in the container environment.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs(plans): add spec 003 + plan; lock plan 003/1

  Spec 003 introduces PM status mapping parity across Linear and JIRA. Plan 1 (status-parity) locked for execution.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(plans): plan 003/1 frontmatter status -> wip

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(pm): status mapping parity across Linear and JIRA

  Plan 003/1 (status-parity). Linear PM wizard Field Mapping step now exposes all 8 CASCADE stages (backlog, splitting, planning, todo, inProgress, inReview, done, merged) in lifecycle order instead of only 4. Linear operators can now map workflow states to splitting/planning/todo/merged and have the corresponding agents dispatch on issue transitions.

  JIRA's silent-drop bug in resolveLifecycleConfig fixed: splitting/planning/todo mappings the JIRA wizard already accepted now surface through to PMLifecycleManager and GitHub PR triggers. No operator action required.

  Canonical ProjectPMConfig.statuses widens to declare the full 9-stage vocabulary (including debug, reserved for future trigger), so providers can no longer silently drift from the trigger layer.

  Existing Linear integrations upgrade in place: new slots render as 'not set' on next wizard visit. No migration.

  Tests: 9 new unit tests (type shape + Linear + JIRA integration + SSR wizard). Integration coverage for spec ACs #8/#9 provided by existing linear-status-changed and jira-status-changed trigger handler tests; it was discovered during Phase 4 that handlers read provider-specific config directly (not resolveLifecycleConfig), so dispatch was never blocked for handlers; the drop was in downstream PMLifecycleManager callers. Totals: 7597 unit + 522 integration all green. Lint + typecheck clean.

  Closes spec 003.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(specs): spec 003 (linear-status-mapping-parity) done - all plans complete

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
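
One way to picture the "providers can no longer silently drift" point is a resolver that iterates over a single canonical stage list, so a mapping for any declared stage is carried through. The sketch below is hypothetical: `CASCADE_STAGES`, `StatusMap`, and `resolveStatuses` are illustrative names, not the actual `ProjectPMConfig` or `resolveLifecycleConfig` code, and the position of `debug` at the end of the list is an assumption.

```typescript
// Stage names come from the commit message: the 8 wizard stages in lifecycle
// order, plus `debug` as the reserved ninth (placed last by assumption).
const CASCADE_STAGES = [
  "backlog", "splitting", "planning", "todo",
  "inProgress", "inReview", "done", "merged", "debug",
] as const;

type CascadeStage = (typeof CASCADE_STAGES)[number];
type StatusMap = Partial<Record<CascadeStage, string>>;

// Hypothetical resolver. Iterating over the canonical list (instead of naming
// a handful of stages by hand) means no declared stage can be dropped.
function resolveStatuses(
  providerStatuses: Record<string, string | undefined>,
): StatusMap {
  const resolved: StatusMap = {};
  for (const stage of CASCADE_STAGES) {
    const value = providerStatuses[stage];
    if (value) resolved[stage] = value;
  }
  return resolved;
}

// Example: a JIRA-style mapping that includes the previously dropped stages.
console.log(resolveStatuses({ splitting: "Splitting", planning: "Planning", todo: "To Do" }));
```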

…migration (specs 010 + 011/1-2) (#1148) * docs(010): add spec + plans for PM integration hardening followups * chore(010/1): lock plan 1 as .wip * feat(010/1): manifest createCustomField hook + pm.discovery mutations * chore(010/1): mutations complete, plan done * chore(010/2): lock plan 2 as .wip * feat(010/2): currentUser discovery capability + provider implementations * chore(010/2): read cleanup done, currentUser UX restored * chore(010/3): lock plan 3 with narrowed scope (option B) * feat(010/3): wizard-components done — shared step components + generator dispatch Upgrades the wizard generator from spec-010/1 placeholders to real shared React components for every StandardStepKind. Six new components at web/src/components/projects/pm-providers/steps/*.tsx: credentials, container-pick, status-mapping, label-mapping, webhook-url-display, project-scope. Generator exports STANDARD_STEP_COMPONENTS registry and dispatches through it; unknown kinds still warn-once and render a placeholder. Trello/JIRA/Linear wizards keep their per-provider step adapters from the spec-006 era — a future plan migrates them. The shared path is live for new providers today. new-provider-surface snapshot is tightened to pin the six new files; wizard-generator + per-provider manifest-wizardSpec tests now assert element.type identity against the registry instead of placeholder DOM shapes. 55 new/updated tests, all green. Docs updated: src/integrations/README.md (post-spec-010 additions), root CLAUDE.md (PM-integration summary), spec 009 forward-references spec 010, CHANGELOG entries for specs 009 + 010. Closes plan 010/3 of spec 010. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> * chore(010): spec done — all plans complete All three plans of spec 010 (PM integration hardening follow-ups) shipped. Mutations (010/1) added generic pm.discovery.createLabel / createCustomField endpoints. Read cleanup (010/2) added currentUser discovery capability. 
Wizard components (010/3) landed real shared React components for every StandardStepKind. Spec marked .done. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> * docs(011): spec + plan decomposition for wizard shared migration Spec 011 (PM Wizard Shared Migration) and its 5-plan decomposition. Migrates Trello/JIRA/Linear wizards onto the shared StandardStepKind components landed by spec 010 — closes the "zero per-provider step code" promise across all three production providers, not just new providers. Plans: 1-shared-components (widen container-pick/project-scope with searchable mode + widen webhook-url-display with optional signing- secret + add 7th StandardStepKind: custom-field-mapping), 2-trello (first consumer; OAuth stays kind:'custom'), 3-jira (issue-type stays kind:'custom'; free-text label mode), 4-linear (retire LinearWebhook- InfoPanel in favor of widened shared component), 5-cleanup (delete pm-wizard-{trello,jira,linear}-steps.tsx + final docs rewrite). Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> * chore(011/1): lock plan 1 (shared-components) Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> * feat(011/1): shared-components done — widen 3 steps + add custom-field-mapping kind Foundation plan for the wizard migration. Three additive widenings + one new StandardStepKind. All changes dormant until plan 2 activates them; the 31 spec-010 step tests pass unchanged as the backward-compat proof. - container-pick gains optional searchable?: boolean → dispatches to the existing shared Combobox (cmdk + radix) when true. - project-scope gains the same searchable? prop; empty value still means "no scope" in both render paths. - webhook-url-display gains optional secretFieldRole / secretLabel / secretValue / onSecretChange → renders an inline <input type="password"> below the URL when both role + callback are supplied. 
Defensive: omits the input if role is set but callback is not (avoids uncontrolled secret inputs silently dropping user input). - 7th StandardStepKind: 'custom-field-mapping'. New shared component at web/src/components/projects/pm-providers/steps/custom-field-mapping.tsx renders one row per CASCADE slot with a dropdown of discovered provider custom fields + optional inline "Create…" affordance wired to manifest.createCustomField (spec 010/1). Visual idiom matches status-mapping. - STANDARD_STEP_COMPONENTS registers the new kind; generator dispatch falls through the existing switch path. - new-provider-surface snapshot pins the 7th file. Tests use element-tree identity checks where SSR would hit the React instance mismatch (radix lives in web/node_modules and pulls its own React). 17 new/updated test assertions across 4 files. Full suite 8153/8153, lint 0/0, typecheck + build green. Closes plan 011/1 of spec 011. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> * chore(011/2): lock plan 2 (trello) Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> * feat(011/2): trello done — wizard migrated to shared step components First real consumer of the shared wizard components. Trello's legacy per-provider step file (pm-wizard-trello-steps.tsx) has no live importers outside itself; deletion deferred to plan 011/5. - TrelloOAuthStep: new custom step at pm-providers/trello/oauth-step.tsx. Lifts the window.open popup + manual-token fallback verbatim from the legacy TrelloCredentialsStep. Registered as kind:'custom' with component:'TrelloOAuthStep' in trelloManifest.wizardSpec. - trelloManifest.wizardSpec.steps: now [custom(TrelloOAuthStep), container-pick, status-mapping, label-mapping, custom-field-mapping, webhook-url-display] — 6 steps, one of them custom. - trelloProviderWizard.steps: rewritten to consume shared components via thin per-step adapters. 
useProviderHooks returns the flat shape each adapter slices (boardOptions, providerStates, providerLabels, providerCustomFields, onCreate* callbacks). Adapters call shared components directly with Trello-specific props. - Adapters file (trello/adapters.tsx): deleted — orphaned after the wizard rewrite. - useTrelloCustomFieldCreation: now accepts { name: string } argument (was hard-coded "Cost"). Enables the shared Create-form UX. Forward-edit to plan 011/1 (additive, existing tests unchanged): - label-mapping widened with optional labelDefaults?: Record<slot, {name, color?}> — pre-populates Create input, threads color to onCreateLabel. Trello uses it for cascade-ready/processing/etc. - custom-field-mapping widened with optional fieldDefaults?: Record<slot, {name}> — pre-populates Create input. Trello uses it for cost field. Normalize-upward UX changes (user-approved in behavior inventory): - Dropped retry button on board-picker error (shared component shows error text; operator refreshes page). - Dropped "Create All Missing Labels" batch button (per-slot Create covers the same ground, one click at a time). AC #9 (no operator regression) marked **deferred** — browser smoke test pending reviewer verification. Unit tests + conformance harness cover wire-level invariants; no runtime behavior change in adapters (discovery / label-creation / custom-field-creation hooks reused unchanged). 19 Trello tests: 5 wizardSpec + 7 oauth-step + 7 wizard-generator. Plus 2 forward-edit tests on the widened shared steps. Full suite 8169/8169, lint + typecheck + build all green. Closes plan 011/2 of spec 011. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> * chore(011/3): lock plan 3 (jira) Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> * feat(011/3): jira done — wizard migrated to shared step components Second real consumer of the shared wizard components. 
JIRA's legacy per-provider step file (pm-wizard-jira-steps.tsx) has no live importers outside itself; deletion deferred to plan 011/5. - IssueTypeMappingStep: new JIRA-specific custom step at pm-providers/jira/issue-type-step.tsx. Maps CASCADE task/subtask roles to JIRA issue types (filtered by the `subtask` flag). Stays kind:'custom' rather than becoming an 8th StandardStepKind because JIRA is the sole consumer today — speculative abstraction avoided. - jiraManifest.wizardSpec.steps: now [credentials, container-pick, status-mapping, label-mapping, custom-field-mapping, custom(IssueTypeMappingStep), webhook-url-display] — 7 steps, one custom. - jiraProviderWizard.steps: rewritten to consume shared components via thin per-step adapters. Credentials step uses the shared `CredentialsStep` with a synthetic `base_url` role alongside email + api_token — no OAuth popup needed for JIRA (unlike Trello). Label mapping passes providerLabels: [] so the shared step renders in free-text mode (JIRA labels are free-form). - Adapters file (jira/adapters.tsx): deleted — orphaned after rewrite. - useJiraCustomFieldCreation: now accepts { name: string } argument (was hard-coded "Cost") so the shared Create affordance works. Task 1 behavior inventory found the same 4 gap classes Trello surfaced; all four were already closed by plan 011/2's forward-edit to plan 011/1 (labelDefaults + fieldDefaults additive widenings). No additional shared-component changes were required for JIRA. AC #10 (no operator regression) marked **deferred** — browser smoke test pending reviewer verification on the deployed branch. Unit tests + conformance harness cover wire-level invariants; legacy discovery + custom-field hooks reused unchanged (only the name-arg tweak on the custom-field mutation). 17 JIRA tests: 6 manifest + 7 issue-type + 4 wizard-generator. JIRA had zero dedicated wizard-step tests before this plan — this is the first JIRA wizard coverage landing. 
Full suite 8185/8185, lint + typecheck + build all green. Closes plan 011/3 of spec 011. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> * chore(011/4): lock plan 4 (linear) Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> * feat(011/4): linear done — + parent-wizard fix for plans 2+3 regression Third real consumer of the shared wizard components. Plan 011/4 also ships a critical fix for a regression plans 011/2 and 011/3 introduced: pm-wizard.tsx hardcoded 3 manifest step slots (stepIndex 0/1/2) from the spec-006 era. Trello/JIRA wizardSpecs grew to 6+ steps; only the first 3 rendered on the deploy — label-mapping, custom-field-mapping, and issue-type-mapping steps were INVISIBLE in production. Fix: pm-wizard.tsx now iterates over `manifestDef.steps`, rendering one WizardStep slot per entry. Webhook steps (id ends with `-webhook`) are filtered out — the legacy WebhookStep still owns programmatic webhook registration (Trello/JIRA API calls) and Linear's signing-secret UX. The shared `webhook-url-display` component (widened in plan 011/1) remains dormant for the three existing providers until a follow-up plan migrates webhook-creation UX into the manifest path. Linear wizard migration: - linearProviderWizard.steps: rewritten to consume shared components via 6 thin per-step adapters. No kind:'custom' steps — Linear has no OAuth popup (like Trello) and no issue-type mapping (like JIRA). - LinearWebhookDisplayAdapter: Fragment composing shared WebhookUrlDisplayStep + ProjectSecretField (LINEAR_WEBHOOK_SECRET). Currently dormant; activates after legacy WebhookStep migration. - project-scope step (spec 005): uses the shared ProjectScopeStep with `searchable: true`. - label-mapping: uses shared component with LINEAR_LABEL_DEFAULTS (plan 011/1 forward-edit) pre-populating the Create input with cascade-ready/processing/etc. and threading hex colors. - Adapters file (linear/adapters.tsx): deleted — orphaned after rewrite. 
- Legacy step tests deleted: linear-field-mapping-step.test.ts, linear-team-step.test.ts, linear-webhook-info-panel.test.ts (-450 lines). Replaced by 8-test linear-wizard-generator.test.ts covering the wizard wiring + manifest↔definition parity. AC #3 (inline webhook secret), #6 (LinearWebhookInfoPanel retired), and #11 (no operator regression) marked **partial/deferred** — see Progress section for details. All other ACs green. Full suite 8167/8167, lint + typecheck + build all green. Closes plan 011/4 of spec 011. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> * chore(011/5): lock plan 5 (cleanup) Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> * feat(011/5): cleanup done — deleted 3 legacy step files + docs rewrite Closes spec 011 per user-approved option (a) — tight scope: deletions + docs. Scope-clipped items (full LinearWebhookInfoPanel retirement, Linear inline-secret via shared component) carry over as follow-up work; rationale is captured in the plan's Progress section. Deletions: - web/src/components/projects/pm-wizard-trello-steps.tsx (retired since plan 011/2; no live importers) - web/src/components/projects/pm-wizard-jira-steps.tsx (since 011/3) - web/src/components/projects/pm-wizard-linear-steps.tsx (since 011/4) - pm-wizard.tsx dead comments about transitive imports of the above Audits: - pm-wizard-common-steps.tsx — all three remaining exports (LinearWebhookInfoPanel, WebhookStep, SaveStep) still have live consumers via pm-wizard.tsx. File retained. - Dead-code grep: only doc-comment references to the deleted files remain; no live imports. Docs: - src/integrations/README.md — four-specs preamble (006/009/010/011); "seven kinds" in "Adding a new PM provider" step 3; Post-spec-011 additions table alongside the Post-spec-010 one. - CLAUDE.md (project root) — PM-integration summary references spec 011. - CHANGELOG.md — Internal entry for spec 011 alongside 009/010. 
- docs/specs/010-pm-integration-hardening-followups.md.done — forward-reference blockquote to spec 011. Verification: - npm test: 8167 passed, 23 skipped - npm run lint: clean - npm run typecheck: green - npm run build: green - Conformance harness: all three providers pass - new-provider-surface guard: 7 step files pinned Closes plan 011/5 of spec 011. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> * chore(011): spec done — all plans complete Five plans of spec 011 (PM Wizard Shared Migration) shipped. Shared components widened (plan 1), Trello migrated (plan 2), JIRA migrated (plan 3), Linear migrated + pm-wizard.tsx parent refactor (plan 4), legacy per-provider step files deleted + docs closed (plan 5). Spec marked .done. Deferred to follow-up spec: - Full migration of webhook-creation UX (Trello/JIRA programmatic webhook registration + Linear signing-secret persistence) into the manifest path. Legacy WebhookStep + LinearWebhookInfoPanel still render for this. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4 (1M context) <noreply@anthropic.com>
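
The plan 011/4 parent-wizard fix (iterate over `manifestDef.steps` instead of three hardcoded slots, and skip ids ending in `-webhook`) can be sketched roughly as below. The `WizardStepDef` shape, the `renderableSteps` helper, and the example ids are assumptions for illustration; they are not copied from `pm-wizard.tsx` or any manifest.

```typescript
// Minimal step shape for this sketch; the real manifest type has more fields.
interface WizardStepDef {
  id: string;
  kind: string;
}

// Returns the steps the parent wizard should render, in manifest order.
// Steps whose id ends with "-webhook" are skipped because, per the commit
// message, the legacy WebhookStep still owns webhook registration.
function renderableSteps(steps: WizardStepDef[]): WizardStepDef[] {
  return steps.filter((step) => !step.id.endsWith("-webhook"));
}

// Illustrative ids only. The kinds mirror the six Trello steps listed above.
const exampleSteps: WizardStepDef[] = [
  { id: "trello-oauth", kind: "custom" },
  { id: "trello-board", kind: "container-pick" },
  { id: "trello-status", kind: "status-mapping" },
  { id: "trello-labels", kind: "label-mapping" },
  { id: "trello-custom-fields", kind: "custom-field-mapping" },
  { id: "trello-webhook", kind: "webhook-url-display" },
];

// One slot per remaining entry, however many steps the manifest declares.
for (const [index, step] of renderableSteps(exampleSteps).entries()) {
  console.log(`slot ${index}: ${step.kind}`);
}
```

Driving the slot count from the manifest is what prevents a repeat of the regression: a wizardSpec that grows to six or seven steps gets six or seven slots without touching the parent component.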

…tructured envelope, --comment alias) (#1190)

* docs(014): spec + plans for cascade-tools agent ergonomics

  Adds docs/specs/014-cascade-tools-agent-ergonomics.md plus two plans covering shared-infra and create-pr-review adoption. Prompted by prod run 5d993b04-6e05-4ae1-b7de-8c274cf3496b.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan-014): lock plan 1 (shared-infra)

* feat(cascade-tools): plan 014/1 shared-infra: truthful prompts + envelope

  Ships the root-cause fix for prod run 5d993b04-6e05-4ae1-b7de-8c274cf3496b plus the shared infrastructure every future gadget inherits:

  - System-prompt renderer (src/backends/shared/nativeToolPrompts.ts) stops stripping trailing 's' from array param names and claiming '<string> (repeatable)' for every array. Array-of-object params now render as `--<flag> '<json>'` with aliases appended via `|` and a one-line runnable example from the tool definition.
  - Factory (src/gadgets/shared/cliCommandFactory.ts) gains oclif flag aliases, JSON parsing for array-of-object flags, file-input JSON parsing, `examples` wired into oclif `--help`, and Levenshtein-based 'did you mean' suggestions for mistyped flags (via fastest-levenshtein).
  - New shared error envelope (src/gadgets/shared/errorEnvelope.ts): every CLI failure emits `{"success":false,"error":{type,flag?,message,got?,expected?,hint?,example?}}` on stdout plus a one-line prose summary on stderr. All prior `this.error()` / flat `{success:false,error:"<string>"}` call sites migrated.
  - Contracts widened: ParameterDefinition gains `cliAliases`, FileInputAlternative gains `parseAs`, ToolManifest parameters carry `items`, `aliases`, `example`.
  - Manifest generator threads the new fields through.
  - bin/cascade-tools.js wraps `run()` to swallow oclif ExitError cleanly so the envelope isn't obscured by Node's default stack dump.

  Plan-1 ACs #1–#17 all delivered. 8438/8438 unit tests passing.

  Test surface delta: 57 new unit tests across errorEnvelope.test.ts, shared-nativeToolPrompts.test.ts, and factories.test.ts. Seven legacy assertions encoding the pre-014 error surface updated in cli/cli-command-factory, cli/file-input-flags, cli/scm/create-pr-sidecar, cli/scm/create-pr-review-sidecar, backends/claude-code.

  Plan 2 adopts the pattern on createPRReviewDef (zero shared-file edits), proving the declarative-metadata invariant.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan-014): lock plan 2 (createprreview-adopt)

* feat(cascade-tools): plan 014/2 createprreview-adopt + spec done

  Applies the spec-014 declarative-metadata pattern to createPRReviewDef:

  - --comment alias for --comments (the exact muscle-memory mistake from prod run 5d993b04-6e05-4ae1-b7de-8c274cf3496b).
  - --comments-file <path> (and - for stdin) JSON-parsed escape hatch for long payloads that don't survive shell quoting.
  - Two declarative fields on createPRReviewDef.parameters.comments.cliAliases + createPRReviewDef.cli.fileInputAlternatives.

  Zero edits to shared infrastructure (cliCommandFactory, manifestGenerator, nativeToolPrompts, errorEnvelope), which proves spec 014's single-entrypoint invariant.

  Per-plan ACs #1, #2, #3, #5, #6, #7, #8, #9, #11, #12 auto-verified (unit tests + build + lint + typecheck). AC #4 (binary-level smoke) tagged [manual] because vitest fork-pool workers fail to capture stdout/stderr from spawned binaries that do top-level await import(); the six scenarios were verified manually against the built binary and the trace is recorded in the plan. AC #10 n/a: integration test path abandoned for the same reason.

  All plans done. Spec 014 marked .done (docs/specs/014-*.md → .done). CHANGELOG Unreleased updated with a per-plan entry.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
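
To make the envelope and the 'did you mean' behavior concrete, here is a rough sketch. The envelope fields follow the shape quoted in the commit message; everything else is an assumption: the `unknown-flag` type string, the distance cutoff of 2, the `unknownFlagError` helper, and the hand-written `levenshtein` function, which stands in for the `fastest-levenshtein` package the commit says is used.

```typescript
// Field names follow the envelope shape quoted in the commit message.
interface CliErrorEnvelope {
  success: false;
  error: {
    type: string;
    flag?: string;
    message: string;
    got?: unknown;
    expected?: unknown;
    hint?: string;
    example?: string;
  };
}

// Classic dynamic-programming edit distance, kept in-file so the sketch is
// self-contained (a stand-in for fastest-levenshtein).
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: the "unknown-flag" type and the cutoff of 2 edits are
// assumptions of this sketch, not values taken from errorEnvelope.ts.
function unknownFlagError(flag: string, knownFlags: string[]): CliErrorEnvelope {
  const ranked = knownFlags
    .map((name) => ({ name, distance: levenshtein(flag, name) }))
    .sort((x, y) => x.distance - y.distance);
  const best = ranked[0];
  const hint =
    best && best.distance <= 2 ? `Did you mean --${best.name}?` : undefined;
  return {
    success: false,
    error: {
      type: "unknown-flag",
      flag,
      message: `Unknown flag --${flag}`,
      ...(hint ? { hint } : {}),
    },
  };
}

const envelope = unknownFlagError("coments", ["comments", "body", "event"]);
console.log(JSON.stringify(envelope)); // machine-readable envelope on stdout
console.error(envelope.error.message); // one-line prose summary on stderr
```

Keeping the structured envelope on stdout and the prose on stderr lets an agent parse the failure while a human still sees a readable line.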

* docs(spec/plans): add spec 015 + plans for router dispatch failure recovery Spec captures the silent black-hole bug class verified live on 2026-04-26 (ucho/MNG-350): a transient capacity miss or Docker error during worker spawn turns a webhook-driven job into a permanently failed BullMQ entry while stranding the work-item / agent-type locks for up to 30 minutes, silently rejecting subsequent webhooks for the same work item. Decomposed into two plans with safety-net-first sequencing: plan 1 hooks the BullMQ failed event to release locks on every dispatch failure path; plan 2 replaces the throw-on-capacity with a wait-for-slot semaphore, adds bounded retry with exponential backoff, and a transient/terminal error classifier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(plan): lock 015/1 failed-event-lock-compensation Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(router): plan 015/1 done — release locks on dispatch failure Closes the stranded-lock half of spec 015's bug class verified live in prod on 2026-04-26 (ucho/MNG-350). When a webhook-driven job's dispatch fails — capacity throw, Docker spawn error, or any future throw site — the work-item lock, agent-type concurrency counter, and recently-dispatched dedup mark established by the webhook → enqueue path are now released by a compensator hooked to BullMQ's `worker.on('failed')` event. What landed: - `src/router/dispatch-compensator.ts` (new) — `releaseLocksForFailedJob` wraps `extractProjectIdFromJob` / `extractWorkItemId` / `extractAgentType` and calls into `clearWorkItemEnqueued` / `clearAgentTypeEnqueued` / `clearRecentlyDispatched`. Never propagates errors; captures to Sentry with `tags: { source: 'dispatch_compensator' }`. - `src/router/agent-type-lock.ts` — exports new `clearRecentlyDispatched` for the compensator. 
The existing `markRecentlyDispatched` semantics are unchanged (60s TTL, NOT cleared on completion); this helper exists solely so a permanently-failed dispatch doesn't keep deduping a fresh webhook for ~60s while the user retries. - `src/router/bullmq-workers.ts` — extends the existing `worker.on('failed')` handler to invoke `releaseLocksForFailedJob` alongside the existing logger + Sentry calls. Wraps the call in a defensive `.catch` so a future regression in the compensator can't poison the worker. - `src/router/lock-state-classifier.ts` (new) — `classifyLockState` returns `'awaiting-slot'` when an active worker or queued/waiting job matches the trio, `'wedged'` when neither correlation matches. Defaults to `'awaiting-slot'` on classifier error so a Redis blip doesn't mis-emit the wedged canary. - `src/router/active-workers.ts` — `getActiveWorkers()` now exposes `(projectId, workItemId, agentType)` so the classifier can correlate. Backwards-compatible (existing callers work unchanged; new fields are additive optional). - `src/router/webhook-processor.ts` — Step 8 (work-item lock check) now splits the decision-reason vocabulary into three states: * `Job queued: ...` (success path) * `Awaiting worker slot: ...` (lock held + dispatch in flight; healthy) * `Work item locked (no active dispatch): ...` (wedged-lock canary) The wedged branch additionally fires `captureException` with `tags: { source: 'wedged_lock_canary' }` so any regression in compensation is loud in production. What this does NOT change (intentional, all in plan 015/2): - `guardedSpawn` still throws on capacity (BullMQ marks the job failed, the compensator now releases the locks, but the job itself is still lost). Plan 2 replaces the throw with a wait-for-slot semaphore. - Both queues still default to `attempts: 1`. Plan 2 raises this with exponential backoff and adds a transient/terminal error classifier. 
- CLAUDE.md is intentionally not updated by this plan — the unified passage describing both halves of the new contract lands in plan 015/2. Tests: - 5 new unit tests in `dispatch-compensator.test.ts` - 3 new unit tests in `agent-type-lock.test.ts` for `clearRecentlyDispatched` - 4 new unit tests in `bullmq-workers.test.ts` for the failed-event seam - 5 new unit tests in `lock-state-classifier.test.ts` - 2 new unit tests in `active-workers.test.ts` for the extended shape - 4 new unit tests in `webhook-processor.test.ts` for the three-way taxonomy - 3 new module-integration tests in `tests/integration/router/dispatch-failure-compensation.test.ts` exercise the real lock modules + real bullmq-workers.ts failed-event handler + real compensator end-to-end (only BullMQ's Worker constructor + the worker-env extractors are mocked). Full suite: 8515 passed / 23 skipped / 0 failed. Lint + typecheck clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(plan): lock 015/2 wait-for-slot-and-retry-classifier Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(plan): mark 015/2 status: wip * feat(router): plan 015/2 done — wait-for-slot + retry budget + classifier Closes the lost-job half of spec 015's bug class. Combined with plan 015/1, the silent black-hole failure mode verified live in prod on 2026-04-26 (ucho/MNG-350) is now fully closed. What landed: - `src/router/slot-waiter.ts` (new) — semaphore-style primitive: `acquireSlot({ timeoutMs })` resolves immediately when capacity is below `routerConfig.maxWorkers`, otherwise queues a FIFO waiter with a bounded timeout that rejects with `code: 'SLOT_WAIT_TIMEOUT'`. `slotReleased()` pops the head waiter; `clearAllWaiters()` rejects every pending waiter with `code: 'SHUTDOWN'` on router stop. 
- `src/router/dispatch-error-classifier.ts` (new): classifies thrown errors into `'transient'` (Docker socket Node codes, HTTP 429/409, SLOT_WAIT_TIMEOUT, anything unknown, i.e. default-to-retry) vs `'terminal'` (TypeError, ZodError, image-not-found-after-fallback).
- `src/router/worker-manager.ts`: `guardedSpawn` is rewritten so that `await acquireSlot(...)` replaces the synchronous capacity throw; on spawn error, terminal errors are wrapped in BullMQ's `UnrecoverableError` so retries skip; transient errors propagate unchanged so BullMQ retries via attempts/backoff.
- `src/router/active-workers.ts`: `cleanupWorker` now calls `slotReleased()` exactly once per cleanup, including on the crash path. The existing `if (worker)` guard ensures idempotence.
- `src/router/config.ts`: new `slotWaitTimeoutMs` field (default 5min, configurable via `SLOT_WAIT_TIMEOUT_MS`).
- `src/router/queue.ts` and `src/queue/client.ts`: both queues now default to `attempts: 4` with `backoff: { type: 'exponential', delay: 5000 }` (~75s total before exhaustion). Terminal errors bypass via `UnrecoverableError`.
- `src/router/container-manager.ts`: exports the existing `isImageNotFoundError` predicate so the classifier can reuse it.

Test contract change (spec AC #9): The previous `tests/unit/router/worker-manager.test.ts:179` assertion `'processFn throws when at capacity'` is REPLACED (not deleted) with `'processFn awaits a slot when at capacity, then dispatches when one frees'`. The throw-on-capacity contract is gone forever.
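The wait-for-slot primitive described above can be sketched roughly as follows. This is a minimal illustration, not the contents of `src/router/slot-waiter.ts`: the method names and the two error codes come from the commit message, while the `SlotWaiter` class, the `SlotError` type, and the internal `active` counter are assumptions made to keep the example self-contained (the real module presumably reads capacity from the active-worker registry and `routerConfig.maxWorkers`).

```typescript
// Hypothetical sketch of a FIFO slot waiter; internals are assumed, not taken from the repo.
type SlotErrorCode = "SLOT_WAIT_TIMEOUT" | "SHUTDOWN";

class SlotError extends Error {
  readonly code: SlotErrorCode;
  constructor(code: SlotErrorCode) {
    super(code);
    this.name = "SlotError";
    this.code = code;
  }
}

interface Waiter {
  resolve: () => void;
  reject: (err: SlotError) => void;
  timer: ReturnType<typeof setTimeout>;
}

class SlotWaiter {
  private active = 0;
  private readonly waiters: Waiter[] = [];
  private readonly maxWorkers: number;

  constructor(maxWorkers: number) {
    this.maxWorkers = maxWorkers;
  }

  get activeCount(): number {
    return this.active;
  }

  get pendingCount(): number {
    return this.waiters.length;
  }

  // Resolves immediately below capacity; otherwise queues a waiter with a bounded timeout.
  acquireSlot(opts: { timeoutMs: number }): Promise<void> {
    if (this.active < this.maxWorkers) {
      this.active += 1;
      return Promise.resolve();
    }
    return new Promise<void>((resolve, reject) => {
      const waiter: Waiter = {
        resolve,
        reject,
        timer: setTimeout(() => {
          const i = this.waiters.indexOf(waiter);
          if (i !== -1) this.waiters.splice(i, 1);
          reject(new SlotError("SLOT_WAIT_TIMEOUT"));
        }, opts.timeoutMs),
      };
      this.waiters.push(waiter);
    });
  }

  // Hands the freed slot to the head waiter (FIFO), or lowers the active count if nobody waits.
  slotReleased(): void {
    const head = this.waiters.shift();
    if (head) {
      clearTimeout(head.timer);
      head.resolve();
    } else if (this.active > 0) {
      this.active -= 1;
    }
  }

  // Rejects every pending waiter, e.g. on router stop.
  clearAllWaiters(): void {
    for (const w of this.waiters.splice(0)) {
      clearTimeout(w.timer);
      w.reject(new SlotError("SHUTDOWN"));
    }
  }
}
```

In this sketch a released slot is handed directly to the head waiter, so the active count stays constant while anyone is queued; whether the real module transfers slots this way or re-checks capacity is not stated in the message.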
Tests:
- 7 new unit tests in `slot-waiter.test.ts` (FIFO, timeout, no-op, shutdown rejection)
- 11 new unit tests in `dispatch-error-classifier.test.ts` covering every transient/terminal class
- 4 new unit tests in `worker-manager.test.ts` (replaced original capacity-throw test + 3 for retry classification)
- 3 new unit tests in `active-workers.test.ts` for slotReleased integration
- 5 new module-integration tests in `dispatch-retry.test.ts` exercise REAL guardedSpawn + REAL slot-waiter + REAL dispatch-error-classifier against both queues, mocking only spawnWorker + BullMQ Worker constructor.

Plan 1's 3 module-integration tests continue to pass alongside plan 2's 5. Full unit suite: 8539 passed / 23 skipped / 0 failed.

CLAUDE.md updated with a new "Dispatch failure semantics" section documenting the unified contract (capacity wait, retry budget, classifier, three-way decision-reason taxonomy from plan 1, wedged-lock canary). File now 182 lines, under the 200-line cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(spec): 015 done - router job dispatch failure recovery, all plans complete

Closes the silent black-hole bug class verified live on 2026-04-26 (ucho/MNG-350). Plan 1 added failed-event lock compensation + three-way decision-reason taxonomy; plan 2 replaced the throw-on-capacity with wait-for-slot, added bounded retry with exponential backoff, and introduced a transient/terminal error classifier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
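The three-way decision-reason taxonomy from plan 1 can be illustrated with a small sketch. The state names and the two "locked" reason prefixes come from the commit message; everything else is assumed for illustration: the `DispatchTrio` and `Candidate` types, the injected lookup callbacks, and the synchronous signature (the real `classifyLockState` talks to Redis/BullMQ and is presumably asynchronous).

```typescript
// Hypothetical sketch; only the state names and reason prefixes are taken from the source.
type LockState = "awaiting-slot" | "wedged";

interface DispatchTrio {
  projectId: string;
  workItemId: string;
  agentType: string;
}

// Entries may lack the correlation fields, which the message describes as additive and optional.
type Candidate = Partial<DispatchTrio>;

function matchesTrio(candidate: Candidate, trio: DispatchTrio): boolean {
  return (
    candidate.projectId === trio.projectId &&
    candidate.workItemId === trio.workItemId &&
    candidate.agentType === trio.agentType
  );
}

function classifyLockState(
  trio: DispatchTrio,
  listActiveWorkers: () => Candidate[],
  listQueuedJobs: () => Candidate[],
): LockState {
  try {
    const hit = (c: Candidate): boolean => matchesTrio(c, trio);
    if (listActiveWorkers().some(hit) || listQueuedJobs().some(hit)) {
      return "awaiting-slot"; // lock held and a dispatch is in flight: healthy
    }
    return "wedged"; // lock held but nothing correlates: canary case
  } catch {
    // A lookup failure must not raise a false wedged alarm.
    return "awaiting-slot";
  }
}

// Maps a held lock to one of the two "locked" decision reasons.
function lockedDecisionReason(state: LockState, detail: string): string {
  return state === "awaiting-slot"
    ? `Awaiting worker slot: ${detail}`
    : `Work item locked (no active dispatch): ${detail}`;
}
```

Defaulting to `'awaiting-slot'` on error trades a possibly missed canary for fewer false alarms, which matches the rationale given in the message for Redis blips.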

* fix(triggers): audit & fix PM feedback inconsistencies across respond-to-* agents (#1201)

* fix(triggers): audit & fix PM feedback inconsistencies across respond-to-* agents

* fix(triggers): use case-insensitive JIRA status comparison in isInPlanningStatus

Match the established pattern from status-changed.ts and label-added.ts which both use .toLowerCase() for JIRA status comparisons, since status names are user-configurable and the API does not guarantee consistent casing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Cascade Bot <bot@cascade.dev>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(linear): populate inlineMedia from descriptions/comments and add downloadAttachment (#1202)

Co-authored-by: Cascade Bot <bot@cascade.dev>

* fix(claude-code): pin pathToClaudeCodeExecutable so SDK skips broken native-binary probe (#1206)

The agent-harness SDK bump in #1197 (claude-agent-sdk 0.2.91 → 0.2.119) broke every review run on cascade-prod with:

    ReferenceError: Claude Code native binary not found at /app/node_modules/@anthropic-ai/claude-agent-sdk-linux-x64-musl/claude

The new SDK probes its own platform-specific optional-dependency subpackages for a bundled `claude` binary. Two failure modes hit at once:
1. Cascade installs `@anthropic-ai/claude-code@2.1.119` globally at /usr/local/bin/claude, but the SDK never looks there.
2. The SDK probes the `-musl` variant first regardless of host libc and errors on ENOENT instead of falling through to the glibc variant.

Pass an explicit `pathToClaudeCodeExecutable` to short-circuit the probe. The resolver checks (in order):
- $CLAUDE_CODE_EXECUTABLE_PATH env override (local-dev escape hatch)
- `which claude` in $PATH
- /usr/local/bin/claude (Docker default from Dockerfile.worker)

Two TDD tests pin the option onto query() and prove the env override wins. No Dockerfile change needed; the existing global install at /usr/local/bin/claude becomes the resolver's runtime target.
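The three-step resolution order above could look roughly like this. It is a sketch under assumptions, not the code from the PR: the function name `resolveClaudeExecutable` and the constant `DOCKER_DEFAULT_PATH` are hypothetical, and only the lookup order, the environment variable name, and the default path are taken from the message.

```typescript
import { execSync } from "node:child_process";

// Docker default named in the commit message (from Dockerfile.worker).
const DOCKER_DEFAULT_PATH = "/usr/local/bin/claude";

// Hypothetical resolver: env override first, then `which claude`, then the Docker default.
function resolveClaudeExecutable(env: NodeJS.ProcessEnv = process.env): string {
  const override = env.CLAUDE_CODE_EXECUTABLE_PATH;
  if (override) return override;

  try {
    const found = execSync("which claude", {
      env,
      encoding: "utf8",
      stdio: ["ignore", "pipe", "ignore"], // keep lookup noise off stderr
    }).trim();
    if (found) return found;
  } catch {
    // `which` exits non-zero when nothing is on PATH; fall through to the default.
  }

  return DOCKER_DEFAULT_PATH;
}
```

The resolved string is what would then be handed to the SDK as the `pathToClaudeCodeExecutable` option mentioned above, so the SDK's own binary probe never runs.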
Confirmed broken on ucho PR #72 (cascade-prod review agent crash).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* spec 015: router job dispatch failure recovery (#1203)

* docs(spec/plans): add spec 015 + plans for router dispatch failure recovery

Spec captures the silent black-hole bug class verified live on 2026-04-26 (ucho/MNG-350): a transient capacity miss or Docker error during worker spawn turns a webhook-driven job into a permanently failed BullMQ entry while stranding the work-item / agent-type locks for up to 30 minutes, silently rejecting subsequent webhooks for the same work item. Decomposed into two plans with safety-net-first sequencing: plan 1 hooks the BullMQ failed event to release locks on every dispatch failure path; plan 2 replaces the throw-on-capacity with a wait-for-slot semaphore, adds bounded retry with exponential backoff, and a transient/terminal error classifier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): lock 015/1 failed-event-lock-compensation

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(router): plan 015/1 done - release locks on dispatch failure

Closes the stranded-lock half of spec 015's bug class verified live in prod on 2026-04-26 (ucho/MNG-350). When a webhook-driven job's dispatch fails (capacity throw, Docker spawn error, or any future throw site), the work-item lock, agent-type concurrency counter, and recently-dispatched dedup mark established by the webhook → enqueue path are now released by a compensator hooked to BullMQ's `worker.on('failed')` event.

What landed:
- `src/router/dispatch-compensator.ts` (new): `releaseLocksForFailedJob` wraps `extractProjectIdFromJob` / `extractWorkItemId` / `extractAgentType` and calls into `clearWorkItemEnqueued` / `clearAgentTypeEnqueued` / `clearRecentlyDispatched`. Never propagates errors; captures to Sentry with `tags: { source: 'dispatch_compensator' }`.
- `src/router/agent-type-lock.ts`: exports new `clearRecentlyDispatched` for the compensator. The existing `markRecentlyDispatched` semantics are unchanged (60s TTL, NOT cleared on completion); this helper exists solely so a permanently-failed dispatch doesn't keep deduping a fresh webhook for ~60s while the user retries.
- `src/router/bullmq-workers.ts`: extends the existing `worker.on('failed')` handler to invoke `releaseLocksForFailedJob` alongside the existing logger + Sentry calls. Wraps the call in a defensive `.catch` so a future regression in the compensator can't poison the worker.
- `src/router/lock-state-classifier.ts` (new): `classifyLockState` returns `'awaiting-slot'` when an active worker or queued/waiting job matches the trio, `'wedged'` when neither correlation matches. Defaults to `'awaiting-slot'` on classifier error so a Redis blip doesn't mis-emit the wedged canary.
- `src/router/active-workers.ts`: `getActiveWorkers()` now exposes `(projectId, workItemId, agentType)` so the classifier can correlate. Backwards-compatible (existing callers work unchanged; new fields are additive optional).
- `src/router/webhook-processor.ts`: Step 8 (work-item lock check) now splits the decision-reason vocabulary into three states:
  * `Job queued: ...` (success path)
  * `Awaiting worker slot: ...` (lock held + dispatch in flight; healthy)
  * `Work item locked (no active dispatch): ...` (wedged-lock canary)

  The wedged branch additionally fires `captureException` with `tags: { source: 'wedged_lock_canary' }` so any regression in compensation is loud in production.

What this does NOT change (intentional, all in plan 015/2):
- `guardedSpawn` still throws on capacity (BullMQ marks the job failed, the compensator now releases the locks, but the job itself is still lost). Plan 2 replaces the throw with a wait-for-slot semaphore.
- Both queues still default to `attempts: 1`. Plan 2 raises this with exponential backoff and adds a transient/terminal error classifier.
- CLAUDE.md is intentionally not updated by this plan; the unified passage describing both halves of the new contract lands in plan 015/2.

Tests:
- 5 new unit tests in `dispatch-compensator.test.ts`
- 3 new unit tests in `agent-type-lock.test.ts` for `clearRecentlyDispatched`
- 4 new unit tests in `bullmq-workers.test.ts` for the failed-event seam
- 5 new unit tests in `lock-state-classifier.test.ts`
- 2 new unit tests in `active-workers.test.ts` for the extended shape
- 4 new unit tests in `webhook-processor.test.ts` for the three-way taxonomy
- 3 new module-integration tests in `tests/integration/router/dispatch-failure-compensation.test.ts` exercise the real lock modules + real bullmq-workers.ts failed-event handler + real compensator end-to-end (only BullMQ's Worker constructor + the worker-env extractors are mocked).

Full suite: 8515 passed / 23 skipped / 0 failed. Lint + typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): lock 015/2 wait-for-slot-and-retry-classifier

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): mark 015/2 status: wip

* feat(router): plan 015/2 done - wait-for-slot + retry budget + classifier

Closes the lost-job half of spec 015's bug class. Combined with plan 015/1, the silent black-hole failure mode verified live in prod on 2026-04-26 (ucho/MNG-350) is now fully closed.

What landed:
- `src/router/slot-waiter.ts` (new): a semaphore-style primitive. `acquireSlot({ timeoutMs })` resolves immediately when capacity is below `routerConfig.maxWorkers`, otherwise queues a FIFO waiter with a bounded timeout that rejects with `code: 'SLOT_WAIT_TIMEOUT'`. `slotReleased()` pops the head waiter; `clearAllWaiters()` rejects every pending waiter with `code: 'SHUTDOWN'` on router stop.
- `src/router/dispatch-error-classifier.ts` (new): classifies thrown errors into `'transient'` (Docker socket Node codes, HTTP 429/409, SLOT_WAIT_TIMEOUT, anything unknown, i.e. default-to-retry) vs `'terminal'` (TypeError, ZodError, image-not-found-after-fallback).
- `src/router/worker-manager.ts`: `guardedSpawn` is rewritten so that `await acquireSlot(...)` replaces the synchronous capacity throw; on spawn error, terminal errors are wrapped in BullMQ's `UnrecoverableError` so retries skip; transient errors propagate unchanged so BullMQ retries via attempts/backoff.
- `src/router/active-workers.ts`: `cleanupWorker` now calls `slotReleased()` exactly once per cleanup, including on the crash path. The existing `if (worker)` guard ensures idempotence.
- `src/router/config.ts`: new `slotWaitTimeoutMs` field (default 5min, configurable via `SLOT_WAIT_TIMEOUT_MS`).
- `src/router/queue.ts` and `src/queue/client.ts`: both queues now default to `attempts: 4` with `backoff: { type: 'exponential', delay: 5000 }` (~75s total before exhaustion). Terminal errors bypass via `UnrecoverableError`.
- `src/router/container-manager.ts`: exports the existing `isImageNotFoundError` predicate so the classifier can reuse it.

Test contract change (spec AC #9): The previous `tests/unit/router/worker-manager.test.ts:179` assertion `'processFn throws when at capacity'` is REPLACED (not deleted) with `'processFn awaits a slot when at capacity, then dispatches when one frees'`. The throw-on-capacity contract is gone forever.
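A default-to-retry classifier of the kind described above might be sketched as follows. The two class labels and the terminal categories come from the commit message; `looksLikeImageNotFound` is a hypothetical stand-in for the exported `isImageNotFoundError` predicate (its real logic is not shown in this PR), and ZodError is matched by name here only so the example needs no dependency.

```typescript
// Hypothetical sketch of a transient/terminal classifier; not the repo's implementation.
type DispatchErrorClass = "transient" | "terminal";

// Assumed shape: a 404 whose message mentions a missing image.
// The real `isImageNotFoundError` may use different criteria.
function looksLikeImageNotFound(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as { statusCode?: unknown; message?: unknown };
  return (
    e.statusCode === 404 &&
    typeof e.message === "string" &&
    /no such image/i.test(e.message)
  );
}

function classifyDispatchError(err: unknown): DispatchErrorClass {
  // Programming and validation errors will not fix themselves on retry.
  if (err instanceof TypeError) return "terminal";
  if (err instanceof Error && err.name === "ZodError") return "terminal";
  if (looksLikeImageNotFound(err)) return "terminal";

  // Docker socket errors, HTTP 429/409, SLOT_WAIT_TIMEOUT and anything
  // unrecognised all land here: the default is to retry.
  return "transient";
}
```

Because unknown errors default to `'transient'`, only the terminal cases need explicit branches; per the message, `guardedSpawn` then wraps terminal errors in BullMQ's `UnrecoverableError` so the retry budget is skipped.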
Tests:
- 7 new unit tests in `slot-waiter.test.ts` (FIFO, timeout, no-op, shutdown rejection)
- 11 new unit tests in `dispatch-error-classifier.test.ts` covering every transient/terminal class
- 4 new unit tests in `worker-manager.test.ts` (replaced original capacity-throw test + 3 for retry classification)
- 3 new unit tests in `active-workers.test.ts` for slotReleased integration
- 5 new module-integration tests in `dispatch-retry.test.ts` exercise REAL guardedSpawn + REAL slot-waiter + REAL dispatch-error-classifier against both queues, mocking only spawnWorker + BullMQ Worker constructor.

Plan 1's 3 module-integration tests continue to pass alongside plan 2's 5. Full unit suite: 8539 passed / 23 skipped / 0 failed.

CLAUDE.md updated with a new "Dispatch failure semantics" section documenting the unified contract (capacity wait, retry budget, classifier, three-way decision-reason taxonomy from plan 1, wedged-lock canary). File now 182 lines, under the 200-line cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(spec): 015 done - router job dispatch failure recovery, all plans complete

Closes the silent black-hole bug class verified live on 2026-04-26 (ucho/MNG-350). Plan 1 added failed-event lock compensation + three-way decision-reason taxonomy; plan 2 replaced the throw-on-capacity with wait-for-slot, added bounded retry with exponential backoff, and introduced a transient/terminal error classifier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(deps): bump postcss from 8.5.8 to 8.5.12 (#1204)

Bumps [postcss](https://github.com/postcss/postcss) from 8.5.8 to 8.5.12.
- [Release notes](https://github.com/postcss/postcss/releases)
- [Changelog](https://github.com/postcss/postcss/blob/main/CHANGELOG.md)
- [Commits](postcss/postcss@8.5.8...8.5.12)

---
updated-dependencies:
- dependency-name: postcss
  dependency-version: 8.5.12
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump uuid, bullmq and dockerode (#1192)

Removes [uuid](https://github.com/uuidjs/uuid). It's no longer used after updating ancestor dependencies [uuid](https://github.com/uuidjs/uuid), [bullmq](https://github.com/taskforcesh/bullmq) and [dockerode](https://github.com/apocas/dockerode). These dependencies need to be updated together.

Removes `uuid`

Updates `bullmq` from 5.72.0 to 5.76.2
- [Release notes](https://github.com/taskforcesh/bullmq/releases)
- [Commits](taskforcesh/bullmq@v5.72.0...v5.76.2)

Updates `dockerode` from 4.0.10 to 5.0.0
- [Release notes](https://github.com/apocas/dockerode/releases)
- [Commits](apocas/dockerode@v4.0.10...v5.0.0)

---
updated-dependencies:
- dependency-name: uuid
  dependency-version:
  dependency-type: indirect
- dependency-name: bullmq
  dependency-version: 5.76.2
  dependency-type: direct:production
- dependency-name: dockerode
  dependency-version: 5.0.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: aaight <aaight42@gmail.com>
Co-authored-by: Cascade Bot <bot@cascade.dev>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* spec 016: PM image delivery reliability (#1209)

* docs(spec/plans): add spec 016 + plans for PM image delivery reliability

Closes the silent screenshot-drop bug class verified live on 2026-04-26 (ucho/MNG-357): Linear's user-pasted-image URLs (uploads.linear.app/<uuid> with no file extension) were dropped at the pre-download MIME filter because mimeTypeFromUrl returned 'application/octet-stream' and filterImageMedia excluded them.
This affected all engines on the disk-write path, regardless of PR #948's Claude-Code SDK delivery fix.

Three plans, safety-net-first sequencing matching spec 015:
- Plan 1 (boot-path-mime-fix-and-diagnostic-log): defers MIME authority to download response Content-Type via image/* wildcard sentinel; adds the grep-stable diagnostic log line at extract time. Independently fixes MNG-357.
- Plan 2 (runtime-gadget-image-delivery): makes the runtime cascade-tools pm read-work-item gadget actually download + write images to disk with file paths returned in text. Closes the mid-run pickup gap. Depends on Plan 1's shared download-and-prepare helper.
- Plan 3 (linear-fixture-and-extraction-coverage): captures a Linear GraphQL Issue payload fixture for an issue with a pasted screenshot; pins extraction with a regression test that fails loudly if Linear ever changes the payload shape. Mostly tests + docs.

9 ACs, 0 manual-only. CLAUDE.md not updated (already covered by spec 015's silent-failure → diagnostic-line pattern).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): lock 016/1 boot-path-mime-fix-and-diagnostic-log

* feat(pm): plan 016/1 done - boot-path image MIME fix + diagnostic log

Closes the silent screenshot-drop bug class verified live on 2026-04-26 (ucho/MNG-357). Linear's extension-less pasted-image URLs (uploads.linear.app/<uuid>) now survive the pre-download MIME filter via an image/* wildcard sentinel. The download response's Content-Type header is the authoritative MIME; the wildcard is resolved before bytes are written.

What landed:
- src/pm/media.ts: new IMAGE_HOST_ALLOWLIST (currently 'uploads.linear.app'); mimeTypeFromUrl returns 'image/*' for extension-less URLs from allowlisted hosts; isImageMimeType accepts the wildcard.
- src/pm/download-and-prepare.ts (new): shared helper for the per-provider download dispatch loop (jira/linear/trello). Returns { images, failures }. Spec 016/2's runtime gadget will import this.
- src/agents/definitions/contextSteps.ts: fetchWorkItemStep refactored to use the shared helper; emits the new grep-stable diagnostic line '[image-pipeline] work-item-fetch summary' with stable fields: { provider, workItemId, urlsDetected, urlsAfterFilter, urlsDownloaded, urlsFailed, urlsByMimeType }.

Tests:
- 6 new unit tests in tests/unit/pm/media.test.ts (wildcard sentinel, Linear extension-less, regression for extensioned + non-PM URLs)
- 7 new unit tests in tests/unit/pm/download-and-prepare.test.ts
- 3 new diagnostic-log tests in contextSteps.test.ts; existing log message expectations updated to the new helper-prefix
- 3 module-integration tests in tests/integration/pm/image-pipeline.test.ts pinning the MNG-357 reproduction end-to-end with real mimeTypeFromUrl + filterImageMedia + extractMarkdownImages

PR #948's Claude-Code initial-input ImageBlockParam path is unchanged; existing regression test (claude-code.test.ts:939 'logs image injection and strips images before buildTaskPrompt') confirms.

Docs:
- CHANGELOG.md entry under Unreleased.
- src/integrations/README.md gains a new 'Image delivery contract' section documenting the shared resolution path, allowlist semantics, diagnostic log line schema, and the rule that providers shouldn't write their own MIME-detection.

Full unit suite: 8521 passed / 23 skipped / 0 failed. Lint + typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): lock 016/2 runtime-gadget-image-delivery

* feat(pm): plan 016/2 done - runtime gadget delivers images on disk

Closes the mid-run image pickup gap from spec 016. The runtime gadget `cascade-tools pm read-work-item` now downloads any image media and writes it to .cascade/context/images/work-item-<id>-img-<index>.<ext>, returning text whose new "Local Image Files" section lists actual file paths the agent's file-read tool can consume.
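The on-disk naming convention described above can be illustrated with a short sketch. The directory, the `work-item-<id>-img-<index>.<ext>` pattern, and the `.bin` fallback for an unresolved MIME type come from the commit messages in this PR; the function name `runtimeImagePath`, the `EXT_BY_MIME` table, and the choice of index base are assumptions for illustration only.

```typescript
// Hypothetical MIME-to-extension table; the real mapping is not shown in the PR.
const EXT_BY_MIME: Record<string, string> = {
  "image/png": "png",
  "image/jpeg": "jpg",
  "image/gif": "gif",
  "image/webp": "webp",
  "image/svg+xml": "svg",
};

const RUNTIME_IMAGE_DIR = ".cascade/context/images";

// Builds the path for one downloaded image. An unresolved type such as the
// 'image/*' sentinel falls back to '.bin' (the real code also logs a warning).
function runtimeImagePath(workItemId: string, index: number, mimeType: string): string {
  const ext = EXT_BY_MIME[mimeType.toLowerCase()] ?? "bin";
  return `${RUNTIME_IMAGE_DIR}/work-item-${workItemId}-img-${index}.${ext}`;
}
```

Deriving the extension from the resolved MIME type rather than from the URL matters for Linear uploads, whose URLs carry no extension at all.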
What landed:
- src/gadgets/pm/core/writeRuntimeImages.ts (new): writes ContextImage arrays to .cascade/context/images/ with stable naming convention (work-item-<id>-img-<i>.<ext>); extension derived from resolved MIME; falls back to .bin + warn log for unresolved image/* sentinel.
- src/gadgets/pm/core/readWorkItem.ts: readWorkItem now calls downloadAndPrepareImages (Plan 1's helper) + writeRuntimeImages (this plan), then mutates the returned text to include the local file paths via formatRuntimeImagePaths. Same diagnostic log line '[image-pipeline] work-item-fetch summary' as the boot path. Failed downloads surface in a "Failed Image Downloads" subsection.

Tests:
- 8 new unit tests in tests/unit/gadgets/pm/core/writeRuntimeImages.test.ts
- 5 new unit tests in tests/unit/gadgets/pm/core/readWorkItem.test.ts (spec 016/2 sub-describe)
- 4 new module-integration tests in tests/integration/gadgets/runtime-image-delivery.test.ts pinning the mid-run pickup contract end-to-end.

CHANGELOG.md entry added. Full unit suite (single-fork): 8534 passed / 23 skipped / 0 failed. Lint + typecheck clean. Three PM manifest test suites occasionally time out under parallel load on this machine; they were verified to pass in isolation, so this is not a code regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): lock 016/3 linear-fixture-and-extraction-coverage

* test(pm): plan 016/3 done — Linear fixture + extraction-coverage regression

Closes spec 016 with the regression net for the contract Plans 1+2 established. If Linear ever changes its Issue payload shape in a way that loses inline images, the extraction-coverage test fails loudly with a specific URL-missing message.
What landed:
- tests/fixtures/linear-issue-with-screenshot.json (new): reconstructed Linear GraphQL Issue payload covering: extension-less uploads.linear.app URL in description, extensioned Linear URL with alt text, external URL with image/svg+xml MIME, non-image markdown link (must NOT be picked up), one comment with a pasted screenshot, one comment without, and three formal Attachment records (Slack/GitHub/Sentry link previews).
- tests/unit/pm/linear/extraction-coverage.test.ts (new): 9 tests: description coverage with explicit expected-URL list, image/* sentinel for extension-less, concrete MIME for extensioned, image/svg+xml for external SVG, non-image link exclusion, comment coverage, comment source field, attachment-NOT-leaked rule, meta-test of regression net.
- src/integrations/README.md: new "Linear: GraphQL surface for inline images" subsection documenting the conclusion: Issue.description markdown is canonical for inline-pasted screenshots; Issue.attachments is for formal Attachment records (link previews) and is the wrong surface for inline images. Links to the fixture and the test.

No production code change: Plan 1's mimeTypeFromUrl + extractMarkdownImages already cover the cases. This plan ships the regression armor.

CHANGELOG.md entry added. Lint + typecheck clean. 9/9 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(spec): 016 done — pm-image-delivery-reliability, all plans complete

Closes the silent screenshot-drop bug class verified live on 2026-04-26 (ucho/MNG-357). Plan 1 added the Linear-extension-less MIME wildcard sentinel + diagnostic log line; plan 2 made the runtime cascade-tools pm read-work-item gadget actually deliver images on disk; plan 3 captured a Linear GraphQL fixture and pinned extraction coverage with a regression test.

CLAUDE.md untouched by this spec; it is already covered by spec 015's broader silent-failure → diagnostic-line pattern.
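The wildcard-sentinel idea from plan 016/1 can be sketched roughly as follows. Only the names `IMAGE_HOST_ALLOWLIST`, `mimeTypeFromUrl`, and `isImageMimeType` come from the commit messages; the function bodies, the extension table, and the `resolveMimeType` helper are assumptions made for illustration and are not the code in src/pm/media.ts.

```typescript
// Hypothetical sketch of the image/* sentinel flow. Names other than
// IMAGE_HOST_ALLOWLIST, mimeTypeFromUrl and isImageMimeType are invented here.
const IMAGE_HOST_ALLOWLIST: ReadonlySet<string> = new Set(["uploads.linear.app"]);
const IMAGE_WILDCARD = "image/*";

// Assumed extension table; the real module may recognise other types.
const EXTENSION_TO_MIME: ReadonlyMap<string, string> = new Map([
  ["png", "image/png"],
  ["jpg", "image/jpeg"],
  ["jpeg", "image/jpeg"],
  ["gif", "image/gif"],
  ["webp", "image/webp"],
  ["svg", "image/svg+xml"],
]);

// Guess a MIME type from the URL alone, before anything is downloaded.
function mimeTypeFromUrl(rawUrl: string): string | undefined {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return undefined; // not a parseable absolute URL
  }
  const lastSegment = url.pathname.split("/").pop() ?? "";
  const dot = lastSegment.lastIndexOf(".");
  if (dot > 0) {
    return EXTENSION_TO_MIME.get(lastSegment.slice(dot + 1).toLowerCase());
  }
  // Extension-less URL: only allowlisted hosts get the wildcard sentinel.
  return IMAGE_HOST_ALLOWLIST.has(url.hostname) ? IMAGE_WILDCARD : undefined;
}

// The pre-download filter accepts concrete image types and the sentinel alike.
function isImageMimeType(mime: string | undefined): boolean {
  return mime !== undefined && mime.startsWith("image/");
}

// Hypothetical helper: replace the sentinel with the Content-Type from the
// download response before any bytes are written.
function resolveMimeType(hinted: string, contentType: string | null): string | undefined {
  if (hinted !== IMAGE_WILDCARD) return hinted;
  const concrete = (contentType ?? "").split(";")[0]?.trim().toLowerCase() ?? "";
  return concrete.startsWith("image/") && concrete !== IMAGE_WILDCARD ? concrete : undefined;
}
```

In this sketch a non-image or missing Content-Type leaves the sentinel unresolved (`undefined`), which is the case the runtime writer described in plan 016/2 handles with its `.bin` fallback.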
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address code review concerns

* test(image-pipeline): supply urlsDetected in readWorkItemWithMedia mocks

The diagnostic-line assertion expected urlsDetected on the log payload, but the mocked readWorkItemWithMedia return values omitted it, so the field arrived as undefined and the toHaveBeenCalledWith match failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cascade Bot <bot@cascade.dev>

* fix(linear): drop comment-mention planning-state gate that prod payload never satisfies (#1210)

PR #1201 added a `currentStateId !== planningStateId` gate to the Linear comment @mention trigger that read `data.issue.stateId` from the webhook payload. Linear's Comment webhook does not ship `stateId` on the nested issue (verified across four prod payloads on 2026-04-26: 8cd0108a, b93e4925, 6548cd14, 3d95b210). The gate therefore always evaluated to true and silently dropped every legitimate bot @mention, including the one on MNG-346 that motivated this fix.

The agent (respond-to-planning-comment) is now responsible for any planning-only behavior; the trigger no longer gates on state and avoids an extra Linear GraphQL round-trip per comment.

Also corrects `LinearWebhookCommentTriggerData.issue` to match what Linear actually ships (six keys, no `stateId`, optional `team`); the old type lied and PR #1201 trusted it. Tests pin a real prod-shape Comment payload as a regression.

JIRA's equivalent gate is unaffected (its `comment_created` payload does ship `issue.fields.status.name`).
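To make the failure mode of the removed gate concrete, here is a minimal reconstruction. The comparison `currentStateId !== planningStateId` is taken from the commit message; the interface, its placeholder fields, the function name, and the state id value are assumptions for illustration and do not reproduce the real `LinearWebhookCommentTriggerData` type.

```typescript
// Illustrative stand-in for the nested `issue` on a Linear Comment webhook.
// Only the absence of `stateId` is taken from the commit message; the other
// fields are placeholders, not the real keys.
interface CommentIssueSketch {
  id: string;
  title: string;
  stateId?: string; // the old type claimed this was always present
}

// The PR #1201 gate, reconstructed: a `true` result meant "drop this mention".
function gateDropsMention(issue: CommentIssueSketch, planningStateId: string): boolean {
  const currentStateId = issue.stateId; // undefined on real Comment payloads
  return currentStateId !== planningStateId;
}

// A prod-shaped payload carries no stateId, so the comparison is always
// `undefined !== "<planning id>"`, which is true for every comment.
const prodShaped: CommentIssueSketch = { id: "issue-1", title: "Example" };
const droppedInProd = gateDropsMention(prodShaped, "state-planning");

// Only a payload that (hypothetically) shipped stateId could pass the gate.
const withState: CommentIssueSketch = { ...prodShaped, stateId: "state-planning" };
const droppedWithState = gateDropsMention(withState, "state-planning");
```

Because the optional field type-checks either way, a wrong type declaration hides this kind of bug; pinning a real payload shape in a test, as the commit does, is what catches it.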
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: aaight <aaight42@gmail.com>
Co-authored-by: Cascade Bot <bot@cascade.dev>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
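For the runtime naming convention described in the plan 016/2 commit above (`work-item-<id>-img-<index>.<ext>` under `.cascade/context/images/`, with a `.bin` fallback for an unresolved `image/*` sentinel), a small sketch might look like this. The function name, the MIME-to-extension table, and the zero-based index are assumptions; the real `writeRuntimeImages` also writes the files and emits a warn log, which this sketch leaves out.

```typescript
// Hypothetical helper that only builds the target path; it does no file I/O.
const IMAGE_DIR = ".cascade/context/images";

// Assumed MIME-to-extension table; the real mapping may differ.
const MIME_TO_EXTENSION: ReadonlyMap<string, string> = new Map([
  ["image/png", "png"],
  ["image/jpeg", "jpg"],
  ["image/gif", "gif"],
  ["image/webp", "webp"],
  ["image/svg+xml", "svg"],
]);

function runtimeImagePath(workItemId: string, index: number, mimeType: string): string {
  // Unknown or unresolved types (including the image/* sentinel) fall back to .bin.
  const ext = MIME_TO_EXTENSION.get(mimeType.toLowerCase()) ?? "bin";
  return `${IMAGE_DIR}/work-item-${workItemId}-img-${index}.${ext}`;
}

const resolvedPath = runtimeImagePath("MNG-357", 0, "image/png");
const fallbackPath = runtimeImagePath("MNG-357", 1, "image/*");
```

A stable, predictable path matters here because the gadget returns these paths as text and the agent's file-read tool has to open them later.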

zbigniewsobiecki merged commit d3415ea into dev Jan 1, 2026
2 checks passed

zbigniewsobiecki deleted the fix/postgresql-startup-reliability branch January 1, 2026 20:20

zbigniewsobiecki mentioned this pull request Apr 15, 2026

feat(pm): status mapping parity across Linear and JIRA #1114

Merged

8 tasks

zbigniewsobiecki mentioned this pull request Apr 18, 2026

feat(pm): integration hardening follow-ups + wizard shared-component migration (specs 010 + 011/1-2) #1148

Merged

7 tasks

This was referenced Apr 18, 2026

feat(pm): webhook UX manifest migration + legacy WebhookStep retirement (spec 012) #1149

Merged

feat(subprocess): observable subprocess helper — streaming + heartbeat + dual timeouts (spec 013) #1177

Merged

zbigniewsobiecki mentioned this pull request Apr 26, 2026

spec 015: router job dispatch failure recovery #1203

Merged

7 tasks

