fix: improve PostgreSQL startup reliability and agent documentation #9
Merged
zbigniewsobiecki merged 1 commit into dev on Jan 1, 2026
- Add robust error handling to startPostgres() that fails fast if PostgreSQL cannot start, preventing agent sessions from proceeding with a broken database
- Configure PostgreSQL in Dockerfile with password authentication:
  - User: postgres, Password: postgres
  - Connection string: postgresql://postgres:postgres@localhost:5432/postgres
  - Local socket uses trust, TCP connections require md5 password
- Update agent environment prompt with complete PostgreSQL documentation:
  - Connection string and CLI access
  - Start/stop/status commands using pg_ctl
  - Database creation example
This addresses issues where agents would waste iterations trying to troubleshoot PostgreSQL connectivity using incorrect commands (systemctl, brew services) that don't work in the container environment.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
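The fail-fast behaviour described for startPostgres() could look roughly like the sketch below. The PR's real implementation is not shown on this page, so `CommandRunner`, `startPostgresOrThrow`, and the data-directory argument are hypothetical names; the command runner is injected so the logic can be exercised without a PostgreSQL install. Only the `pg_ctl` subcommands and the connection string come from the description above.

```typescript
// Hypothetical sketch, not the PR's code: the runner is injected for testability.
type CommandResult = { exitCode: number; stderr: string };
type CommandRunner = (cmd: string, args: string[]) => CommandResult;

// Connection string documented in the PR description.
const CONNECTION_STRING = "postgresql://postgres:postgres@localhost:5432/postgres";

function startPostgresOrThrow(run: CommandRunner, dataDir: string): string {
  // `pg_ctl start -w` waits until the server accepts connections before returning.
  const started = run("pg_ctl", ["-D", dataDir, "-w", "start"]);
  if (started.exitCode !== 0) {
    // Fail fast so an agent session never proceeds with a broken database.
    throw new Error(`PostgreSQL failed to start: ${started.stderr.trim()}`);
  }
  // `pg_ctl status` exits with 0 only when a server is running in dataDir.
  const status = run("pg_ctl", ["-D", dataDir, "status"]);
  if (status.exitCode !== 0) {
    throw new Error("PostgreSQL reported a successful start but is not running");
  }
  return CONNECTION_STRING;
}
```

Throwing instead of logging and continuing is the point of the design: the caller gets one clear failure at session start rather than a series of confusing connection errors later.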
zbigniewsobiecki added a commit that referenced this pull request on Apr 15, 2026
Plan 003/1 (status-parity). Linear PM wizard Field Mapping step now exposes all 8 CASCADE stages (backlog, splitting, planning, todo, inProgress, inReview, done, merged) in lifecycle order instead of only 4. Linear operators can now map workflow states to splitting/planning/todo/merged and have the corresponding agents dispatch on issue transitions.
JIRA's silent-drop bug in resolveLifecycleConfig fixed: splitting/planning/todo mappings the JIRA wizard already accepted now surface through to PMLifecycleManager and GitHub PR triggers. No operator action required.
Canonical ProjectPMConfig.statuses widens to declare the full 9-stage vocabulary (including debug, reserved for future trigger), so providers can no longer silently drift from the trigger layer.
Existing Linear integrations upgrade in place: new slots render as 'not set' on next wizard visit. No migration.
Tests: 9 new unit tests (type shape + Linear + JIRA integration + SSR wizard). Integration coverage for spec ACs #8/#9 provided by existing linear-status-changed and jira-status-changed trigger handler tests — discovered during Phase 4 that handlers read provider-specific config directly (not resolveLifecycleConfig), so dispatch was never blocked for handlers; the drop was in downstream PMLifecycleManager callers.
Totals: 7597 unit + 522 integration all green. Lint + typecheck clean.
Closes spec 003.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8 tasks
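To illustrate the status-parity idea in the commit above, here is a minimal sketch of a 9-slot vocabulary and a resolver that passes every mapped slot through. The stage names are taken from the commit message; `StatusMapping` and `resolveStatuses` are hypothetical names and do not reproduce the real `ProjectPMConfig.statuses` or `resolveLifecycleConfig` code.

```typescript
// Stage names from the commit message, in lifecycle order.
const CASCADE_STAGES = [
  "backlog", "splitting", "planning", "todo",
  "inProgress", "inReview", "done", "merged",
] as const;
type CascadeStage = (typeof CASCADE_STAGES)[number];

// `debug` is declared but reserved, which gives the 9-stage vocabulary.
type StatusSlot = CascadeStage | "debug";
type StatusMapping = Partial<Record<StatusSlot, string>>;

// Hypothetical resolver: iterating the full vocabulary means a slot the
// wizard accepted cannot be dropped by an incomplete hand-written copy.
function resolveStatuses(providerStatuses: StatusMapping): StatusMapping {
  const resolved: StatusMapping = {};
  for (const slot of [...CASCADE_STAGES, "debug" as const]) {
    const value = providerStatuses[slot];
    if (value !== undefined) resolved[slot] = value;
  }
  return resolved;
}
```

Driving the copy from one canonical list, rather than naming four fields by hand, is one way to keep providers from drifting away from the trigger layer.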
zbigniewsobiecki added a commit that referenced this pull request on Apr 15, 2026
* docs(plans): add spec 003 + plan; lock plan 003/1
Spec 003 introduces PM status mapping parity across Linear and JIRA. Plan 1 (status-parity) locked for execution.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(plans): plan 003/1 frontmatter status -> wip
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(pm): status mapping parity across Linear and JIRA
Plan 003/1 (status-parity). Linear PM wizard Field Mapping step now exposes all 8 CASCADE stages (backlog, splitting, planning, todo, inProgress, inReview, done, merged) in lifecycle order instead of only 4. Linear operators can now map workflow states to splitting/planning/todo/merged and have the corresponding agents dispatch on issue transitions.
JIRA's silent-drop bug in resolveLifecycleConfig fixed: splitting/planning/todo mappings the JIRA wizard already accepted now surface through to PMLifecycleManager and GitHub PR triggers. No operator action required.
Canonical ProjectPMConfig.statuses widens to declare the full 9-stage vocabulary (including debug, reserved for future trigger), so providers can no longer silently drift from the trigger layer.
Existing Linear integrations upgrade in place: new slots render as 'not set' on next wizard visit. No migration.
Tests: 9 new unit tests (type shape + Linear + JIRA integration + SSR wizard). Integration coverage for spec ACs #8/#9 provided by existing linear-status-changed and jira-status-changed trigger handler tests — discovered during Phase 4 that handlers read provider-specific config directly (not resolveLifecycleConfig), so dispatch was never blocked for handlers; the drop was in downstream PMLifecycleManager callers.
Totals: 7597 unit + 522 integration all green. Lint + typecheck clean.
Closes spec 003.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(specs): spec 003 (linear-status-mapping-parity) done — all plans complete
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
zbigniewsobiecki added a commit that referenced this pull request on Apr 18, 2026
First real consumer of the shared wizard components. Trello's legacy
per-provider step file (pm-wizard-trello-steps.tsx) has no live importers
outside itself; deletion deferred to plan 011/5.
- TrelloOAuthStep: new custom step at pm-providers/trello/oauth-step.tsx.
Lifts the window.open popup + manual-token fallback verbatim from the
legacy TrelloCredentialsStep. Registered as kind:'custom' with
component:'TrelloOAuthStep' in trelloManifest.wizardSpec.
- trelloManifest.wizardSpec.steps: now [custom(TrelloOAuthStep),
container-pick, status-mapping, label-mapping, custom-field-mapping,
webhook-url-display] — 6 steps, one of them custom.
- trelloProviderWizard.steps: rewritten to consume shared components via
thin per-step adapters. useProviderHooks returns the flat shape each
adapter slices (boardOptions, providerStates, providerLabels,
providerCustomFields, onCreate* callbacks). Adapters call shared
components directly with Trello-specific props.
- Adapters file (trello/adapters.tsx): deleted — orphaned after the
wizard rewrite.
- useTrelloCustomFieldCreation: now accepts { name: string } argument
(was hard-coded "Cost"). Enables the shared Create-form UX.
Forward-edit to plan 011/1 (additive, existing tests unchanged):
- label-mapping widened with optional labelDefaults?: Record<slot,
{name, color?}> — pre-populates Create input, threads color to
onCreateLabel. Trello uses it for cascade-ready/processing/etc.
- custom-field-mapping widened with optional fieldDefaults?: Record<slot,
{name}> — pre-populates Create input. Trello uses it for cost field.
Normalize-upward UX changes (user-approved in behavior inventory):
- Dropped retry button on board-picker error (shared component shows
error text; operator refreshes page).
- Dropped "Create All Missing Labels" batch button (per-slot Create
covers the same ground, one click at a time).
AC #9 (no operator regression) marked **deferred** — browser smoke test
pending reviewer verification. Unit tests + conformance harness cover
wire-level invariants; no runtime behavior change in adapters
(discovery / label-creation / custom-field-creation hooks reused
unchanged).
19 Trello tests: 5 wizardSpec + 7 oauth-step + 7 wizard-generator. Plus
2 forward-edit tests on the widened shared steps. Full suite 8169/8169,
lint + typecheck + build all green.
Closes plan 011/2 of spec 011.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
7 tasks
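The step list described for `trelloManifest.wizardSpec.steps` can be pictured as a discriminated union of standard and custom steps. This is only a sketch: the step kinds and the `TrelloOAuthStep` component name come from the commit message, while the `WizardStep` type and `countCustomSteps` helper are hypothetical and the real manifest types may carry more fields.

```typescript
// Standard step kinds named across the referenced commits.
type StandardStepKind =
  | "credentials" | "container-pick" | "status-mapping" | "label-mapping"
  | "custom-field-mapping" | "webhook-url-display" | "project-scope";

// Hypothetical shape: a step is either a standard kind or a named custom component.
type WizardStep =
  | { kind: StandardStepKind }
  | { kind: "custom"; component: string };

// The six Trello steps in the order given by the commit message.
const trelloSteps: WizardStep[] = [
  { kind: "custom", component: "TrelloOAuthStep" },
  { kind: "container-pick" },
  { kind: "status-mapping" },
  { kind: "label-mapping" },
  { kind: "custom-field-mapping" },
  { kind: "webhook-url-display" },
];

// Counts the steps that still need provider-specific UI.
function countCustomSteps(steps: WizardStep[]): number {
  return steps.filter((step) => step.kind === "custom").length;
}
```

With this shape, a check such as "6 steps, one of them custom" becomes a one-line assertion over the manifest data instead of an inspection of rendered components.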
zbigniewsobiecki added a commit that referenced this pull request on Apr 18, 2026
…migration (specs 010 + 011/1-2) (#1148)
* docs(010): add spec + plans for PM integration hardening followups
* chore(010/1): lock plan 1 as .wip
* feat(010/1): manifest createCustomField hook + pm.discovery mutations
* chore(010/1): mutations complete, plan done
* chore(010/2): lock plan 2 as .wip
* feat(010/2): currentUser discovery capability + provider implementations
* chore(010/2): read cleanup done, currentUser UX restored
* chore(010/3): lock plan 3 with narrowed scope (option B)
* feat(010/3): wizard-components done — shared step components + generator dispatch
Upgrades the wizard generator from spec-010/1 placeholders to real shared React components for every StandardStepKind. Six new components at web/src/components/projects/pm-providers/steps/*.tsx: credentials, container-pick, status-mapping, label-mapping, webhook-url-display, project-scope. Generator exports STANDARD_STEP_COMPONENTS registry and dispatches through it; unknown kinds still warn-once and render a placeholder.
Trello/JIRA/Linear wizards keep their per-provider step adapters from the spec-006 era — a future plan migrates them. The shared path is live for new providers today.
new-provider-surface snapshot is tightened to pin the six new files; wizard-generator + per-provider manifest-wizardSpec tests now assert element.type identity against the registry instead of placeholder DOM shapes. 55 new/updated tests, all green.
Docs updated: src/integrations/README.md (post-spec-010 additions), root CLAUDE.md (PM-integration summary), spec 009 forward-references spec 010, CHANGELOG entries for specs 009 + 010.
Closes plan 010/3 of spec 010.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
* chore(010): spec done — all plans complete
All three plans of spec 010 (PM integration hardening follow-ups) shipped. Mutations (010/1) added generic pm.discovery.createLabel / createCustomField endpoints. Read cleanup (010/2) added currentUser discovery capability. Wizard components (010/3) landed real shared React components for every StandardStepKind. Spec marked .done.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
* docs(011): spec + plan decomposition for wizard shared migration
Spec 011 (PM Wizard Shared Migration) and its 5-plan decomposition. Migrates Trello/JIRA/Linear wizards onto the shared StandardStepKind components landed by spec 010 — closes the "zero per-provider step code" promise across all three production providers, not just new providers.
Plans: 1-shared-components (widen container-pick/project-scope with searchable mode + widen webhook-url-display with optional signing-secret + add 7th StandardStepKind: custom-field-mapping), 2-trello (first consumer; OAuth stays kind:'custom'), 3-jira (issue-type stays kind:'custom'; free-text label mode), 4-linear (retire LinearWebhookInfoPanel in favor of widened shared component), 5-cleanup (delete pm-wizard-{trello,jira,linear}-steps.tsx + final docs rewrite).
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
* chore(011/1): lock plan 1 (shared-components)
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
* feat(011/1): shared-components done — widen 3 steps + add custom-field-mapping kind
Foundation plan for the wizard migration. Three additive widenings + one new StandardStepKind. All changes dormant until plan 2 activates them; the 31 spec-010 step tests pass unchanged as the backward-compat proof.
- container-pick gains optional searchable?: boolean → dispatches to the existing shared Combobox (cmdk + radix) when true.
- project-scope gains the same searchable? prop; empty value still means "no scope" in both render paths.
- webhook-url-display gains optional secretFieldRole / secretLabel / secretValue / onSecretChange → renders an inline <input type="password"> below the URL when both role + callback are supplied. Defensive: omits the input if role is set but callback is not (avoids uncontrolled secret inputs silently dropping user input).
- 7th StandardStepKind: 'custom-field-mapping'. New shared component at web/src/components/projects/pm-providers/steps/custom-field-mapping.tsx renders one row per CASCADE slot with a dropdown of discovered provider custom fields + optional inline "Create…" affordance wired to manifest.createCustomField (spec 010/1). Visual idiom matches status-mapping.
- STANDARD_STEP_COMPONENTS registers the new kind; generator dispatch falls through the existing switch path.
- new-provider-surface snapshot pins the 7th file.
Tests use element-tree identity checks where SSR would hit the React instance mismatch (radix lives in web/node_modules and pulls its own React). 17 new/updated test assertions across 4 files. Full suite 8153/8153, lint 0/0, typecheck + build green.
Closes plan 011/1 of spec 011.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
* chore(011/2): lock plan 2 (trello)
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
* feat(011/2): trello done — wizard migrated to shared step components
First real consumer of the shared wizard components. Trello's legacy per-provider step file (pm-wizard-trello-steps.tsx) has no live importers outside itself; deletion deferred to plan 011/5.
- TrelloOAuthStep: new custom step at pm-providers/trello/oauth-step.tsx. Lifts the window.open popup + manual-token fallback verbatim from the legacy TrelloCredentialsStep. Registered as kind:'custom' with component:'TrelloOAuthStep' in trelloManifest.wizardSpec.
- trelloManifest.wizardSpec.steps: now [custom(TrelloOAuthStep), container-pick, status-mapping, label-mapping, custom-field-mapping, webhook-url-display] — 6 steps, one of them custom.
- trelloProviderWizard.steps: rewritten to consume shared components via thin per-step adapters. useProviderHooks returns the flat shape each adapter slices (boardOptions, providerStates, providerLabels, providerCustomFields, onCreate* callbacks). Adapters call shared components directly with Trello-specific props.
- Adapters file (trello/adapters.tsx): deleted — orphaned after the wizard rewrite.
- useTrelloCustomFieldCreation: now accepts { name: string } argument (was hard-coded "Cost"). Enables the shared Create-form UX.
Forward-edit to plan 011/1 (additive, existing tests unchanged):
- label-mapping widened with optional labelDefaults?: Record<slot, {name, color?}> — pre-populates Create input, threads color to onCreateLabel. Trello uses it for cascade-ready/processing/etc.
- custom-field-mapping widened with optional fieldDefaults?: Record<slot, {name}> — pre-populates Create input. Trello uses it for cost field.
Normalize-upward UX changes (user-approved in behavior inventory):
- Dropped retry button on board-picker error (shared component shows error text; operator refreshes page).
- Dropped "Create All Missing Labels" batch button (per-slot Create covers the same ground, one click at a time).
AC #9 (no operator regression) marked **deferred** — browser smoke test pending reviewer verification. Unit tests + conformance harness cover wire-level invariants; no runtime behavior change in adapters (discovery / label-creation / custom-field-creation hooks reused unchanged).
19 Trello tests: 5 wizardSpec + 7 oauth-step + 7 wizard-generator. Plus 2 forward-edit tests on the widened shared steps. Full suite 8169/8169, lint + typecheck + build all green.
Closes plan 011/2 of spec 011.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
* chore(011/3): lock plan 3 (jira)
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
* feat(011/3): jira done — wizard migrated to shared step components
Second real consumer of the shared wizard components. JIRA's legacy per-provider step file (pm-wizard-jira-steps.tsx) has no live importers outside itself; deletion deferred to plan 011/5.
- IssueTypeMappingStep: new JIRA-specific custom step at pm-providers/jira/issue-type-step.tsx. Maps CASCADE task/subtask roles to JIRA issue types (filtered by the `subtask` flag). Stays kind:'custom' rather than becoming an 8th StandardStepKind because JIRA is the sole consumer today — speculative abstraction avoided.
- jiraManifest.wizardSpec.steps: now [credentials, container-pick, status-mapping, label-mapping, custom-field-mapping, custom(IssueTypeMappingStep), webhook-url-display] — 7 steps, one custom.
- jiraProviderWizard.steps: rewritten to consume shared components via thin per-step adapters. Credentials step uses the shared `CredentialsStep` with a synthetic `base_url` role alongside email + api_token — no OAuth popup needed for JIRA (unlike Trello). Label mapping passes providerLabels: [] so the shared step renders in free-text mode (JIRA labels are free-form).
- Adapters file (jira/adapters.tsx): deleted — orphaned after rewrite.
- useJiraCustomFieldCreation: now accepts { name: string } argument (was hard-coded "Cost") so the shared Create affordance works.
Task 1 behavior inventory found the same 4 gap classes Trello surfaced; all four were already closed by plan 011/2's forward-edit to plan 011/1 (labelDefaults + fieldDefaults additive widenings). No additional shared-component changes were required for JIRA.
AC #10 (no operator regression) marked **deferred** — browser smoke test pending reviewer verification on the deployed branch. Unit tests + conformance harness cover wire-level invariants; legacy discovery + custom-field hooks reused unchanged (only the name-arg tweak on the custom-field mutation).
17 JIRA tests: 6 manifest + 7 issue-type + 4 wizard-generator. JIRA had zero dedicated wizard-step tests before this plan — this is the first JIRA wizard coverage landing. Full suite 8185/8185, lint + typecheck + build all green.
Closes plan 011/3 of spec 011.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
* chore(011/4): lock plan 4 (linear)
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
* feat(011/4): linear done — + parent-wizard fix for plans 2+3 regression
Third real consumer of the shared wizard components. Plan 011/4 also ships a critical fix for a regression plans 011/2 and 011/3 introduced: pm-wizard.tsx hardcoded 3 manifest step slots (stepIndex 0/1/2) from the spec-006 era. Trello/JIRA wizardSpecs grew to 6+ steps; only the first 3 rendered on the deploy — label-mapping, custom-field-mapping, and issue-type-mapping steps were INVISIBLE in production.
Fix: pm-wizard.tsx now iterates over `manifestDef.steps`, rendering one WizardStep slot per entry. Webhook steps (id ends with `-webhook`) are filtered out — the legacy WebhookStep still owns programmatic webhook registration (Trello/JIRA API calls) and Linear's signing-secret UX. The shared `webhook-url-display` component (widened in plan 011/1) remains dormant for the three existing providers until a follow-up plan migrates webhook-creation UX into the manifest path.
Linear wizard migration:
- linearProviderWizard.steps: rewritten to consume shared components via 6 thin per-step adapters. No kind:'custom' steps — Linear has no OAuth popup (like Trello) and no issue-type mapping (like JIRA).
- LinearWebhookDisplayAdapter: Fragment composing shared WebhookUrlDisplayStep + ProjectSecretField (LINEAR_WEBHOOK_SECRET). Currently dormant; activates after legacy WebhookStep migration.
- project-scope step (spec 005): uses the shared ProjectScopeStep with `searchable: true`.
- label-mapping: uses shared component with LINEAR_LABEL_DEFAULTS (plan 011/1 forward-edit) pre-populating the Create input with cascade-ready/processing/etc. and threading hex colors.
- Adapters file (linear/adapters.tsx): deleted — orphaned after rewrite.
- Legacy step tests deleted: linear-field-mapping-step.test.ts, linear-team-step.test.ts, linear-webhook-info-panel.test.ts (-450 lines). Replaced by 8-test linear-wizard-generator.test.ts covering the wizard wiring + manifest↔definition parity.
AC #3 (inline webhook secret), #6 (LinearWebhookInfoPanel retired), and #11 (no operator regression) marked **partial/deferred** — see Progress section for details. All other ACs green.
Full suite 8167/8167, lint + typecheck + build all green.
Closes plan 011/4 of spec 011.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
* chore(011/5): lock plan 5 (cleanup)
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
* feat(011/5): cleanup done — deleted 3 legacy step files + docs rewrite
Closes spec 011 per user-approved option (a) — tight scope: deletions + docs. Scope-clipped items (full LinearWebhookInfoPanel retirement, Linear inline-secret via shared component) carry over as follow-up work; rationale is captured in the plan's Progress section.
Deletions:
- web/src/components/projects/pm-wizard-trello-steps.tsx (retired since plan 011/2; no live importers)
- web/src/components/projects/pm-wizard-jira-steps.tsx (since 011/3)
- web/src/components/projects/pm-wizard-linear-steps.tsx (since 011/4)
- pm-wizard.tsx dead comments about transitive imports of the above
Audits:
- pm-wizard-common-steps.tsx — all three remaining exports (LinearWebhookInfoPanel, WebhookStep, SaveStep) still have live consumers via pm-wizard.tsx. File retained.
- Dead-code grep: only doc-comment references to the deleted files remain; no live imports.
Docs:
- src/integrations/README.md — four-specs preamble (006/009/010/011); "seven kinds" in "Adding a new PM provider" step 3; Post-spec-011 additions table alongside the Post-spec-010 one.
- CLAUDE.md (project root) — PM-integration summary references spec 011.
- CHANGELOG.md — Internal entry for spec 011 alongside 009/010.
- docs/specs/010-pm-integration-hardening-followups.md.done — forward-reference blockquote to spec 011.
Verification:
- npm test: 8167 passed, 23 skipped
- npm run lint: clean
- npm run typecheck: green
- npm run build: green
- Conformance harness: all three providers pass
- new-provider-surface guard: 7 step files pinned
Closes plan 011/5 of spec 011.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
* chore(011): spec done — all plans complete
Five plans of spec 011 (PM Wizard Shared Migration) shipped. Shared components widened (plan 1), Trello migrated (plan 2), JIRA migrated (plan 3), Linear migrated + pm-wizard.tsx parent refactor (plan 4), legacy per-provider step files deleted + docs closed (plan 5). Spec marked .done.
Deferred to follow-up spec:
- Full migration of webhook-creation UX (Trello/JIRA programmatic webhook registration + Linear signing-secret persistence) into the manifest path. Legacy WebhookStep + LinearWebhookInfoPanel still render for this.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4 (1M context) <noreply@anthropic.com>
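The parent-wizard fix from plan 011/4 (iterate over `manifestDef.steps` instead of three hardcoded slots, and skip steps whose id ends with `-webhook`) can be sketched as a plain filter. The `ManifestStep` shape, the `renderableSteps` name, and the step ids below are hypothetical; only the iteration rule and the `-webhook` suffix come from the commit message.

```typescript
// Hypothetical minimal shape of a manifest step definition.
type ManifestStep = { id: string; title: string };

// Returns every step the parent wizard should render as its own slot.
// Webhook steps are excluded because the legacy WebhookStep still owns them.
function renderableSteps(steps: ManifestStep[]): ManifestStep[] {
  return steps.filter((step) => !step.id.endsWith("-webhook"));
}

// Illustrative ids only; the real manifests may name their steps differently.
const exampleSteps: ManifestStep[] = [
  { id: "trello-oauth", title: "Connect" },
  { id: "trello-board", title: "Board" },
  { id: "trello-status", title: "Statuses" },
  { id: "trello-labels", title: "Labels" },
  { id: "trello-custom-fields", title: "Custom fields" },
  { id: "trello-webhook", title: "Webhook" },
];
```

Because the slot count now follows the data, adding a fourth or seventh step to a manifest no longer leaves it silently unrendered.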
This was referenced Apr 18, 2026
zbigniewsobiecki added a commit that referenced this pull request on Apr 25, 2026
…tructured envelope, --comment alias) (#1190)
* docs(014): spec + plans for cascade-tools agent ergonomics
Adds docs/specs/014-cascade-tools-agent-ergonomics.md plus two plans covering shared-infra and create-pr-review adoption. Prompted by prod run 5d993b04-6e05-4ae1-b7de-8c274cf3496b.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(plan-014): lock plan 1 (shared-infra)
* feat(cascade-tools): plan 014/1 shared-infra — truthful prompts + envelope
Ships the root-cause fix for prod run 5d993b04-6e05-4ae1-b7de-8c274cf3496b plus the shared infrastructure every future gadget inherits:
- System-prompt renderer (src/backends/shared/nativeToolPrompts.ts) stops stripping trailing 's' from array param names and claiming '<string> (repeatable)' for every array. Array-of-object params now render as `--<flag> '<json>'` with aliases appended via `|` and a one-line runnable example from the tool definition.
- Factory (src/gadgets/shared/cliCommandFactory.ts) gains oclif flag aliases, JSON parsing for array-of-object flags, file-input JSON parsing, `examples` wired into oclif `--help`, and Levenshtein-based 'did you mean' suggestions for mistyped flags (via fastest-levenshtein).
- New shared error envelope (src/gadgets/shared/errorEnvelope.ts) — every CLI failure emits `{"success":false,"error":{type,flag?,message,got?,expected?,hint?,example?}}` on stdout plus a one-line prose summary on stderr. All prior `this.error()` / flat `{success:false,error:"<string>"}` call sites migrated.
- Contracts widened: ParameterDefinition gains `cliAliases`, FileInputAlternative gains `parseAs`, ToolManifest parameters carry `items`, `aliases`, `example`.
- Manifest generator threads the new fields through.
- bin/cascade-tools.js wraps `run()` to swallow oclif ExitError cleanly so the envelope isn't obscured by Node's default stack dump.
Plan-1 ACs #1–#17 all delivered. 8438/8438 unit tests passing.
Test surface delta: 57 new unit tests across errorEnvelope.test.ts, shared-nativeToolPrompts.test.ts, and factories.test.ts. Seven legacy assertions encoding the pre-014 error surface updated in cli/cli-command-factory, cli/file-input-flags, cli/scm/create-pr-sidecar, cli/scm/create-pr-review-sidecar, backends/claude-code.
Plan 2 adopts the pattern on createPRReviewDef — zero shared-file edits — proving the declarative-metadata invariant.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(plan-014): lock plan 2 (createprreview-adopt)
* feat(cascade-tools): plan 014/2 createprreview-adopt + spec done
Applies the spec-014 declarative-metadata pattern to createPRReviewDef:
- --comment alias for --comments (the exact muscle-memory mistake from prod run 5d993b04-6e05-4ae1-b7de-8c274cf3496b).
- --comments-file <path> (and - for stdin) JSON-parsed escape hatch for long payloads that don't survive shell quoting.
- Two declarative fields on createPRReviewDef.parameters.comments.cliAliases + createPRReviewDef.cli.fileInputAlternatives. Zero edits to shared infrastructure (cliCommandFactory, manifestGenerator, nativeToolPrompts, errorEnvelope) — proves spec 014's single-entrypoint invariant.
Per-plan ACs #1, #2, #3, #5, #6, #7, #8, #9, #11, #12 auto-verified (unit tests + build + lint + typecheck). AC #4 (binary-level smoke) tagged [manual] because vitest fork-pool workers fail to capture stdout/stderr from spawned binaries that do top-level await import(); the six scenarios were verified manually against the built binary and the trace is recorded in the plan. AC #10 n/a — integration test path abandoned for the same reason.
All plans done. Spec 014 marked .done (docs/specs/014-*.md → .done). CHANGELOG Unreleased updated with a per-plan entry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7 tasks
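The structured error envelope and the "did you mean" suggestion from plan 014/1 might be combined as in the sketch below. The envelope field names follow the shape quoted in the commit message. Everything else is an assumption: the `unknown_flag` type string, the `unknownFlagEnvelope` helper, and the hand-written Levenshtein function, which stands in for the fastest-levenshtein package the commit says is actually used.

```typescript
// Envelope shape as quoted in the commit message.
interface ErrorEnvelope {
  success: false;
  error: {
    type: string; message: string; flag?: string; got?: string;
    expected?: string; hint?: string; example?: string;
  };
}

// Classic two-row Levenshtein edit distance (stand-in for fastest-levenshtein).
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: builds the envelope and adds a hint only when a known
// flag is within `maxDistance` edits of what the caller typed.
function unknownFlagEnvelope(got: string, knownFlags: string[], maxDistance = 2): ErrorEnvelope {
  let best: string | undefined;
  let bestDistance = maxDistance + 1;
  for (const flag of knownFlags) {
    const distance = levenshtein(got, flag);
    if (distance < bestDistance) {
      best = flag;
      bestDistance = distance;
    }
  }
  return {
    success: false,
    error: {
      type: "unknown_flag", // assumed type label, not confirmed by the source
      flag: got,
      message: `Unknown flag ${got}`,
      got,
      ...(best ? { hint: `Did you mean ${best}?` } : {}),
    },
  };
}
```

A caller would typically print `JSON.stringify(envelope)` on stdout and a one-line summary on stderr, matching the split the commit describes.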
zbigniewsobiecki added a commit that referenced this pull request on Apr 26, 2026
* docs(spec/plans): add spec 015 + plans for router dispatch failure recovery
Spec captures the silent black-hole bug class verified live on 2026-04-26
(ucho/MNG-350): a transient capacity miss or Docker error during worker
spawn turns a webhook-driven job into a permanently failed BullMQ entry
while stranding the work-item / agent-type locks for up to 30 minutes,
silently rejecting subsequent webhooks for the same work item.
Decomposed into two plans with safety-net-first sequencing: plan 1 hooks
the BullMQ failed event to release locks on every dispatch failure path;
plan 2 replaces the throw-on-capacity with a wait-for-slot semaphore,
adds bounded retry with exponential backoff, and a transient/terminal
error classifier.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(plan): lock 015/1 failed-event-lock-compensation
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(router): plan 015/1 done — release locks on dispatch failure
Closes the stranded-lock half of spec 015's bug class verified live in
prod on 2026-04-26 (ucho/MNG-350). When a webhook-driven job's dispatch
fails — capacity throw, Docker spawn error, or any future throw site —
the work-item lock, agent-type concurrency counter, and recently-dispatched
dedup mark established by the webhook → enqueue path are now released by
a compensator hooked to BullMQ's `worker.on('failed')` event.
What landed:
- `src/router/dispatch-compensator.ts` (new) — `releaseLocksForFailedJob`
wraps `extractProjectIdFromJob` / `extractWorkItemId` / `extractAgentType`
and calls into `clearWorkItemEnqueued` / `clearAgentTypeEnqueued` /
`clearRecentlyDispatched`. Never propagates errors; captures to Sentry
with `tags: { source: 'dispatch_compensator' }`.
- `src/router/agent-type-lock.ts` — exports new `clearRecentlyDispatched`
for the compensator. The existing `markRecentlyDispatched` semantics
are unchanged (60s TTL, NOT cleared on completion); this helper exists
solely so a permanently-failed dispatch doesn't keep deduping a fresh
webhook for ~60s while the user retries.
- `src/router/bullmq-workers.ts` — extends the existing
`worker.on('failed')` handler to invoke `releaseLocksForFailedJob`
alongside the existing logger + Sentry calls. Wraps the call in a
defensive `.catch` so a future regression in the compensator can't
poison the worker.
- `src/router/lock-state-classifier.ts` (new) — `classifyLockState` returns
`'awaiting-slot'` when an active worker or queued/waiting job matches
the trio, `'wedged'` when neither correlation matches. Defaults to
`'awaiting-slot'` on classifier error so a Redis blip doesn't mis-emit
the wedged canary.
- `src/router/active-workers.ts` — `getActiveWorkers()` now exposes
`(projectId, workItemId, agentType)` so the classifier can correlate.
Backwards-compatible (existing callers work unchanged; new fields are
additive optional).
- `src/router/webhook-processor.ts` — Step 8 (work-item lock check) now
splits the decision-reason vocabulary into three states:
* `Job queued: ...` (success path)
* `Awaiting worker slot: ...` (lock held + dispatch in flight; healthy)
* `Work item locked (no active dispatch): ...` (wedged-lock canary)
The wedged branch additionally fires `captureException` with
`tags: { source: 'wedged_lock_canary' }` so any regression in
compensation is loud in production.
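The lock-state classification described in the list above can be sketched as follows. This is a simplified, synchronous illustration: the `Trio` type and the injected `loadActive` / `loadQueued` lookups are hypothetical, and the real `classifyLockState` presumably queries Redis and BullMQ asynchronously. The two result values and the default-on-error rule are taken from the commit message.

```typescript
// Hypothetical correlation key for a dispatch.
type Trio = { projectId: string; workItemId: string; agentType: string };
type LockState = "awaiting-slot" | "wedged";

function sameTrio(a: Trio, b: Trio): boolean {
  return a.projectId === b.projectId
    && a.workItemId === b.workItemId
    && a.agentType === b.agentType;
}

// Lookups are injected so the decision logic can be tested in isolation.
function classifyLockState(
  target: Trio,
  loadActive: () => Trio[],
  loadQueued: () => Trio[],
): LockState {
  try {
    const inFlight = [...loadActive(), ...loadQueued()]
      .some((candidate) => sameTrio(candidate, target));
    // A held lock with no matching worker or queued job is the wedged case.
    return inFlight ? "awaiting-slot" : "wedged";
  } catch {
    // A lookup failure must not raise the wedged canary by mistake.
    return "awaiting-slot";
  }
}
```

Defaulting to the healthy state on error trades a possible missed alert for fewer false alarms, which matches the stated goal of keeping the canary meaningful.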
What this does NOT change (intentional, all in plan 015/2):
- `guardedSpawn` still throws on capacity (BullMQ marks the job failed,
the compensator now releases the locks, but the job itself is still
lost). Plan 2 replaces the throw with a wait-for-slot semaphore.
- Both queues still default to `attempts: 1`. Plan 2 raises this with
exponential backoff and adds a transient/terminal error classifier.
- CLAUDE.md is intentionally not updated by this plan — the unified
passage describing both halves of the new contract lands in plan 015/2.
Tests:
- 5 new unit tests in `dispatch-compensator.test.ts`
- 3 new unit tests in `agent-type-lock.test.ts` for `clearRecentlyDispatched`
- 4 new unit tests in `bullmq-workers.test.ts` for the failed-event seam
- 5 new unit tests in `lock-state-classifier.test.ts`
- 2 new unit tests in `active-workers.test.ts` for the extended shape
- 4 new unit tests in `webhook-processor.test.ts` for the three-way taxonomy
- 3 new module-integration tests in
`tests/integration/router/dispatch-failure-compensation.test.ts` exercise
the real lock modules + real bullmq-workers.ts failed-event handler +
real compensator end-to-end (only BullMQ's Worker constructor + the
worker-env extractors are mocked).
Full suite: 8515 passed / 23 skipped / 0 failed. Lint + typecheck clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
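The compensator contract from plan 015/1 (release all three locks, never propagate errors, report failures) can be illustrated with a small sketch. The function names mirror those in the commit message, but their argument lists, the `LockStore` interface, and the `report` callback are assumptions made for illustration; in the real code the report would go to Sentry.

```typescript
// Identifiers extracted from a failed job; any of them may be missing.
type FailedJobIds = { projectId?: string; workItemId?: string; agentType?: string };

// Hypothetical interface over the three lock modules named in the commit.
interface LockStore {
  clearWorkItemEnqueued(projectId: string, workItemId: string): void;
  clearAgentTypeEnqueued(projectId: string, agentType: string): void;
  clearRecentlyDispatched(projectId: string, workItemId: string, agentType: string): void;
}

function releaseLocksForFailedJob(
  ids: FailedJobIds,
  locks: LockStore,
  report: (err: unknown) => void,
): void {
  try {
    const { projectId, workItemId, agentType } = ids;
    if (!projectId) return; // nothing can be correlated without a project
    if (workItemId) locks.clearWorkItemEnqueued(projectId, workItemId);
    if (agentType) locks.clearAgentTypeEnqueued(projectId, agentType);
    if (workItemId && agentType) {
      locks.clearRecentlyDispatched(projectId, workItemId, agentType);
    }
  } catch (err) {
    // Never rethrow: a failing compensator must not poison the worker.
    report(err);
  }
}
```

Swallowing the error here is deliberate, since this code runs inside a failure handler where a second throw would hide the original dispatch failure.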
* chore(plan): lock 015/2 wait-for-slot-and-retry-classifier
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(plan): mark 015/2 status: wip
* feat(router): plan 015/2 done — wait-for-slot + retry budget + classifier
Closes the lost-job half of spec 015's bug class. Combined with plan 015/1,
the silent black-hole failure mode verified live in prod on 2026-04-26
(ucho/MNG-350) is now fully closed.
What landed:
- `src/router/slot-waiter.ts` (new) — semaphore-style primitive:
`acquireSlot({ timeoutMs })` resolves immediately when capacity is
below `routerConfig.maxWorkers`, otherwise queues a FIFO waiter with
a bounded timeout that rejects with `code: 'SLOT_WAIT_TIMEOUT'`.
`slotReleased()` pops the head waiter; `clearAllWaiters()` rejects
every pending waiter with `code: 'SHUTDOWN'` on router stop.
- `src/router/dispatch-error-classifier.ts` (new) — classifies thrown
errors into `'transient'` (Docker socket Node codes, HTTP 429/409,
SLOT_WAIT_TIMEOUT, anything unknown — default-to-retry) vs
`'terminal'` (TypeError, ZodError, image-not-found-after-fallback).
- `src/router/worker-manager.ts` — `guardedSpawn` rewritten:
`await acquireSlot(...)` replaces the synchronous capacity throw;
on spawn error, terminal errors are wrapped in BullMQ's
`UnrecoverableError` so retries skip; transient errors propagate
unchanged so BullMQ retries via attempts/backoff.
- `src/router/active-workers.ts` — `cleanupWorker` now calls
`slotReleased()` exactly once per cleanup, including on the crash
path. The existing `if (worker)` guard ensures idempotence.
- `src/router/config.ts` — new `slotWaitTimeoutMs` field (default
5min, configurable via `SLOT_WAIT_TIMEOUT_MS`).
- `src/router/queue.ts` and `src/queue/client.ts` — both queues now
default to `attempts: 4` with `backoff: { type: 'exponential', delay: 5000 }`
(~75s total before exhaustion). Terminal errors bypass via
`UnrecoverableError`.
- `src/router/container-manager.ts` — exports the existing
`isImageNotFoundError` predicate so the classifier can reuse it.
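The wait-for-slot primitive in the list above can be illustrated with a minimal sketch. This is not the code in `src/router/slot-waiter.ts`: the class wrapper, the internal `active` counter, and the error helper are assumptions made for illustration. Only the method names and the `SLOT_WAIT_TIMEOUT` / `SHUTDOWN` codes come from the commit message.

```typescript
// Hypothetical sketch of a FIFO slot waiter; internals are assumed, not taken from the repo.
type SlotError = Error & { code: string };

interface Waiter {
  resolve: () => void;
  reject: (err: SlotError) => void;
  timer: ReturnType<typeof setTimeout>;
}

function slotError(code: string, message: string): SlotError {
  return Object.assign(new Error(message), { code });
}

class SlotWaiter {
  private active = 0; // assumed: the sketch counts slots itself
  private readonly waiters: Waiter[] = [];
  private readonly maxWorkers: number;

  constructor(maxWorkers: number) {
    this.maxWorkers = maxWorkers;
  }

  // Resolves immediately below capacity, otherwise queues a FIFO waiter with a bounded timeout.
  acquireSlot(opts: { timeoutMs: number }): Promise<void> {
    if (this.active < this.maxWorkers) {
      this.active += 1;
      return Promise.resolve();
    }
    return new Promise<void>((resolve, reject) => {
      const waiter: Waiter = {
        resolve,
        reject,
        timer: setTimeout(() => {
          const i = this.waiters.indexOf(waiter);
          if (i !== -1) this.waiters.splice(i, 1);
          reject(slotError('SLOT_WAIT_TIMEOUT', `no slot within ${opts.timeoutMs}ms`));
        }, opts.timeoutMs),
      };
      this.waiters.push(waiter);
    });
  }

  // Hands the freed slot to the head waiter, or lowers the active count if nobody waits.
  slotReleased(): void {
    const head = this.waiters.shift();
    if (head) {
      clearTimeout(head.timer);
      head.resolve();
      return;
    }
    if (this.active > 0) this.active -= 1;
  }

  // Rejects every pending waiter, e.g. on router stop.
  clearAllWaiters(): void {
    for (const w of this.waiters.splice(0)) {
      clearTimeout(w.timer);
      w.reject(slotError('SHUTDOWN', 'router stopping'));
    }
  }
}
```

Handing the slot straight to the head waiter keeps the active count unchanged, so a late caller cannot overtake a queued one between release and wake-up.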
Test contract change (spec AC #9):
The previous `tests/unit/router/worker-manager.test.ts:179` assertion
`'processFn throws when at capacity'` is REPLACED (not deleted) with
`'processFn awaits a slot when at capacity, then dispatches when one
frees'`. The throw-on-capacity contract is gone forever.
Tests:
- 7 new unit tests in `slot-waiter.test.ts` (FIFO, timeout, no-op,
shutdown rejection)
- 11 new unit tests in `dispatch-error-classifier.test.ts` covering
every transient/terminal class
- 4 new unit tests in `worker-manager.test.ts` (replaced original
capacity-throw test + 3 for retry classification)
- 3 new unit tests in `active-workers.test.ts` for slotReleased
integration
- 5 new module-integration tests in `dispatch-retry.test.ts` exercise
REAL guardedSpawn + REAL slot-waiter + REAL dispatch-error-classifier
against both queues, mocking only spawnWorker + BullMQ Worker
constructor.
Plan 1's 3 module-integration tests continue to pass alongside plan 2's 5.
Full unit suite: 8539 passed / 23 skipped / 0 failed.
CLAUDE.md updated with a new "Dispatch failure semantics" section
documenting the unified contract (capacity wait, retry budget,
classifier, three-way decision-reason taxonomy from plan 1, wedged-lock
canary). File now 182 lines, under the 200-line cap.
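As a rough illustration of the classifier part of that contract, the sketch below sorts errors into the two classes named in this commit. The specific Node socket codes, the `looksLikeImageNotFound` check, and the `UnrecoverableStandIn` class are assumptions: the real module reuses `isImageNotFoundError` and BullMQ's `UnrecoverableError`, neither of which is reproduced here.

```typescript
// Hypothetical sketch; the code list and predicates are assumed for illustration.
type DispatchErrorClass = 'transient' | 'terminal';

// Assumed examples of Docker-socket-style Node error codes.
const TRANSIENT_NODE_CODES = new Set(['ECONNREFUSED', 'ECONNRESET', 'ETIMEDOUT', 'EPIPE']);

// Local stand-in so the sketch stays self-contained; not BullMQ's class.
class UnrecoverableStandIn extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'UnrecoverableError';
  }
}

// Assumed stand-in for the exported image-not-found predicate.
function looksLikeImageNotFound(err: unknown): boolean {
  return err instanceof Error && /no such image/i.test(err.message);
}

function classifyDispatchError(err: unknown): DispatchErrorClass {
  // Programming and validation errors will not succeed on retry.
  if (err instanceof TypeError) return 'terminal';
  if (err instanceof Error && err.name === 'ZodError') return 'terminal';
  if (looksLikeImageNotFound(err)) return 'terminal';

  const fields = (typeof err === 'object' && err !== null ? err : {}) as {
    code?: unknown;
    statusCode?: unknown;
  };
  if (typeof fields.code === 'string') {
    if (TRANSIENT_NODE_CODES.has(fields.code) || fields.code === 'SLOT_WAIT_TIMEOUT') {
      return 'transient';
    }
  }
  if (fields.statusCode === 429 || fields.statusCode === 409) return 'transient';

  // Anything unknown defaults to retry.
  return 'transient';
}

// Terminal errors are wrapped so the queue skips retries; transient ones pass through unchanged.
function toQueueError(err: unknown): unknown {
  if (classifyDispatchError(err) === 'terminal') {
    return new UnrecoverableStandIn(err instanceof Error ? err.message : String(err));
  }
  return err;
}
```

The explicit transient branches do not change the result, since unknown errors also retry; they are kept to make the documented classes visible.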
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(spec): 015 done — router job dispatch failure recovery, all plans complete
Closes the silent black-hole bug class verified live on 2026-04-26
(ucho/MNG-350). Plan 1 added failed-event lock compensation +
three-way decision-reason taxonomy; plan 2 replaced the throw-on-capacity
with wait-for-slot, added bounded retry with exponential backoff, and
introduced a transient/terminal error classifier.
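To make the retry budget concrete, here is a small arithmetic sketch. It assumes the wait doubles on each retry, starting from the configured 5000 ms `delay`; the function name is hypothetical. How many of these waits a job with `attempts: 4` actually sees depends on how the queue library counts attempts, which the sketch does not settle. The ~75s figure quoted in plan 015/2 matches the cumulative total after four waits.

```typescript
// Cumulative wait after each retry, assuming delay = base * 2^(retryIndex).
function cumulativeBackoffMs(baseDelayMs: number, waits: number): number[] {
  const totals: number[] = [];
  let total = 0;
  for (let i = 0; i < waits; i++) {
    total += baseDelayMs * Math.pow(2, i);
    totals.push(total);
  }
  return totals;
}

console.log(cumulativeBackoffMs(5000, 4)); // [ 5000, 15000, 35000, 75000 ]
```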
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
zbigniewsobiecki added a commit that referenced this pull request, Apr 26, 2026
* fix(triggers): audit & fix PM feedback inconsistencies across respond-to-* agents (#1201) * fix(triggers): audit & fix PM feedback inconsistencies across respond-to-* agents * fix(triggers): use case-insensitive JIRA status comparison in isInPlanningStatus Match the established pattern from status-changed.ts and label-added.ts which both use .toLowerCase() for JIRA status comparisons, since status names are user-configurable and the API does not guarantee consistent casing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Cascade Bot <bot@cascade.dev> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(linear): populate inlineMedia from descriptions/comments and add downloadAttachment (#1202) Co-authored-by: Cascade Bot <bot@cascade.dev> * fix(claude-code): pin pathToClaudeCodeExecutable so SDK skips broken native-binary probe (#1206) The agent-harness SDK bump in #1197 (claude-agent-sdk 0.2.91 → 0.2.119) broke every review run on cascade-prod with: ReferenceError: Claude Code native binary not found at /app/node_modules/@anthropic-ai/claude-agent-sdk-linux-x64-musl/claude The new SDK probes its own platform-specific optional-dependency subpackages for a bundled `claude` binary. Two failure modes hit at once: 1. Cascade installs `@anthropic-ai/claude-code@2.1.119` globally at /usr/local/bin/claude — the SDK never looks there. 2. The SDK probes the `-musl` variant first regardless of host libc and errors on ENOENT instead of falling through to the glibc variant. Pass an explicit `pathToClaudeCodeExecutable` to short-circuit the probe. The resolver checks (in order): - $CLAUDE_CODE_EXECUTABLE_PATH env override (local-dev escape hatch) - `which claude` in $PATH - /usr/local/bin/claude (Docker default from Dockerfile.worker) Two TDD tests pin the option onto query() and prove the env override wins. No Dockerfile change needed; the existing global install at /usr/local/bin/claude becomes the resolver's runtime target. 
Confirmed broken on ucho PR #72 (cascade-prod review agent crash). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * spec 015: router job dispatch failure recovery (#1203) * docs(spec/plans): add spec 015 + plans for router dispatch failure recovery Spec captures the silent black-hole bug class verified live on 2026-04-26 (ucho/MNG-350): a transient capacity miss or Docker error during worker spawn turns a webhook-driven job into a permanently failed BullMQ entry while stranding the work-item / agent-type locks for up to 30 minutes, silently rejecting subsequent webhooks for the same work item. Decomposed into two plans with safety-net-first sequencing: plan 1 hooks the BullMQ failed event to release locks on every dispatch failure path; plan 2 replaces the throw-on-capacity with a wait-for-slot semaphore, adds bounded retry with exponential backoff, and a transient/terminal error classifier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(plan): lock 015/1 failed-event-lock-compensation Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(router): plan 015/1 done — release locks on dispatch failure Closes the stranded-lock half of spec 015's bug class verified live in prod on 2026-04-26 (ucho/MNG-350). When a webhook-driven job's dispatch fails — capacity throw, Docker spawn error, or any future throw site — the work-item lock, agent-type concurrency counter, and recently-dispatched dedup mark established by the webhook → enqueue path are now released by a compensator hooked to BullMQ's `worker.on('failed')` event. What landed: - `src/router/dispatch-compensator.ts` (new) — `releaseLocksForFailedJob` wraps `extractProjectIdFromJob` / `extractWorkItemId` / `extractAgentType` and calls into `clearWorkItemEnqueued` / `clearAgentTypeEnqueued` / `clearRecentlyDispatched`. Never propagates errors; captures to Sentry with `tags: { source: 'dispatch_compensator' }`. 
- `src/router/agent-type-lock.ts` — exports new `clearRecentlyDispatched` for the compensator. The existing `markRecentlyDispatched` semantics are unchanged (60s TTL, NOT cleared on completion); this helper exists solely so a permanently-failed dispatch doesn't keep deduping a fresh webhook for ~60s while the user retries. - `src/router/bullmq-workers.ts` — extends the existing `worker.on('failed')` handler to invoke `releaseLocksForFailedJob` alongside the existing logger + Sentry calls. Wraps the call in a defensive `.catch` so a future regression in the compensator can't poison the worker. - `src/router/lock-state-classifier.ts` (new) — `classifyLockState` returns `'awaiting-slot'` when an active worker or queued/waiting job matches the trio, `'wedged'` when neither correlation matches. Defaults to `'awaiting-slot'` on classifier error so a Redis blip doesn't mis-emit the wedged canary. - `src/router/active-workers.ts` — `getActiveWorkers()` now exposes `(projectId, workItemId, agentType)` so the classifier can correlate. Backwards-compatible (existing callers work unchanged; new fields are additive optional). - `src/router/webhook-processor.ts` — Step 8 (work-item lock check) now splits the decision-reason vocabulary into three states: * `Job queued: ...` (success path) * `Awaiting worker slot: ...` (lock held + dispatch in flight; healthy) * `Work item locked (no active dispatch): ...` (wedged-lock canary) The wedged branch additionally fires `captureException` with `tags: { source: 'wedged_lock_canary' }` so any regression in compensation is loud in production. What this does NOT change (intentional, all in plan 015/2): - `guardedSpawn` still throws on capacity (BullMQ marks the job failed, the compensator now releases the locks, but the job itself is still lost). Plan 2 replaces the throw with a wait-for-slot semaphore. - Both queues still default to `attempts: 1`. Plan 2 raises this with exponential backoff and adds a transient/terminal error classifier. 
- CLAUDE.md is intentionally not updated by this plan — the unified passage describing both halves of the new contract lands in plan 015/2. Tests: - 5 new unit tests in `dispatch-compensator.test.ts` - 3 new unit tests in `agent-type-lock.test.ts` for `clearRecentlyDispatched` - 4 new unit tests in `bullmq-workers.test.ts` for the failed-event seam - 5 new unit tests in `lock-state-classifier.test.ts` - 2 new unit tests in `active-workers.test.ts` for the extended shape - 4 new unit tests in `webhook-processor.test.ts` for the three-way taxonomy - 3 new module-integration tests in `tests/integration/router/dispatch-failure-compensation.test.ts` exercise the real lock modules + real bullmq-workers.ts failed-event handler + real compensator end-to-end (only BullMQ's Worker constructor + the worker-env extractors are mocked). Full suite: 8515 passed / 23 skipped / 0 failed. Lint + typecheck clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(plan): lock 015/2 wait-for-slot-and-retry-classifier Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(plan): mark 015/2 status: wip * feat(router): plan 015/2 done — wait-for-slot + retry budget + classifier Closes the lost-job half of spec 015's bug class. Combined with plan 015/1, the silent black-hole failure mode verified live in prod on 2026-04-26 (ucho/MNG-350) is now fully closed. What landed: - `src/router/slot-waiter.ts` (new) — semaphore-style primitive: `acquireSlot({ timeoutMs })` resolves immediately when capacity is below `routerConfig.maxWorkers`, otherwise queues a FIFO waiter with a bounded timeout that rejects with `code: 'SLOT_WAIT_TIMEOUT'`. `slotReleased()` pops the head waiter; `clearAllWaiters()` rejects every pending waiter with `code: 'SHUTDOWN'` on router stop. 
- `src/router/dispatch-error-classifier.ts` (new) — classifies thrown errors into `'transient'` (Docker socket Node codes, HTTP 429/409, SLOT_WAIT_TIMEOUT, anything unknown — default-to-retry) vs `'terminal'` (TypeError, ZodError, image-not-found-after-fallback). - `src/router/worker-manager.ts` — `guardedSpawn` rewritten: `await acquireSlot(...)` replaces the synchronous capacity throw; on spawn error, terminal errors are wrapped in BullMQ's `UnrecoverableError` so retries skip; transient errors propagate unchanged so BullMQ retries via attempts/backoff. - `src/router/active-workers.ts` — `cleanupWorker` now calls `slotReleased()` exactly once per cleanup, including on the crash path. The existing `if (worker)` guard ensures idempotence. - `src/router/config.ts` — new `slotWaitTimeoutMs` field (default 5min, configurable via `SLOT_WAIT_TIMEOUT_MS`). - `src/router/queue.ts` and `src/queue/client.ts` — both queues now default to `attempts: 4` with `backoff: { type: 'exponential', delay: 5000 }` (~75s total before exhaustion). Terminal errors bypass via `UnrecoverableError`. - `src/router/container-manager.ts` — exports the existing `isImageNotFoundError` predicate so the classifier can reuse it. Test contract change (spec AC #9): The previous `tests/unit/router/worker-manager.test.ts:179` assertion `'processFn throws when at capacity'` is REPLACED (not deleted) with `'processFn awaits a slot when at capacity, then dispatches when one frees'`. The throw-on-capacity contract is gone forever. 
Tests: - 7 new unit tests in `slot-waiter.test.ts` (FIFO, timeout, no-op, shutdown rejection) - 11 new unit tests in `dispatch-error-classifier.test.ts` covering every transient/terminal class - 4 new unit tests in `worker-manager.test.ts` (replaced original capacity-throw test + 3 for retry classification) - 3 new unit tests in `active-workers.test.ts` for slotReleased integration - 5 new module-integration tests in `dispatch-retry.test.ts` exercise REAL guardedSpawn + REAL slot-waiter + REAL dispatch-error-classifier against both queues, mocking only spawnWorker + BullMQ Worker constructor. Plan 1's 3 module-integration tests continue to pass alongside plan 2's 5. Full unit suite: 8539 passed / 23 skipped / 0 failed. CLAUDE.md updated with a new "Dispatch failure semantics" section documenting the unified contract (capacity wait, retry budget, classifier, three-way decision-reason taxonomy from plan 1, wedged-lock canary). File now 182 lines, under the 200-line cap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(spec): 015 done — router job dispatch failure recovery, all plans complete Closes the silent black-hole bug class verified live on 2026-04-26 (ucho/MNG-350). Plan 1 added failed-event lock compensation + three-way decision-reason taxonomy; plan 2 replaced the throw-on-capacity with wait-for-slot, added bounded retry with exponential backoff, and introduced a transient/terminal error classifier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(deps): bump postcss from 8.5.8 to 8.5.12 (#1204) Bumps [postcss](https://github.com/postcss/postcss) from 8.5.8 to 8.5.12. 
- [Release notes](https://github.com/postcss/postcss/releases) - [Changelog](https://github.com/postcss/postcss/blob/main/CHANGELOG.md) - [Commits](postcss/postcss@8.5.8...8.5.12) --- updated-dependencies: - dependency-name: postcss dependency-version: 8.5.12 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * chore(deps): bump uuid, bullmq and dockerode (#1192) Removes [uuid](https://github.com/uuidjs/uuid). It's no longer used after updating ancestor dependencies [uuid](https://github.com/uuidjs/uuid), [bullmq](https://github.com/taskforcesh/bullmq) and [dockerode](https://github.com/apocas/dockerode). These dependencies need to be updated together. Removes `uuid` Updates `bullmq` from 5.72.0 to 5.76.2 - [Release notes](https://github.com/taskforcesh/bullmq/releases) - [Commits](taskforcesh/bullmq@v5.72.0...v5.76.2) Updates `dockerode` from 4.0.10 to 5.0.0 - [Release notes](https://github.com/apocas/dockerode/releases) - [Commits](apocas/dockerode@v4.0.10...v5.0.0) --- updated-dependencies: - dependency-name: uuid dependency-version: dependency-type: indirect - dependency-name: bullmq dependency-version: 5.76.2 dependency-type: direct:production - dependency-name: dockerode dependency-version: 5.0.0 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: aaight <aaight42@gmail.com> Co-authored-by: Cascade Bot <bot@cascade.dev> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
zbigniewsobiecki added a commit that referenced this pull request, Apr 26, 2026
* spec 016: PM image delivery reliability (#1209) * docs(spec/plans): add spec 016 + plans for PM image delivery reliability Closes the silent screenshot-drop bug class verified live on 2026-04-26 (ucho/MNG-357): Linear's user-pasted-image URLs (uploads.linear.app/<uuid> with no file extension) were dropped at the pre-download MIME filter because mimeTypeFromUrl returned 'application/octet-stream' and filterImageMedia excluded them.
This affected all engines on the disk-write path, regardless of PR #948's Claude-Code SDK delivery fix. Three plans, safety-net-first sequencing matching spec 015: - Plan 1 (boot-path-mime-fix-and-diagnostic-log): defers MIME authority to download response Content-Type via image/* wildcard sentinel; adds the grep-stable diagnostic log line at extract time. Independently fixes MNG-357. - Plan 2 (runtime-gadget-image-delivery): makes the runtime cascade-tools pm read-work-item gadget actually download + write images to disk with file paths returned in text. Closes the mid-run pickup gap. Depends on Plan 1's shared download-and-prepare helper. - Plan 3 (linear-fixture-and-extraction-coverage): captures a Linear GraphQL Issue payload fixture for an issue with a pasted screenshot; pins extraction with a regression test that fails loudly if Linear ever changes the payload shape. Mostly tests + docs. 9 ACs, 0 manual-only. CLAUDE.md not updated (already covered by spec 015's silent-failure → diagnostic-line pattern). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(plan): lock 016/1 boot-path-mime-fix-and-diagnostic-log * feat(pm): plan 016/1 done — boot-path image MIME fix + diagnostic log Closes the silent screenshot-drop bug class verified live on 2026-04-26 (ucho/MNG-357). Linear's extension-less pasted-image URLs (uploads.linear.app/<uuid>) now survive the pre-download MIME filter via an image/* wildcard sentinel. The download response's Content-Type header is the authoritative MIME — wildcard is resolved before bytes are written. What landed: - src/pm/media.ts — new IMAGE_HOST_ALLOWLIST (currently 'uploads.linear.app'); mimeTypeFromUrl returns 'image/*' for extension-less URLs from allowlisted hosts; isImageMimeType accepts the wildcard. - src/pm/download-and-prepare.ts (new) — shared helper for the per-provider download dispatch loop (jira/linear/trello). Returns { images, failures }. Spec 016/2's runtime gadget will import this. 
- src/agents/definitions/contextSteps.ts — fetchWorkItemStep refactored to use the shared helper; emits the new grep-stable diagnostic line '[image-pipeline] work-item-fetch summary' with stable fields: { provider, workItemId, urlsDetected, urlsAfterFilter, urlsDownloaded, urlsFailed, urlsByMimeType }. Tests: - 6 new unit tests in tests/unit/pm/media.test.ts (wildcard sentinel, Linear extension-less, regression for extensioned + non-PM URLs) - 7 new unit tests in tests/unit/pm/download-and-prepare.test.ts - 3 new diagnostic-log tests in contextSteps.test.ts; existing log message expectations updated to the new helper-prefix - 3 module-integration tests in tests/integration/pm/image-pipeline.test.ts pinning the MNG-357 reproduction end-to-end with real mimeTypeFromUrl + filterImageMedia + extractMarkdownImages PR #948's Claude-Code initial-input ImageBlockParam path is unchanged; existing regression test (claude-code.test.ts:939 'logs image injection and strips images before buildTaskPrompt') confirms. Docs: - CHANGELOG.md entry under Unreleased. - src/integrations/README.md gains a new 'Image delivery contract' section documenting the shared resolution path, allowlist semantics, diagnostic log line schema, and the rule that providers shouldn't write their own MIME-detection. Full unit suite: 8521 passed / 23 skipped / 0 failed. Lint + typecheck clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(plan): lock 016/2 runtime-gadget-image-delivery * feat(pm): plan 016/2 done — runtime gadget delivers images on disk Closes the mid-run image pickup gap from spec 016. The runtime gadget `cascade-tools pm read-work-item` now downloads any image media and writes it to .cascade/context/images/work-item-<id>-img-<index>.<ext>, returning text whose new "Local Image Files" section lists actual file paths the agent's file-read tool can consume. 
What landed:
- src/gadgets/pm/core/writeRuntimeImages.ts (new): writes ContextImage arrays to .cascade/context/images/ with stable naming convention (work-item-<id>-img-<i>.<ext>); extension derived from resolved MIME; falls back to .bin + warn log for unresolved image/* sentinel.
- src/gadgets/pm/core/readWorkItem.ts: readWorkItem now calls downloadAndPrepareImages (Plan 1's helper) + writeRuntimeImages (this plan), then mutates the returned text to include the local file paths via formatRuntimeImagePaths. Same diagnostic log line '[image-pipeline] work-item-fetch summary' as the boot path. Failed downloads surface in a "Failed Image Downloads" subsection.

Tests:
- 8 new unit tests in tests/unit/gadgets/pm/core/writeRuntimeImages.test.ts
- 5 new unit tests in tests/unit/gadgets/pm/core/readWorkItem.test.ts (spec 016/2 sub-describe)
- 4 new module-integration tests in tests/integration/gadgets/runtime-image-delivery.test.ts pinning the mid-run pickup contract end-to-end.

CHANGELOG.md entry added. Full unit suite (single-fork): 8534 passed / 23 skipped / 0 failed. Lint + typecheck clean. Three PM manifest test suites occasionally time out under parallel load on this machine; verified to pass in isolation, so not a code regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): lock 016/3 linear-fixture-and-extraction-coverage

* test(pm): plan 016/3 done - Linear fixture + extraction-coverage regression

Closes spec 016 with the regression net for the contract Plans 1+2 established. If Linear ever changes its Issue payload shape in a way that loses inline images, the extraction-coverage test fails loudly with a specific URL-missing message.
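The naming convention plan 016/2 describes (work-item-<id>-img-<i>.<ext>, with a .bin fallback for an unresolved sentinel) could look roughly like the sketch below. The function name runtimeImagePath, the MIME-to-extension table, and the zero-based index are assumptions for illustration; the actual writeRuntimeImages implementation also writes the files and emits a warn log, which this sketch leaves out.

```typescript
// Hypothetical sketch of the runtime image naming convention.
const IMAGE_DIR = ".cascade/context/images";

// Assumed MIME-to-extension table; the real mapping may differ.
const MIME_EXTENSION: Record<string, string> = {
  "image/png": "png",
  "image/jpeg": "jpg",
  "image/gif": "gif",
  "image/webp": "webp",
  "image/svg+xml": "svg",
};

interface RuntimeImagePath {
  path: string;
  // false when the MIME type was never resolved to a known concrete type;
  // a caller would log a warning in that case.
  resolved: boolean;
}

function runtimeImagePath(workItemId: string, index: number, mimeType: string): RuntimeImagePath {
  const ext: string | undefined = MIME_EXTENSION[mimeType.toLowerCase()];
  const resolved = ext !== undefined;
  const suffix = resolved ? ext : "bin"; // .bin fallback for image/* or unknown types
  return {
    path: `${IMAGE_DIR}/work-item-${workItemId}-img-${index}.${suffix}`,
    resolved,
  };
}
```

Keeping the path a pure function of the work item id, the index, and the resolved MIME type is what makes the file names stable across runs.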
What landed:
- tests/fixtures/linear-issue-with-screenshot.json (new): reconstructed Linear GraphQL Issue payload covering: extension-less uploads.linear.app URL in description, extensioned Linear URL with alt text, external URL with image/svg+xml MIME, non-image markdown link (must NOT be picked up), one comment with a pasted screenshot, one comment without, and three formal Attachment records (Slack/GitHub/Sentry link previews).
- tests/unit/pm/linear/extraction-coverage.test.ts (new): 9 tests: description coverage with explicit expected-URL list, image/* sentinel for extension-less, concrete MIME for extensioned, image/svg+xml for external SVG, non-image link exclusion, comment coverage, comment source field, attachment-NOT-leaked rule, meta-test of regression net.
- src/integrations/README.md: new "Linear: GraphQL surface for inline images" subsection documenting the conclusion: Issue.description markdown is canonical for inline-pasted screenshots; Issue.attachments is for formal Attachment records (link previews) and is the wrong surface for inline images. Links to the fixture and the test.

No production code change: Plan 1's mimeTypeFromUrl + extractMarkdownImages already cover the cases. This plan ships the regression armor.

CHANGELOG.md entry added. Lint + typecheck clean. 9/9 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(spec): 016 done - pm-image-delivery-reliability, all plans complete

Closes the silent screenshot-drop bug class verified live on 2026-04-26 (ucho/MNG-357). Plan 1 added the Linear-extension-less MIME wildcard sentinel + diagnostic log line; plan 2 made the runtime cascade-tools pm read-work-item gadget actually deliver images on disk; plan 3 captured a Linear GraphQL fixture and pinned extraction coverage with a regression test. CLAUDE.md untouched by this spec; it is already covered by spec 015's broader silent-failure → diagnostic-line pattern.
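The diagnostic line '[image-pipeline] work-item-fetch summary' is described only by its field names, so the sketch below is one plausible way to assemble such a payload. The field names are taken from the commit message; treating the url* fields as counts, counting urlsByMimeType over the filtered URLs, and the DownloadOutcome type and buildFetchSummary function are all assumptions.

```typescript
// Hypothetical sketch of the summary payload. Field semantics are assumed.
const SUMMARY_MESSAGE = "[image-pipeline] work-item-fetch summary";

interface DownloadOutcome {
  url: string;
  mimeType: string; // MIME type (or image/* sentinel) that passed the filter
  ok: boolean;      // whether the download succeeded
}

interface WorkItemFetchSummary {
  provider: string;
  workItemId: string;
  urlsDetected: number;
  urlsAfterFilter: number;
  urlsDownloaded: number;
  urlsFailed: number;
  urlsByMimeType: Record<string, number>;
}

function buildFetchSummary(
  provider: string,
  workItemId: string,
  urlsDetected: number,
  outcomes: DownloadOutcome[],
): WorkItemFetchSummary {
  const urlsByMimeType: Record<string, number> = {};
  for (const outcome of outcomes) {
    urlsByMimeType[outcome.mimeType] = (urlsByMimeType[outcome.mimeType] ?? 0) + 1;
  }
  const urlsDownloaded = outcomes.filter((o) => o.ok).length;
  return {
    provider,
    workItemId,
    urlsDetected,
    urlsAfterFilter: outcomes.length,
    urlsDownloaded,
    urlsFailed: outcomes.length - urlsDownloaded,
    urlsByMimeType,
  };
}

// One grep-stable line: fixed message prefix followed by the JSON payload.
function formatSummaryLine(summary: WorkItemFetchSummary): string {
  return `${SUMMARY_MESSAGE} ${JSON.stringify(summary)}`;
}
```

With a payload like this, a gap between urlsDetected and urlsAfterFilter points at the MIME filter, while a non-zero urlsFailed points at the download step.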
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address code review concerns

* test(image-pipeline): supply urlsDetected in readWorkItemWithMedia mocks

The diagnostic-line assertion expected urlsDetected on the log payload, but the mocked readWorkItemWithMedia return values omitted it, so the field arrived as undefined and the toHaveBeenCalledWith match failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cascade Bot <bot@cascade.dev>

* fix(linear): drop comment-mention planning-state gate that prod payload never satisfies (#1210)

PR #1201 added a `currentStateId !== planningStateId` gate to the Linear comment @mention trigger that read `data.issue.stateId` from the webhook payload. Linear's Comment webhook does not ship `stateId` on the nested issue (verified across four prod payloads on 2026-04-26: 8cd0108a, b93e4925, 6548cd14, 3d95b210). The gate therefore always evaluated to true and silently dropped every legitimate bot @mention, including the one on MNG-346 that motivated this fix.

The agent (respond-to-planning-comment) is now responsible for any planning-only behavior; the trigger no longer gates on state and avoids an extra Linear GraphQL round-trip per comment.

Also corrects `LinearWebhookCommentTriggerData.issue` to match what Linear actually ships (six keys, no `stateId`, optional `team`); the old type lied and PR #1201 trusted it. Tests pin a real prod-shape Comment payload as a regression.

JIRA's equivalent gate is unaffected (its `comment_created` payload does ship `issue.fields.status.name`).
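To make the failure mode concrete, here is a minimal sketch of why a `currentStateId !== planningStateId` gate drops everything when the payload never carries `stateId`. The NestedIssue type and shouldDropMention function are hypothetical stand-ins, not the trigger code from PR #1201, and the six real keys of the nested issue are deliberately not reproduced.

```typescript
// Hypothetical minimal shape: the old type declared stateId, but the
// Comment webhook's nested issue does not include it in practice.
interface NestedIssue {
  id: string;
  stateId?: string;
}

// Sketch of the removed gate: returns true when the mention is dropped.
function shouldDropMention(issue: NestedIssue, planningStateId: string): boolean {
  const currentStateId = issue.stateId; // undefined for real Comment payloads
  // undefined !== "<any state id>" is always true, so every mention is dropped.
  return currentStateId !== planningStateId;
}
```

The gate only behaves as intended when `stateId` is present, which is why pinning a real prod-shape payload in a test catches this class of bug where a hand-written type does not.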
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: aaight <aaight42@gmail.com>
Co-authored-by: Cascade Bot <bot@cascade.dev>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Summary
This PR improves PostgreSQL handling in the agent execution environment:
- Configures password authentication (`postgres:postgres`)

Changes
`src/agents/utils/setup.ts`
- `startPostgres()`: uses the `-w` flag for synchronous startup (waits until ready)

`Dockerfile`
- Configures `pg_hba.conf` for proper authentication:
  - Local socket: `trust` (for admin tasks)
  - TCP connections: `md5` password authentication
- Password `postgres` for user `postgres`
- Connection string: `postgresql://postgres:postgres@localhost:5432/postgres`

`src/agents/prompts/templates/partials/environment.eta`
- Start/stop/status commands using `pg_ctl`

Context
Analysis of a failed agent session revealed the agent wasted 11+ iterations trying to start PostgreSQL using incorrect commands (`systemctl`, `brew services`) that don't work in the container environment. This PR ensures:

Test Plan

🤖 Generated with Claude Code
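As a rough illustration of the fail-fast startup this PR describes, the sketch below checks whether the server is already running and otherwise starts it synchronously with `pg_ctl ... -w`, throwing if that fails. It is not the actual `startPostgres()` from `src/agents/utils/setup.ts`: the injected CommandRunner, the status check, the data directory argument, and the error message are assumptions. The command runner is injected so the sketch stays self-contained; in Node it would typically wrap a child-process call.

```typescript
// Hypothetical sketch of a fail-fast PostgreSQL start.
interface CommandResult {
  status: number; // process exit code
  stderr: string;
}

// Assumed abstraction over running a command synchronously.
type CommandRunner = (command: string, args: string[]) => CommandResult;

function startPostgres(run: CommandRunner, dataDir: string): void {
  // `pg_ctl status` exits with 0 when a server is running in dataDir.
  const status = run("pg_ctl", ["status", "-D", dataDir]);
  if (status.status === 0) return;

  // -w makes pg_ctl wait until the server is ready (or startup fails).
  const start = run("pg_ctl", ["start", "-D", dataDir, "-w"]);
  if (start.status !== 0) {
    // Fail fast so the agent session never proceeds with a broken database.
    throw new Error(
      `PostgreSQL failed to start (exit code ${start.status}): ${start.stderr.trim()}`,
    );
  }
}
```

Throwing here, rather than logging and continuing, matches the stated goal of stopping the session before an agent spends iterations troubleshooting connectivity.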