Merge dev to main#136
Merged
zbigniewsobiecki merged 1 commit into main from dev on Feb 11, 2026
Conversation
…ation (#135) (#135)

Extract ~1,530 lines of duplicated code across 4 agent files and 10 trigger files into focused shared modules, achieving a net reduction of ~436 lines.

Agent shared modules (src/agents/shared/):
- lifecycle.ts: generic executeAgentLifecycle<TContext>() replacing 4 copies of the ~155-line execution skeleton (logger, watchdog, repo, LLMist, loop)
- builderFactory.ts: createConfiguredBuilder() replacing 4 copies of the 13-method AgentBuilder chain (~60 lines each)
- syntheticCalls.ts: injectDirectoryListing(), injectContextFiles(), injectAUContext(), injectSyntheticCall() replacing 4 copies each
- modelResolution.ts: resolveModelConfig() replacing 4 copies of model/prompt/iteration config resolution
- prFormatting.ts: formatPRDetails(), formatPRDiff() replacing 3 identical copies across PR-aware agents
- repository.ts: setupRepository() with SetupRepositoryOptions replacing 4 copies of clone/checkout/setup logic

Trigger shared modules:
- triggers/github/utils.ts: added isSelfAuthored() and requireTrelloCardId() replacing 3 copies of auth-user checks and 3 copies of card validation
- triggers/shared/agent-result-handler.ts: handleAgentResultArtifacts() replacing duplicate log-upload + cost-update blocks in both webhook handlers

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
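The lifecycle extraction above can be pictured as a hooks-based skeleton. This is a minimal sketch, not the actual module: the hook names (`createContext`, `runLoop`, `cleanup`) are assumptions standing in for the real logger/watchdog/repo/LLMist setup the commit lists.

```typescript
// Hypothetical sketch of executeAgentLifecycle<TContext>() — one shared
// skeleton replacing four copy-pasted execution paths.
type LifecycleHooks<TContext> = {
  createContext: () => Promise<TContext>;      // e.g. logger + repo setup (assumed name)
  runLoop: (ctx: TContext) => Promise<string>; // the agent's main loop (assumed name)
  cleanup?: (ctx: TContext) => Promise<void>;  // e.g. stop watchdog (assumed name)
};

async function executeAgentLifecycle<TContext>(
  hooks: LifecycleHooks<TContext>,
): Promise<string> {
  const ctx = await hooks.createContext();
  try {
    // The shared skeleton owns ordering and error handling...
    return await hooks.runLoop(ctx);
  } finally {
    // ...and guarantees teardown even when the loop throws.
    await hooks.cleanup?.(ctx);
  }
}

// Usage: each agent supplies only its own hooks.
void (async () => {
  const result = await executeAgentLifecycle({
    createContext: async () => ({ name: "planning" }),
    runLoop: async (ctx) => `ran ${ctx.name}`,
  });
  console.log(result); // → "ran planning"
})();
```

Each agent then shrinks to its hook implementations, which is how a ~155-line skeleton duplicated four times nets out to a single generic function.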
zbigniewsobiecki added a commit that referenced this pull request on Apr 29, 2026
…-pr-comment runs (#1221)

Both GitHub triggers set `workItemId` at the top level of their TriggerResult but omitted it from `agentInput`. tryCreateRun reads workItemId from agentInput, so 0/103 respond-to-review and 0/9 respond-to-pr-comment runs had a non-null `agent_runs.work_item_id` — hiding every one of them from the dashboard's work-item page (which filters by that column). Confirmed on 2026-04-29 via `cascade runs list --project ucho`: four respond-to-review runs for ucho/MNG-400 (PR #136) on 2026-04-28 were invisible despite firing successfully.

The runtime safety net in `runAgentExecutionPipeline` was supposed to catch this but only patched when re-resolution returned a value DIFFERENT from `result.workItemId`. Since both triggers populated `result.workItemId` correctly, the safety net no-op'd.

Three changes:
1. Add `workItemId` (shorthand) to the `agentInput` literal in `pr-review-submitted.ts:82` and `pr-comment-mention.ts:143`.
2. Widen the runtime patch in `agent-execution.ts:498` so it triggers on `agentInput.workItemId !== resolvedWorkItemId` rather than on `result.workItemId !== resolvedWorkItemId`. Now any future trigger that forgets the field still gets repaired before it reaches the DB.
3. Add `tests/unit/triggers/trigger-work-item-id-consistency.test.ts` — a static-analysis guard modeled on `trigger-event-consistency.test.ts`. Walks every TriggerHandler file via the TypeScript AST and asserts that any return literal with a top-level `workItemId` and a real `agentType` also includes `workItemId` inside `agentInput`. Fires loudly with `file:line` when violated.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
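The widened runtime patch can be sketched as a small repair function. The shapes below are simplified assumptions — the real TriggerResult and pipeline carry many more fields — but the condition mirrors the fix: compare `agentInput.workItemId` (what tryCreateRun actually reads) against the resolved id, not the top-level field.

```typescript
// Simplified shapes for illustration only — the real TriggerResult is richer.
type TriggerResult = {
  agentType: string;
  workItemId?: string;
  agentInput: Record<string, unknown> & { workItemId?: string };
};

// The repaired check: patch whenever agentInput disagrees with the resolved
// id, even if the top-level workItemId was already correct (the old bug).
function ensureWorkItemId(
  result: TriggerResult,
  resolvedWorkItemId: string,
): TriggerResult {
  if (result.agentInput.workItemId !== resolvedWorkItemId) {
    return {
      ...result,
      workItemId: resolvedWorkItemId,
      agentInput: { ...result.agentInput, workItemId: resolvedWorkItemId },
    };
  }
  return result;
}

// A trigger that set only the top-level field now gets repaired before the DB write:
const repaired = ensureWorkItemId(
  { agentType: "respond-to-review", workItemId: "MNG-400", agentInput: {} },
  "MNG-400",
);
console.log(repaired.agentInput.workItemId); // → "MNG-400"
```

Under the old condition (`result.workItemId !== resolvedWorkItemId`) this exact input would have been left untouched, which is why 103 runs slipped through with a null column.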
zbigniewsobiecki added a commit that referenced this pull request on Apr 29, 2026
… agent improvements (#1225)

* feat(alerting): add configurable PM results list for alerting agent + enable UX

* fix(alerting): map PM provider to correct discovery capability in container picker

Replace the non-existent `containers` capability (which caused NOT_IMPLEMENTED errors for every provider) with provider-specific mappings: Trello→`boards`, JIRA→`projects`, Linear→`teams`. Disable the Fetch button with an explanatory tooltip for unknown providers. Also simplify the `alertingResultsContainerId ?? undefined` ternary to a plain `??` expression in promptContext.ts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(triggers): persist workItemId on respond-to-review and respond-to-pr-comment runs

Both GitHub triggers set `workItemId` at the top level of their TriggerResult but omitted it from `agentInput`. tryCreateRun reads workItemId from agentInput, so 0/103 respond-to-review and 0/9 respond-to-pr-comment runs had a non-null `agent_runs.work_item_id` — hiding every one of them from the dashboard's work-item page (which filters by that column). Confirmed on 2026-04-29 via `cascade runs list --project ucho`: four respond-to-review runs for ucho/MNG-400 (PR #136) on 2026-04-28 were invisible despite firing successfully.

The runtime safety net in `runAgentExecutionPipeline` was supposed to catch this but only patched when re-resolution returned a value DIFFERENT from `result.workItemId`. Since both triggers populated `result.workItemId` correctly, the safety net no-op'd.

Three changes:
1. Add `workItemId` (shorthand) to the `agentInput` literal in `pr-review-submitted.ts:82` and `pr-comment-mention.ts:143`.
2. Widen the runtime patch in `agent-execution.ts:498` so it triggers on `agentInput.workItemId !== resolvedWorkItemId` rather than on `result.workItemId !== resolvedWorkItemId`. Now any future trigger that forgets the field still gets repaired before it reaches the DB.
3. Add `tests/unit/triggers/trigger-work-item-id-consistency.test.ts` — a static-analysis guard modeled on `trigger-event-consistency.test.ts`. Walks every TriggerHandler file via the TypeScript AST and asserts that any return literal with a top-level `workItemId` and a real `agentType` also includes `workItemId` inside `agentInput`. Fires loudly with `file:line` when violated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(alerting): fix PM container picker to fetch correct sub-container level

- JIRA: switch from `projects` to `states` capability (fetches workflow status names, which is what backlogListId expects); pass configured projectKey as args.containerId; save item.name not numeric state ID
- Trello: disable the dropdown picker entirely — no `lists` discovery capability exists in the manifest; users must enter the list ID manually
- Linear: keep `teams` capability (team ID is already the correct value)

Extract ContainerInput helper component to keep AlertingTab complexity below the project's cognitive-complexity limit. Pass pmConfig from integration-form.tsx so JIRA states can be scoped to the right project.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(spec): 017 router-silent-failure-hardening + plans

Three production hardening fixes derived from the 2026-04-29 24h log/webhook audit. Discovered alongside PR #1220 (workItemId-on-respond-to-* fix).

Failure modes:
A) Linear PM-ack silently skipped on PM-focused agents (24/day)
B) Pipeline-capacity gate fails open on every PM status-changed (32/day)
C) Progress-comment double-delete race produces 404 log spam (72/day)

Decomposed per spec strategic decision #8 into three independent plans (none blocks another):
1-pm-ack-coverage (consolidate dispatch via manifest registry)
2-capacity-gate-pm-scope (shared adapter helper, fail-closed semantics)
3-progress-comment-double-delete (consumed-flag, gated fallback, 404-DEBUG)

This commit ships scaffolding only. Plan 3's implementation lands in this branch; plans 1 and 2 follow as their own PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): 017/3 lock

* test: add coverage for buildExecutionPlan in secretOrchestrator

* fix(triggers): suppress redundant progress-comment DELETE after gadget mid-run cleanup

Plan 017/3 (progress-comment-double-delete). Closes failure mode C from spec 017's 24h log audit on 2026-04-29.

Root cause: deleteProgressCommentOnSuccess (post-agent hook) read sessionState.initialCommentId, fell back to result.agentInput.ackCommentId when session state was empty, and issued a DELETE. But "session state cleared by an in-run gadget" was indistinguishable from "session state never populated", so the fallback fired and re-deleted comments that gadgets had already disposed of mid-run. GitHub returned 404 and produced 72 WARN entries per day on cascade-router (no functional impact — the comment WAS deleted — but the noise dominated WARN volume and obscured real failures).

Three changes:
1. New `initialCommentIdConsumed: boolean` flag on SessionStateData (default false). Both `deleteInitialComment` (gadget-driven) and `clearInitialComment` (sidecar-driven) set it to true after disposing of the comment. Distinguishes "had a comment, now gone" from "never had one".
2. `deleteProgressCommentOnSuccess` reads the flag first; if true, skips the entire deletion path including the legacy `agentInput.ackCommentId` fallback. The legacy fallback continues to work for paths that never populated session state (consumed stays false).
3. `githubClient.deletePRComment` treats HTTP 404 as success-equivalent under RFC-7231 idempotency: returns without throwing, logs at DEBUG with the comment id. Other HTTP errors (5xx, 401, network) continue to throw.

Defense in depth: even if a future regression of similar shape recurs, the WARN volume stays clean while a DEBUG breadcrumb persists for audit.
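The 404-as-success behavior can be sketched as follows. This is an illustration, not the real client: `doDelete` and the logger shape are stand-ins injected for the example, since the commit doesn't show `githubClient`'s internals.

```typescript
// Sketch of a DELETE that treats 404 as success-equivalent (the resource is
// already gone, which is the state the caller wanted). Names are illustrative.
async function deletePRCommentIdempotent(
  commentId: number,
  doDelete: (id: number) => Promise<{ status: number }>, // stand-in for the HTTP call
  log: { debug: (msg: string) => void },
): Promise<void> {
  const res = await doDelete(commentId);
  if (res.status === 404) {
    // Already deleted — leave only a DEBUG breadcrumb instead of a WARN.
    log.debug(`PR comment ${commentId} already deleted (404)`);
    return;
  }
  if (res.status >= 400) {
    // 5xx / 401 / anything else is still a real failure and must surface.
    throw new Error(`deletePRComment failed with HTTP ${res.status}`);
  }
}
```

The consumed-flag (change 1) prevents the redundant DELETE from being issued at all; this 404 handling is the second layer that keeps the logs clean if one slips through anyway.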
Test coverage: +7 tests in sessionState (new `initialCommentIdConsumed flag` describe block), +3 in ack-comments (consumed-bypass + headline regression pin + legacy-fallback preservation), +4 in client (404 → DEBUG + 5xx/401/network rejection regression pins). Full suite: 470 files / 8649 tests passing (+14 from baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): 017/1 lock

* fix(router): consolidate PM-ack dispatch via manifest registry, restore Linear coverage

Plan 017/1 (pm-ack-coverage). Closes failure mode A from spec 017's 24h log audit on 2026-04-29: PM-focused agents (e.g. backlog-manager) triggered from a GitHub webhook against Linear-based projects silently skipped their PM-side ack. The router-adapter's local `postPMAck` helper had `if (pmType === 'trello')` / `if (pmType === 'jira')` literal branches but no Linear case, so control fell through to `WARN: Unknown PM type for PM-focused agent ack, skipping` — 24× per day on prod cascade-router, all from `ucho`. A near-identical helper at `src/triggers/shared/pm-ack.ts:postPMAckComment` had the Linear branch, so this was pure parallel-path drift between two helpers that should have been one (same shape as PR #1220).

Three changes:
1. New consolidated helper `dispatchPMAck` at `src/router/pm-ack-dispatch.ts`. Indexes the manifest registry directly via `getPMProvider(pmType).platformClientFactory(projectId).postComment(...)`. Zero per-PM-type literal branching anywhere on the dispatch surface. Adding a future PM provider lands the dispatch path for free.
2. Both legacy call sites delegate:
   - `src/router/adapters/github.ts:postPMAck` (the buggy one)
   - `src/triggers/shared/pm-ack.ts:postPMAckComment` (had Linear; preserves the existing `string | null` return contract via String() normalization)
3. The "Unknown PM type" branch converts from silent WARN+skip to ERROR-level log + Sentry capture under stable tag `pm_ack_unknown_pm_type`. Mirrors the spec-015 `wedged_lock_canary` precedent. Once the consolidation ships, hitting that branch represents a real configuration error (project pinned to a deleted provider), not steady-state noise.

Regression nets:
- Per-provider conformance assertion: the existing PM manifest harness gains one new `it()` inside its `describe.each` block — `dispatchPMAck reaches this provider without throwing`. Adding a future provider whose `platformClientFactory` is misconfigured fails CI loudly.
- Static guard against future literal-branching drift: `tests/unit/router/pm-ack-dispatch.test.ts` reads each of the three call sites' source and asserts no `pmType === 'trello' | 'jira' | 'linear'` patterns appear within their bodies. Modeled on `trigger-event-consistency.test.ts`. A future maintainer who hand-codes a branch fails this guard with a precise file:line citation.

Test coverage: +6 tests on `dispatchPMAck` (Trello/JIRA/Linear happy paths, null-from-postComment, unknown-pmType Sentry path, undefined-pmType same), +4 on `postPMAckComment` delegation contract, +1 per-provider assertion in the conformance harness (4 currently-registered providers including the test fixture), +1 Linear regression pin on the github router adapter's `postAck` public surface, plus 2 existing github adapter tests migrated from mocking `postTrelloAck` to mocking `dispatchPMAck` (the dependency direction changed). Full suite: 472 files / 8668 tests (+19 from baseline).

Doc updates:
- `src/integrations/README.md` gains a "PM-ack dispatch coverage invariant" subsection in the conformance harness section, documenting the no-literal-branching rule, the Sentry-tag escalation, and pointing at both regression nets.
- `CHANGELOG.md` entry under spec 017.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): 017/2 lock

* fix(router): wrap PM-source dispatch in PM-provider scope so capacity gate enforces maxInFlightItems

Plan 017/2 (capacity-gate-pm-scope).
Closes failure mode B from spec 017's 24h log audit on 2026-04-29.

Root cause: the pipeline-capacity gate at `src/triggers/shared/pipeline-capacity-gate.ts:33-45` calls `getPMProvider()` to count in-flight items. For every PM `status-changed` trigger this threw `No PMProvider in scope` because the three PM router adapters (`src/router/adapters/{linear,trello,jira}.ts`) wrapped trigger dispatch in their per-PM-type credential AsyncLocalStorage scope (`withLinearCredentials` / `withTrelloCredentials` / `withJiraCredentials`) but NOT in PM-provider scope. The GitHub adapter at `src/router/adapters/github.ts:280` already had both wrappings — the PM router adapters were the outliers. The gate caught the throw and fell through to its conservative branch: `WARN: pipeline-capacity-gate: PM provider unavailable, allowing run` + `return false` (allow). Net effect: `maxInFlightItems` was silently a no-op for the entire PM-source path. 32 occurrences/day on prod cascade-router. The gate's whole purpose is preventing the multi-card-into-TODO incident class documented in the file header — and the protection had been disabled since the gate was first introduced.

Three changes:
1. New shared helper `withPMScopeForDispatch(project, dispatch)` at `src/router/adapters/_shared.ts`. Resolves the PM provider via `createPMProvider(project)` and wraps `dispatch` in `withPMProvider`. Mirrors the GitHub adapter's correct shape.
2. The three PM router adapters consume the helper inside `dispatchWithCredentials`, layering PM-provider scope on top of their existing per-PM-type credential scope.
3. The capacity gate's "PM provider unavailable" branch flips from `WARN + return false` (allow) to ERROR + Sentry capture under stable tag `pipeline_capacity_gate_no_pm_provider` + `return true` (block). Once the routine path establishes scope (changes 1+2 above), hitting that branch represents a real AsyncLocalStorage scope leak that operators need to investigate. Failing closed is preferable to silently failing open and re-introducing the original incident class.

Regression nets:
- `tests/unit/integrations/pm-router-adapter-pm-scope.test.ts`: static guard reading each PM router adapter source and asserting it contains either `withPMScopeForDispatch` or `withPMProvider`. A future PM router adapter added without the wrapping fails this guard with a precise file path. Modeled on `trigger-event-consistency.test.ts`.
- `tests/unit/router/adapters/with-pm-scope-for-dispatch.test.ts`: 4 tests covering happy path (PM provider resolves inside the wrapped callback), return-value passthrough, null dispatch result, and error propagation.
- `pipeline-capacity-gate.test.ts`: existing "allows when no PM provider" test inverted to "FAILS CLOSED (blocks) when no PM provider scope" (the headline behavioral change), plus 2 new positive-path regression pins.

Side effects on existing trigger tests: three trigger-handler test files (`linear-status-changed.test.ts`, `jira-status-changed.test.ts`, `status-changed.test.ts`) call `shouldBlockForPipelineCapacity` indirectly via the trigger handlers. After the fail-closed flip, those handlers were blocked because the unit tests don't establish PM-provider scope. Each test file gains a `vi.mock('.../pipeline-capacity-gate.js')` passthrough so trigger-logic assertions still run — the gate's behavior is tested in its own dedicated test file. Three PM router adapter test files (`linear.test.ts`, `trello.test.ts`, `jira.test.ts`) gain a similar passthrough mock for `_shared.js:withPMScopeForDispatch` so they don't pull the real PM manifest registry.

Doc updates:
- `CLAUDE.md` Architecture section gains a "Capacity-gate invariant" paragraph documenting the wrapping requirement, the fail-closed policy, the Sentry tag, and the static guard.
- `CHANGELOG.md` entry under spec 017.

Full suite: 474 files / 8677 tests passing (+9 from Plan 1's baseline).
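The scope mechanism behind `withPMProvider` / `getPMProvider` can be illustrated with Node's `AsyncLocalStorage`. The names below mirror the commit, but the bodies are assumptions: a minimal recreation of the ambient-provider pattern, not the project's actual implementations.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Illustrative PM-provider surface; the real interface is richer.
type PMProvider = { countInFlightItems: () => number };

const pmScope = new AsyncLocalStorage<PMProvider>();

// Establish ambient scope around a dispatch callback.
const withPMProvider = <T>(provider: PMProvider, fn: () => T): T =>
  pmScope.run(provider, fn);

// What the capacity gate does: resolve the provider from ambient scope.
// Outside any withPMProvider wrapping this throws — the bug was that the
// PM router adapters never established the scope, so the gate always threw
// and then failed open.
function getPMProvider(): PMProvider {
  const provider = pmScope.getStore();
  if (!provider) throw new Error("No PMProvider in scope");
  return provider;
}

// Dispatch wrapped in scope: the gate can now count in-flight items.
const inFlight = withPMProvider(
  { countInFlightItems: () => 2 },
  () => getPMProvider().countInFlightItems(),
);
console.log(inFlight); // → 2
```

Because the scope is carried through async continuations, any code the dispatch callback awaits into can call `getPMProvider()` without threading the provider through every signature, which is why the fix is a single wrapper at the adapter boundary.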
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(spec): 017 done — all three plans complete (router-side silent-failure hardening)

* fix(integration-tests): wrap implementation-trigger handle() in withPMProvider scope + plan 2 polish

CI integration-tests caught a regression I missed locally because lefthook's pre-push only runs `npm run test:fast` (unit-only), not the full integration suite. Four integration tests asserted `result?.agentType === 'implementation'` but received `undefined` after plan 2's fail-closed flip, because they call real trigger `.handle(ctx)` directly without establishing PM-provider scope — the same class of breakage I fixed in unit tests with vi.mock passthroughs.

Failing tests:
- tests/integration/trigger-registry.test.ts:323 (TrelloStatusChangedTodoTrigger.handle)
- tests/integration/trigger-registry.test.ts:591 (registry.dispatch with implementation trigger)
- tests/integration/pm-provider-switching.test.ts:258 (Trello implementation dispatch)
- tests/integration/pm-provider-switching.test.ts:293 (JIRA implementation dispatch)

Integration tests don't mock the gate (by design), so the right fix is to mirror production wrapping: each call site now wraps in `withPMProvider(createPMProvider(project), () => trigger.handle(ctx))` — which is exactly what `withPMScopeForDispatch` does in the PM router adapters. The synthetic in-test trigger at trigger-registry.test.ts:117 doesn't call the gate (it returns a hard-coded TriggerResult), so no scope wrap needed there. Same for any non-implementation agent (planning / splitting / etc.) since `SLOT_CONSUMING_AGENTS` only contains 'implementation'.

Also applies polish from local-review on the spec 017 work:
1. Reworded stale comment in `src/gadgets/sessionState.ts:186` — the comment claimed clearing `initialCommentId` made the post-agent callback "see null and short-circuit", but post-plan-3 the callback's actual gate is the `initialCommentIdConsumed` flag set on the line below. Comment now describes both the eager-clear AND the consumed-flag gate together.
2. Trimmed redundant 3-line wrapping comments to zero lines across the three PM router adapters (`linear.ts`, `trello.ts`, `jira.ts`). The shared helper's name (`withPMScopeForDispatch`) is self-documenting and CLAUDE.md's "Capacity-gate invariant" paragraph carries the rationale. Per CLAUDE.md "Default to writing no comments" guidance.
3. Tightened conformance harness assertion at `pm-conformance.test.ts`'s `dispatchPMAck reaches this provider without throwing` test. Previously the only `expect(...)` was conditional (`if (result !== undefined)`), so when result was undefined the test passed with zero assertions executed — a "tests that would still pass if the code under test were deleted" anti-pattern. Now uses `expect.assertions(1)` and asserts the result is either undefined OR has the expected AckResult shape, with a clear failure message naming the provider.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

--------

Co-authored-by: Cascade Bot <bot@cascade.dev>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
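The zero-literal-branching dispatch at the heart of the pm-ack consolidation can be sketched as a registry lookup. This is a hypothetical miniature: the real `dispatchPMAck`, provider entries, and error path (ERROR log + Sentry capture) are more involved than shown.

```typescript
// Miniature of manifest-registry dispatch: providers register once, and the
// dispatch surface indexes the registry — no `pmType === '...'` branches, so
// a missing provider (the Linear bug) cannot silently fall through.
type PlatformClient = { postComment: (text: string) => Promise<string | null> };
type PMProviderEntry = { platformClientFactory: (projectId: string) => PlatformClient };

const registry = new Map<string, PMProviderEntry>();
const registerPMProvider = (pmType: string, entry: PMProviderEntry): void => {
  registry.set(pmType, entry);
};

async function dispatchPMAck(
  pmType: string,
  projectId: string,
  text: string,
): Promise<string | null> {
  const entry = registry.get(pmType);
  // In production this branch escalates (ERROR + Sentry tag) instead of
  // WARN-and-skip; here a throw stands in for that.
  if (!entry) throw new Error(`Unknown PM type: ${pmType}`);
  return entry.platformClientFactory(projectId).postComment(text);
}

// Adding a provider lands the dispatch path for free — no new branch needed.
registerPMProvider("linear", {
  platformClientFactory: () => ({ postComment: async (t) => `linear-comment:${t}` }),
});
void dispatchPMAck("linear", "proj-1", "done").then((id) => console.log(id)); // → "linear-comment:done"
```

The design choice is that coverage becomes a property of the registry rather than of each call site, which is exactly what the per-provider conformance assertion then verifies in CI.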
zbigniewsobiecki added a commit that referenced this pull request on Apr 29, 2026
…1227) * feat(alerting): add configurable PM results list for alerting agent + enable UX * fix(alerting): map PM provider to correct discovery capability in container picker Replace the non-existent `containers` capability (which caused NOT_IMPLEMENTED errors for every provider) with provider-specific mappings: Trello→`boards`, JIRA→`projects`, Linear→`teams`. Disable the Fetch button with an explanatory tooltip for unknown providers. Also simplify the `alertingResultsContainerId ?? undefined` ternary to a plain `??` expression in promptContext.ts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(triggers): persist workItemId on respond-to-review and respond-to-pr-comment runs Both GitHub triggers set `workItemId` at the top level of their TriggerResult but omitted it from `agentInput`. tryCreateRun reads workItemId from agentInput, so 0/103 respond-to-review and 0/9 respond-to-pr-comment runs had a non-null `agent_runs.work_item_id` — hiding every one of them from the dashboard's work-item page (which filters by that column). Confirmed on 2026-04-29 via `cascade runs list --project ucho`: four respond-to-review runs for ucho/MNG-400 (PR #136) on 2026-04-28 were invisible despite firing successfully. The runtime safety net in `runAgentExecutionPipeline` was supposed to catch this but only patched when re-resolution returned a value DIFFERENT from `result.workItemId`. Since both triggers populated `result.workItemId` correctly, the safety net no-op'd. Three changes: 1. Add `workItemId` (shorthand) to the `agentInput` literal in `pr-review-submitted.ts:82` and `pr-comment-mention.ts:143`. 2. Widen the runtime patch in `agent-execution.ts:498` so it triggers on `agentInput.workItemId !== resolvedWorkItemId` rather than on `result.workItemId !== resolvedWorkItemId`. Now any future trigger that forgets the field still gets repaired before it reaches the DB. 3. 
Add `tests/unit/triggers/trigger-work-item-id-consistency.test.ts` — a static-analysis guard modeled on `trigger-event-consistency.test.ts`. Walks every TriggerHandler file via the TypeScript AST and asserts that any return literal with a top-level `workItemId` and a real `agentType` also includes `workItemId` inside `agentInput`. Fires loudly with `file:line` when violated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(alerting): fix PM container picker to fetch correct sub-container level - JIRA: switch from `projects` to `states` capability (fetches workflow status names, which is what backlogListId expects); pass configured projectKey as args.containerId; save item.name not numeric state ID - Trello: disable the dropdown picker entirely — no `lists` discovery capability exists in the manifest; users must enter the list ID manually - Linear: keep `teams` capability (team ID is already the correct value) Extract ContainerInput helper component to keep AlertingTab complexity below the project's cognitive-complexity limit. Pass pmConfig from integration-form.tsx so JIRA states can be scoped to the right project. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(spec): 017 router-silent-failure-hardening + plans Three production hardening fixes derived from the 2026-04-29 24h log/webhook audit. Discovered alongside PR #1220 (workItemId-on-respond-to-* fix). Failure modes: A) Linear PM-ack silently skipped on PM-focused agents (24/day) B) Pipeline-capacity gate fails open on every PM status-changed (32/day) C) Progress-comment double-delete race produces 404 log spam (72/day) Decomposed per spec strategic decision #8 into three independent plans (none blocks another): 1-pm-ack-coverage (consolidate dispatch via manifest registry) 2-capacity-gate-pm-scope (shared adapter helper, fail-closed semantics) 3-progress-comment-double-delete (consumed-flag, gated fallback, 404-DEBUG) This commit ships scaffolding only. 
Plan 3's implementation lands in this branch; plans 1 and 2 follow as their own PRs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(plan): 017/3 lock * test: add coverage for buildExecutionPlan in secretOrchestrator * fix(triggers): suppress redundant progress-comment DELETE after gadget mid-run cleanup Plan 017/3 (progress-comment-double-delete). Closes failure mode C from spec 017's 24h log audit on 2026-04-29. Root cause: deleteProgressCommentOnSuccess (post-agent hook) read sessionState.initialCommentId, fell back to result.agentInput.ackCommentId when session state was empty, and issued a DELETE. But "session state cleared by an in-run gadget" was indistinguishable from "session state never populated", so the fallback fired and re-deleted comments that gadgets had already disposed of mid-run. GitHub returned 404 and produced 72 WARN entries per day on cascade-router (no functional impact — the comment WAS deleted — but the noise dominated WARN volume and obscured real failures). Three changes: 1. New `initialCommentIdConsumed: boolean` flag on SessionStateData (default false). Both `deleteInitialComment` (gadget-driven) and `clearInitialComment` (sidecar-driven) set it to true after disposing of the comment. Distinguishes "had a comment, now gone" from "never had one". 2. `deleteProgressCommentOnSuccess` reads the flag first; if true, skips the entire deletion path including the legacy `agentInput.ackCommentId` fallback. The legacy fallback continues to work for paths that never populated session state (consumed stays false). 3. `githubClient.deletePRComment` treats HTTP 404 as success-equivalent under RFC-7231 idempotency: returns without throwing, logs at DEBUG with the comment id. Other HTTP errors (5xx, 401, network) continue to throw. Defense in depth: even if a future regression of similar shape recurs, the WARN volume stays clean while a DEBUG breadcrumb persists for audit. 
Test coverage: +7 tests in sessionState (new `initialCommentIdConsumed flag` describe block), +3 in ack-comments (consumed-bypass + headline regression pin + legacy-fallback preservation), +4 in client (404 → DEBUG + 5xx/401/network rejection regression pins). Full suite: 470 files / 8649 tests passing (+14 from baseline). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(plan): 017/1 lock * fix(router): consolidate PM-ack dispatch via manifest registry, restore Linear coverage Plan 017/1 (pm-ack-coverage). Closes failure mode A from spec 017's 24h log audit on 2026-04-29: PM-focused agents (e.g. backlog-manager) triggered from a GitHub webhook against Linear-based projects silently skipped their PM-side ack. The router-adapter's local `postPMAck` helper had `if (pmType === 'trello')` / `if (pmType === 'jira')` literal branches but no Linear case, so control fell through to `WARN: Unknown PM type for PM-focused agent ack, skipping` — 24× per day on prod cascade-router, all from `ucho`. A near-identical helper at `src/triggers/shared/pm-ack.ts:postPMAckComment` had the Linear branch, so this was pure parallel-path drift between two helpers that should have been one (same shape as PR #1220). Three changes: 1. New consolidated helper `dispatchPMAck` at `src/router/pm-ack-dispatch.ts`. Indexes the manifest registry directly via `getPMProvider(pmType).platformClientFactory(projectId).postComment(...)`. Zero per-PM-type literal branching anywhere on the dispatch surface. Adding a future PM provider lands the dispatch path for free. 2. Both legacy call sites delegate: - `src/router/adapters/github.ts:postPMAck` (the buggy one) - `src/triggers/shared/pm-ack.ts:postPMAckComment` (had Linear; preserves the existing `string | null` return contract via String() normalization) 3. The "Unknown PM type" branch converts from silent WARN+skip to ERROR-level log + Sentry capture under stable tag `pm_ack_unknown_pm_type`. 
Mirrors the spec-015 `wedged_lock_canary` precedent. Once the consolidation ships, hitting that branch represents a real configuration error (project pinned to a deleted provider), not steady-state noise. Regression nets: - Per-provider conformance assertion: the existing PM manifest harness gains one new `it()` inside its `describe.each` block — `dispatchPMAck reaches this provider without throwing`. Adding a future provider whose `platformClientFactory` is misconfigured fails CI loudly. - Static guard against future literal-branching drift: `tests/unit/router/pm-ack-dispatch.test.ts` reads each of the three call sites' source and asserts no `pmType === 'trello' | 'jira' | 'linear'` patterns appear within their bodies. Modeled on `trigger-event-consistency.test.ts`. A future maintainer who hand-codes a branch fails this guard with a precise file:line citation. Test coverage: +6 tests on `dispatchPMAck` (Trello/JIRA/Linear happy paths, null-from-postComment, unknown-pmType Sentry path, undefined-pmType same), +4 on `postPMAckComment` delegation contract, +1 per-provider assertion in the conformance harness (4 currently-registered providers including the test fixture), +1 Linear regression pin on the github router adapter's `postAck` public surface, plus 2 existing github adapter tests migrated from mocking `postTrelloAck` to mocking `dispatchPMAck` (the dependency direction changed). Full suite: 472 files / 8668 tests (+19 from baseline). Doc updates: - `src/integrations/README.md` gains a "PM-ack dispatch coverage invariant" subsection in the conformance harness section, documenting the no-literal- branching rule, the Sentry-tag escalation, and pointing at both regression nets. - `CHANGELOG.md` entry under spec 017. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(plan): 017/2 lock * fix(router): wrap PM-source dispatch in PM-provider scope so capacity gate enforces maxInFlightItems Plan 017/2 (capacity-gate-pm-scope). 
Closes failure mode B from spec 017's 24h log audit on 2026-04-29. Root cause: the pipeline-capacity gate at `src/triggers/shared/pipeline-capacity-gate.ts:33-45` calls `getPMProvider()` to count in-flight items. For every PM `status-changed` trigger this threw `No PMProvider in scope` because the three PM router adapters (`src/router/adapters/{linear,trello,jira}.ts`) wrapped trigger dispatch in their per-PM-type credential AsyncLocalStorage scope (`withLinearCredentials` / `withTrelloCredentials` / `withJiraCredentials`) but NOT in PM-provider scope. The GitHub adapter at `src/router/adapters/github.ts:280` already had both wrappings — the PM router adapters were the outliers. The gate caught the throw and fell through to its conservative branch: `WARN: pipeline-capacity-gate: PM provider unavailable, allowing run` + `return false` (allow). Net effect: `maxInFlightItems` was silently no-op for the entire PM-source path. 32 occurrences/day on prod cascade-router. The gate's whole purpose is preventing the multi-card-into-TODO incident class documented in the file header — and the protection had been disabled since the gate was first introduced. Three changes: 1. New shared helper `withPMScopeForDispatch(project, dispatch)` at `src/router/adapters/_shared.ts`. Resolves the PM provider via `createPMProvider(project)` and wraps `dispatch` in `withPMProvider`. Mirrors the GitHub adapter's correct shape. 2. The three PM router adapters consume the helper inside `dispatchWithCredentials`, layering PM-provider scope on top of their existing per-PM-type credential scope. 3. The capacity gate's "PM provider unavailable" branch flips from `WARN + return false` (allow) to ERROR + Sentry capture under stable tag `pipeline_capacity_gate_no_pm_provider` + `return true` (block). Once the routine path establishes scope (changes 1+2 above), hitting that branch represents a real AsyncLocalStorage scope leak that operators need to investigate. 
Failing closed is preferable to silently failing open and re-introducing the original incident class. Regression nets: - `tests/unit/integrations/pm-router-adapter-pm-scope.test.ts`: static guard reading each PM router adapter source and asserting it contains either `withPMScopeForDispatch` or `withPMProvider`. A future PM router adapter added without the wrapping fails this guard with a precise file path. Modeled on `trigger-event-consistency.test.ts`. - `tests/unit/router/adapters/with-pm-scope-for-dispatch.test.ts`: 4 tests covering happy path (PM provider resolves inside the wrapped callback), return-value passthrough, null dispatch result, and error propagation. - `pipeline-capacity-gate.test.ts`: existing "allows when no PM provider" test inverted to "FAILS CLOSED (blocks) when no PM provider scope" (the headline behavioral change), plus 2 new positive-path regression pins. Side effects on existing trigger tests: three trigger-handler test files (`linear-status-changed.test.ts`, `jira-status-changed.test.ts`, `status-changed.test.ts`) call `shouldBlockForPipelineCapacity` indirectly via the trigger handlers. After the fail-closed flip, those handlers were blocked because the unit tests don't establish PM-provider scope. Each test file gains a `vi.mock('.../pipeline-capacity-gate.js')` passthrough so trigger-logic assertions still run — the gate's behavior is tested in its own dedicated test file. Three PM router adapter test files (`linear.test.ts`, `trello.test.ts`, `jira.test.ts`) gain a similar passthrough mock for `_shared.js:withPMScopeForDispatch` so they don't pull the real PM manifest registry. Doc updates: - `CLAUDE.md` Architecture section gains a "Capacity-gate invariant" paragraph documenting the wrapping requirement, the fail-closed policy, the Sentry tag, and the static guard. - `CHANGELOG.md` entry under spec 017. Full suite: 474 files / 8677 tests passing (+9 from Plan 1's baseline). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(spec): 017 done — all three plans complete (router-side silent-failure hardening)

* fix(integration-tests): wrap implementation-trigger handle() in withPMProvider scope + plan 2 polish

CI integration-tests caught a regression I missed locally because lefthook's pre-push only runs `npm run test:fast` (unit-only), not the full integration suite. Four integration tests asserted `result?.agentType === 'implementation'` but received `undefined` after plan 2's fail-closed flip, because they call the real trigger `.handle(ctx)` directly without establishing PM-provider scope — the same class of breakage I fixed in unit tests with vi.mock passthroughs.

Failing tests:

- tests/integration/trigger-registry.test.ts:323 (TrelloStatusChangedTodoTrigger.handle)
- tests/integration/trigger-registry.test.ts:591 (registry.dispatch with implementation trigger)
- tests/integration/pm-provider-switching.test.ts:258 (Trello implementation dispatch)
- tests/integration/pm-provider-switching.test.ts:293 (JIRA implementation dispatch)

Integration tests don't mock the gate (by design), so the right fix is to mirror the production wrapping: each call site now wraps in `withPMProvider(createPMProvider(project), () => trigger.handle(ctx))` — which is exactly what `withPMScopeForDispatch` does in the PM router adapters. The synthetic in-test trigger at trigger-registry.test.ts:117 doesn't call the gate (it returns a hard-coded TriggerResult), so no scope wrap is needed there. The same holds for any non-implementation agent (planning / splitting / etc.), since `SLOT_CONSUMING_AGENTS` only contains 'implementation'.

Also applies polish from local-review on the spec 017 work:

1. Reworded a stale comment in `src/gadgets/sessionState.ts:186` — the comment claimed clearing `initialCommentId` made the post-agent callback "see null and short-circuit", but post-plan-3 the callback's actual gate is the `initialCommentIdConsumed` flag set on the line below. The comment now describes both the eager clear AND the consumed-flag gate together.
2. Trimmed redundant 3-line wrapping comments to zero lines across the three PM router adapters (`linear.ts`, `trello.ts`, `jira.ts`). The shared helper's name (`withPMScopeForDispatch`) is self-documenting, and CLAUDE.md's "Capacity-gate invariant" paragraph carries the rationale. Per CLAUDE.md's "Default to writing no comments" guidance.
3. Tightened the conformance-harness assertion in `pm-conformance.test.ts`'s `dispatchPMAck reaches this provider without throwing` test. Previously the only `expect(...)` was conditional (`if (result !== undefined)`), so when result was undefined the test passed with zero assertions executed — a "tests that would still pass if the code under test were deleted" anti-pattern. Now uses `expect.assertions(1)` and asserts the result is either undefined OR has the expected AckResult shape, with a clear failure message naming the provider.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(router): replace in-memory PM coalesce window with BullMQ delayed-job coalescing

* ci: add Redis service to integration-tests job

The new tests/integration/coalesce-bullmq.test.ts requires a live Redis instance to verify BullMQ delayed-job supersede semantics, but the CI integration-tests job only provisioned Postgres. Add a redis:7-alpine service on the standard 6379 port and pass REDIS_URL through to the test step so the BullMQ Queue can connect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(router): address review feedback on PM coalesce deferred ack

- worker-entry: call generateAckMessage() at job fire time instead of using workItemTitle as the literal comment text. Adds a postDeferredAck() shared helper that extracts context from the stored payload (same path as the non-coalesced adapter.postAck), generates a proper role-aware ack message via LLM/fallback, then calls dispatchPMAck. Eliminates the 3x copy-paste deferred-ack blocks (nitpick fix).
- queue.ts: document the TOCTOU race in scheduleCoalescedJob (getJob → getState → remove → add is not atomic; two concurrent handlers can lose the second event's data). Practical impact is low but documented for future maintainers.
- webhook-processor.ts + types: update the ackMessage field docstrings to clarify it stores the workItemTitle as a context hint for generateAckMessage at fire time, not pre-generated comment text.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore(worker): remove now-unused biome complexity suppression

The noExcessiveCognitiveComplexity suppression on dispatchJob is no longer needed — extracting postDeferredAck() reduced the function's complexity below the threshold.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(router): address lock-leak and active-job-status issues in coalesced dispatch

Two lock-lifecycle bugs in BullMQ delayed-job coalescing:

1. Lock leak on supersede: when scheduleCoalescedJob removes a delayed job (returns superseded=true), the superseded job's in-memory locks are never released, because BullMQ's remove() does not fire worker.on('failed'). Fix: return supersededJobData from scheduleCoalescedJob so the caller can call clearWorkItemEnqueued / clearAgentTypeEnqueued / clearRecentlyDispatched for the old job.
2. Misleading status when an active job exists: when getState() returns 'active', the previous code called add() (which BullMQ silently ignores for duplicate active IDs) and returned superseded=false, causing the caller to mark locks for a non-existent job. Fix: return activeExists=true and skip add() entirely; the caller skips lock marking and logs accurately.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(coverage): cover deferred-ack worker path + coalesce-config

Adds 8 unit tests for `getCoalesceWindowMs()` (every branch — default, PM_COALESCE_WINDOW_MS, legacy fallback, precedence, 0/disable, non-numeric, negative, empty) and 9 unit tests for the deferred-ack flow in `dispatchJob` / `postDeferredAck` (Trello/JIRA/Linear pendingAck paths, payload-extractor → contextHint fallback, missing agentType, non-fatal `dispatchPMAck` rejection, undefined-result fallback to the pre-existing ackCommentId, Linear-without-workItemId skip, plus the non-pendingAck Linear routing branch that was wholly uncovered).

Closes the codecov/patch gap on PR #1226 — the new BullMQ deferred-ack codepath added in this PR was at 15% (worker-entry.ts) / 0% (coalesce-config.ts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(router): capture coalesce-schedule failures to Sentry + rename ackMessage hint field

Addresses both items from the (failed-to-submit) review run c63c7835:

1. **Should-fix: Sentry capture gap in the coalesce-schedule catch block.** The `scheduleCoalescedJob` failure path in `webhook-processor.ts` logged via `logger.error` but did not call `captureException`. Other dispatch-failure paths flow through BullMQ retry → `worker.on('failed')` → `releaseLocksForFailedJob` → Sentry (per spec 015 plan 1), but this catch handles a Redis-side failure BEFORE the job is enqueued, so it bypasses that pipeline entirely. Redis hiccups during coalesced scheduling were silently escaping observability.
Now captures under stable tag `coalesce_schedule_failure` with `{projectId, workItemId, agentType, coalesceKey, adapterType}` extras.

2. **Nit: rename `ackMessage` → `ackContextHint` on the new job interfaces.** The PR added an `ackMessage?: string` field on TrelloJob/JiraJob/LinearJob (and corresponding `*JobData` interfaces in worker-entry.ts) to store the `workItemTitle` as a context hint for `generateAckMessage()` at deferred-ack fire time — explicitly NOT the literal comment text, per the JSDoc. The name read like the literal text and required reading the JSDoc to avoid misuse. Renamed to `ackContextHint` for self-documentation. The pre-existing `GitHubJob.ackMessage` field is left unchanged because that one IS the literal comment text.

Renamed 3 type fields (Trello/JIRA/Linear job interfaces in queue.ts + matching shapes in worker-entry.ts), 1 set site in webhook-processor.ts, 3 read sites in worker-entry.ts's deferred-ack helper invocations, plus the JSDoc reference and 4 test mocks (test names included).

Full unit suite green (473 files / 8700 tests). Typecheck + lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Cascade Bot <bot@cascade.dev>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
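The supersede and active-job handling in the lock-lifecycle fix above can be modeled with an in-memory stand-in for the BullMQ queue. Everything here is illustrative: the job shapes and the `markActive` helper are assumptions, and only `scheduleCoalescedJob` and its result fields mirror the commit text.

```typescript
// In-memory model of the coalescing decision; a real BullMQ queue is not used.
type JobState = "delayed" | "active";

interface StoredJob<T> { state: JobState; data: T }

interface ScheduleResult<T> {
  superseded: boolean;
  supersededJobData?: T; // lets the caller release the old job's locks
  activeExists: boolean; // caller skips lock marking when true
}

class FakeCoalesceQueue<T> {
  private jobs = new Map<string, StoredJob<T>>();

  // Not atomic (getJob -> getState -> remove -> add), mirroring the TOCTOU
  // caveat documented in queue.ts.
  scheduleCoalescedJob(jobId: string, data: T): ScheduleResult<T> {
    const existing = this.jobs.get(jobId);
    if (existing?.state === "active") {
      // BullMQ silently ignores add() for a duplicate active id, so skip it
      // and report activeExists so the caller doesn't mark phantom locks.
      return { superseded: false, activeExists: true };
    }
    if (existing?.state === "delayed") {
      this.jobs.delete(jobId); // remove() does not fire worker.on('failed')
      this.jobs.set(jobId, { state: "delayed", data });
      return { superseded: true, supersededJobData: existing.data, activeExists: false };
    }
    this.jobs.set(jobId, { state: "delayed", data });
    return { superseded: false, activeExists: false };
  }

  // Test hook simulating the worker picking the job up.
  markActive(jobId: string): void {
    const job = this.jobs.get(jobId);
    if (job) job.state = "active";
  }
}
```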
zbigniewsobiecki added a commit that referenced this pull request on Apr 29, 2026
…ontainer names (#1229)

* feat(alerting): add configurable PM results list for alerting agent + enable UX

* fix(alerting): map PM provider to correct discovery capability in container picker

Replace the non-existent `containers` capability (which caused NOT_IMPLEMENTED errors for every provider) with provider-specific mappings: Trello→`boards`, JIRA→`projects`, Linear→`teams`. Disable the Fetch button with an explanatory tooltip for unknown providers. Also simplify the `alertingResultsContainerId ?? undefined` ternary to a plain `??` expression in promptContext.ts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(triggers): persist workItemId on respond-to-review and respond-to-pr-comment runs

Both GitHub triggers set `workItemId` at the top level of their TriggerResult but omitted it from `agentInput`. tryCreateRun reads workItemId from agentInput, so 0/103 respond-to-review and 0/9 respond-to-pr-comment runs had a non-null `agent_runs.work_item_id` — hiding every one of them from the dashboard's work-item page (which filters by that column). Confirmed on 2026-04-29 via `cascade runs list --project ucho`: four respond-to-review runs for ucho/MNG-400 (PR #136) on 2026-04-28 were invisible despite firing successfully.

The runtime safety net in `runAgentExecutionPipeline` was supposed to catch this but only patched when re-resolution returned a value DIFFERENT from `result.workItemId`. Since both triggers populated `result.workItemId` correctly, the safety net no-op'd.

Three changes:

1. Add `workItemId` (shorthand) to the `agentInput` literal in `pr-review-submitted.ts:82` and `pr-comment-mention.ts:143`.
2. Widen the runtime patch in `agent-execution.ts:498` so it triggers on `agentInput.workItemId !== resolvedWorkItemId` rather than on `result.workItemId !== resolvedWorkItemId`. Now any future trigger that forgets the field still gets repaired before it reaches the DB.
3. Add `tests/unit/triggers/trigger-work-item-id-consistency.test.ts` — a static-analysis guard modeled on `trigger-event-consistency.test.ts`. Walks every TriggerHandler file via the TypeScript AST and asserts that any return literal with a top-level `workItemId` and a real `agentType` also includes `workItemId` inside `agentInput`. Fires loudly with `file:line` when violated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(alerting): fix PM container picker to fetch correct sub-container level

- JIRA: switch from `projects` to `states` capability (fetches workflow status names, which is what backlogListId expects); pass the configured projectKey as args.containerId; save item.name, not the numeric state ID
- Trello: disable the dropdown picker entirely — no `lists` discovery capability exists in the manifest; users must enter the list ID manually
- Linear: keep the `teams` capability (the team ID is already the correct value)

Extract a ContainerInput helper component to keep AlertingTab complexity below the project's cognitive-complexity limit. Pass pmConfig from integration-form.tsx so JIRA states can be scoped to the right project.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(spec): 017 router-silent-failure-hardening + plans

Three production hardening fixes derived from the 2026-04-29 24h log/webhook audit. Discovered alongside PR #1220 (workItemId-on-respond-to-* fix).

Failure modes:

- A) Linear PM-ack silently skipped on PM-focused agents (24/day)
- B) Pipeline-capacity gate fails open on every PM status-changed (32/day)
- C) Progress-comment double-delete race produces 404 log spam (72/day)

Decomposed per spec strategic decision #8 into three independent plans (none blocks another):

- 1-pm-ack-coverage (consolidate dispatch via manifest registry)
- 2-capacity-gate-pm-scope (shared adapter helper, fail-closed semantics)
- 3-progress-comment-double-delete (consumed-flag, gated fallback, 404-DEBUG)

This commit ships scaffolding only.
Plan 3's implementation lands in this branch; plans 1 and 2 follow as their own PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): 017/3 lock

* test: add coverage for buildExecutionPlan in secretOrchestrator

* fix(triggers): suppress redundant progress-comment DELETE after gadget mid-run cleanup

Plan 017/3 (progress-comment-double-delete). Closes failure mode C from spec 017's 24h log audit on 2026-04-29.

Root cause: deleteProgressCommentOnSuccess (post-agent hook) read sessionState.initialCommentId, fell back to result.agentInput.ackCommentId when session state was empty, and issued a DELETE. But "session state cleared by an in-run gadget" was indistinguishable from "session state never populated", so the fallback fired and re-deleted comments that gadgets had already disposed of mid-run. GitHub returned 404 and produced 72 WARN entries per day on cascade-router (no functional impact — the comment WAS deleted — but the noise dominated WARN volume and obscured real failures).

Three changes:

1. New `initialCommentIdConsumed: boolean` flag on SessionStateData (default false). Both `deleteInitialComment` (gadget-driven) and `clearInitialComment` (sidecar-driven) set it to true after disposing of the comment. Distinguishes "had a comment, now gone" from "never had one".
2. `deleteProgressCommentOnSuccess` reads the flag first; if true, it skips the entire deletion path, including the legacy `agentInput.ackCommentId` fallback. The legacy fallback continues to work for paths that never populated session state (consumed stays false).
3. `githubClient.deletePRComment` treats HTTP 404 as success-equivalent under RFC 7231 idempotency: returns without throwing, logs at DEBUG with the comment id. Other HTTP errors (5xx, 401, network) continue to throw.

Defense in depth: even if a future regression of similar shape recurs, the WARN volume stays clean while a DEBUG breadcrumb persists for audit.
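The 404-as-success rule in change 3 can be sketched as a pure status classifier. This is a simplified assumption of its shape; the real `githubClient.deletePRComment` wraps an actual HTTP DELETE and a logger.

```typescript
// Sketch only: pure classifier for the delete outcome, not the real client.
type DeleteOutcome = "deleted" | "already-gone";

function classifyDeleteStatus(status: number): DeleteOutcome {
  if (status >= 200 && status < 300) return "deleted";
  // Idempotent DELETE (RFC 7231): the comment is already gone, which is the
  // desired end state. Callers log this at DEBUG instead of warning.
  if (status === 404) return "already-gone";
  // 5xx, 401, etc. still propagate as real failures.
  throw new Error(`deletePRComment failed: HTTP ${status}`);
}
```

Keeping the classifier pure makes the "404 is fine, 500 is not" boundary trivially unit-testable without a mock HTTP layer.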
Test coverage: +7 tests in sessionState (new `initialCommentIdConsumed flag` describe block), +3 in ack-comments (consumed-bypass + headline regression pin + legacy-fallback preservation), +4 in client (404 → DEBUG + 5xx/401/network rejection regression pins).

Full suite: 470 files / 8649 tests passing (+14 from baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): 017/1 lock

* fix(router): consolidate PM-ack dispatch via manifest registry, restore Linear coverage

Plan 017/1 (pm-ack-coverage). Closes failure mode A from spec 017's 24h log audit on 2026-04-29: PM-focused agents (e.g. backlog-manager) triggered from a GitHub webhook against Linear-based projects silently skipped their PM-side ack.

The router adapter's local `postPMAck` helper had `if (pmType === 'trello')` / `if (pmType === 'jira')` literal branches but no Linear case, so control fell through to `WARN: Unknown PM type for PM-focused agent ack, skipping` — 24× per day on prod cascade-router, all from `ucho`. A near-identical helper at `src/triggers/shared/pm-ack.ts:postPMAckComment` had the Linear branch, so this was pure parallel-path drift between two helpers that should have been one (same shape as PR #1220).

Three changes:

1. New consolidated helper `dispatchPMAck` at `src/router/pm-ack-dispatch.ts`. Indexes the manifest registry directly via `getPMProvider(pmType).platformClientFactory(projectId).postComment(...)`. Zero per-PM-type literal branching anywhere on the dispatch surface. Adding a future PM provider lands the dispatch path for free.
2. Both legacy call sites delegate:
   - `src/router/adapters/github.ts:postPMAck` (the buggy one)
   - `src/triggers/shared/pm-ack.ts:postPMAckComment` (had Linear; preserves the existing `string | null` return contract via String() normalization)
3. The "Unknown PM type" branch converts from a silent WARN+skip to an ERROR-level log + Sentry capture under stable tag `pm_ack_unknown_pm_type`.
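A sketch of the registry-indexed dispatch in change 1, with the Sentry escalation from change 3 modeled as a callback. The registry, the client shape, and the capture callback are simplified stand-ins for the real manifest registry and Sentry, and `postComment` is made synchronous for brevity.

```typescript
// Sketch only: no per-PM-type literal branches anywhere on the dispatch surface.
interface PlatformClient {
  postComment(workItemId: string, body: string): string | null;
}

interface PMProviderEntry {
  platformClientFactory(projectId: string): PlatformClient;
}

// Stand-in for the manifest registry; providers register themselves here.
const pmRegistry = new Map<string, PMProviderEntry>();

function dispatchPMAck(
  pmType: string | undefined,
  projectId: string,
  workItemId: string,
  body: string,
  capture: (err: Error, tag: string) => void, // stand-in for Sentry capture
): string | null {
  const entry = pmType ? pmRegistry.get(pmType) : undefined;
  if (!entry) {
    // ERROR + Sentry instead of the old silent WARN+skip.
    capture(new Error(`Unknown PM type: ${String(pmType)}`), "pm_ack_unknown_pm_type");
    return null;
  }
  // A newly registered provider gets this path for free: no code change here.
  return entry.platformClientFactory(projectId).postComment(workItemId, body);
}
```

The design point is that adding a provider is a registry entry, not an edit to the dispatcher, so the Linear-shaped gap cannot recur.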
Mirrors the spec-015 `wedged_lock_canary` precedent. Once the consolidation ships, hitting that branch represents a real configuration error (project pinned to a deleted provider), not steady-state noise.

Regression nets:

- Per-provider conformance assertion: the existing PM manifest harness gains one new `it()` inside its `describe.each` block — `dispatchPMAck reaches this provider without throwing`. Adding a future provider whose `platformClientFactory` is misconfigured fails CI loudly.
- Static guard against future literal-branching drift: `tests/unit/router/pm-ack-dispatch.test.ts` reads each of the three call sites' source and asserts no `pmType === 'trello' | 'jira' | 'linear'` patterns appear within their bodies. Modeled on `trigger-event-consistency.test.ts`. A future maintainer who hand-codes a branch fails this guard with a precise file:line citation.

Test coverage: +6 tests on `dispatchPMAck` (Trello/JIRA/Linear happy paths, null-from-postComment, unknown-pmType Sentry path, undefined-pmType same), +4 on `postPMAckComment` delegation contract, +1 per-provider assertion in the conformance harness (4 currently-registered providers including the test fixture), +1 Linear regression pin on the github router adapter's `postAck` public surface, plus 2 existing github adapter tests migrated from mocking `postTrelloAck` to mocking `dispatchPMAck` (the dependency direction changed).

Full suite: 472 files / 8668 tests (+19 from baseline).

Doc updates:

- `src/integrations/README.md` gains a "PM-ack dispatch coverage invariant" subsection in the conformance-harness section, documenting the no-literal-branching rule, the Sentry-tag escalation, and pointing at both regression nets.
- `CHANGELOG.md` entry under spec 017.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): 017/2 lock

* fix(router): wrap PM-source dispatch in PM-provider scope so capacity gate enforces maxInFlightItems

Plan 017/2 (capacity-gate-pm-scope).
* fix(router): sanitize Docker-invalid chars in jobId when building worker container name

Live regression on prod: PR #1226 introduced coalesced-job IDs shaped `coalesce:${projectId}:${workItemId}` (e.g. `coalesce:ucho:MNG-413`). `spawnWorker()` blindly concatenated this into the Docker container name (`cascade-worker-coalesce:ucho:MNG-413`), which Docker rejected with HTTP 400 "bad parameter — Invalid container name (...), only [a-zA-Z0-9][a-zA-Z0-9_.-] are allowed". Every coalesced job that fired post-deploy failed at `docker createContainer`, the worker never started, and the run was lost.

Two fixes considered:

- Change the BullMQ jobId format upstream in `queue.ts` to use a Docker-safe separator. Risk: jobIds appear in logs / dedup keys / metrics; switching the canonical form widens the blast radius.
- Sanitize at the spawn site only. Lower risk: the original jobId stays intact for logs and dedup; only the Docker-name derivation gets normalized.

Chose the second. Replace any character not in `[a-zA-Z0-9_.-]` with `_`. The leading text is `cascade-worker-...`, so the Docker-required leading `[a-zA-Z0-9]` is already satisfied.

Tests:

- New regression pin: jobId `coalesce:ucho:MNG-413` → container name `cascade-worker-coalesce_ucho_MNG-413` (the live failing case).
- New positive pin: ordinary jobIds (alphanumeric + dashes/dots/underscores) pass through unchanged.

Hotfix scope. Going straight to dev with a tight diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Cascade Bot <bot@cascade.dev>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
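The sanitization rule is small enough to sketch in full; the function name here is a hypothetical stand-in for the spawn-site derivation inside `spawnWorker()`.

```typescript
// Sketch of the spawn-site fix: normalize only the Docker-name derivation,
// leaving the original jobId untouched for logs and dedup keys.
function workerContainerName(jobId: string): string {
  // Docker container names must match [a-zA-Z0-9][a-zA-Z0-9_.-]*; the
  // "cascade-worker-" prefix already satisfies the leading-char rule.
  return `cascade-worker-${jobId.replace(/[^a-zA-Z0-9_.-]/g, "_")}`;
}
```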
Summary
🤖 Generated with Claude Code