Skip to content

Merge dev → main: spec 017 router silent-failure hardening + alerting agent improvements#1225

Merged
zbigniewsobiecki merged 19 commits intomainfrom
dev
Apr 29, 2026
Merged

Merge dev → main: spec 017 router silent-failure hardening + alerting agent improvements#1225
zbigniewsobiecki merged 19 commits intomainfrom
dev

Conversation

@zbigniewsobiecki
Copy link
Copy Markdown
Member

Summary

Promotes two bundles from dev to main:

Spec 017 — Router-side silent-failure hardening (3 plans)

Three production hardening fixes derived from the 2026-04-29 24h log/webhook audit, each shipped as its own commit/PR per spec strategic decision #8:

  • fix(triggers): suppress redundant progress-comment DELETE after gadget cleanup #1222 — Plan 3: progress-comment double-deleteWARN: Failed to delete progress comment after agent success log spam (72/day). Adds initialCommentIdConsumed flag on session state so the post-agent cleanup hook skips when an in-run gadget already deleted the comment. githubClient.deletePRComment also treats HTTP 404 as success (defense in depth).
  • fix(router): consolidate PM-ack dispatch via manifest registry (Linear coverage) #1223 — Plan 1: PM-ack dispatch coverageWARN: Unknown PM type for PM-focused agent ack, skipping (24/day, all from Linear-based projects). Consolidates the two duplicate ack-dispatch helpers into one path that consumes the manifest registry directly. No per-PM-type literal branching anywhere on the dispatch surface. Conformance harness gains a per-provider dispatch coverage assertion.
  • fix(router): wrap PM-source dispatch in PM-provider scope so capacity gate enforces maxInFlightItems #1224 — Plan 2: pipeline-capacity gate PM-provider scopeWARN: pipeline-capacity-gate: PM provider unavailable, allowing run (32/day). Wraps every PM router adapter's dispatch in PM-provider AsyncLocalStorage scope (mirroring the GitHub adapter); flips the gate from fail-open WARN to fail-closed ERROR + Sentry capture. Restores maxInFlightItems enforcement for PM status-changed triggers.

Three new Sentry tags on cascade-router: pm_ack_unknown_pm_type, pipeline_capacity_gate_no_pm_provider. (Plan 3 logs at DEBUG, no new tag.)

CLAUDE.md gains a "Capacity-gate invariant" passage in Architecture; src/integrations/README.md gains a "PM-ack dispatch coverage invariant" subsection in the conformance harness section.

Alerting agent improvements (already on dev)

Test plan

🤖 Generated with Claude Code

Cascade Bot and others added 19 commits April 27, 2026 11:39
…tainer picker

Replace the non-existent `containers` capability (which caused NOT_IMPLEMENTED
errors for every provider) with provider-specific mappings: Trello→`boards`,
JIRA→`projects`, Linear→`teams`. Disable the Fetch button with an explanatory
tooltip for unknown providers.

Also simplify the `alertingResultsContainerId ?? undefined` ternary to a plain
`??` expression in promptContext.ts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-pr-comment runs

Both GitHub triggers set `workItemId` at the top level of their TriggerResult
but omitted it from `agentInput`. tryCreateRun reads workItemId from
agentInput, so 0/103 respond-to-review and 0/9 respond-to-pr-comment runs
had a non-null `agent_runs.work_item_id` — hiding every one of them from
the dashboard's work-item page (which filters by that column). Confirmed
on 2026-04-29 via `cascade runs list --project ucho`: four respond-to-review
runs for ucho/MNG-400 (PR #136) on 2026-04-28 were invisible despite firing
successfully.

The runtime safety net in `runAgentExecutionPipeline` was supposed to catch
this but only patched when re-resolution returned a value DIFFERENT from
`result.workItemId`. Since both triggers populated `result.workItemId`
correctly, the safety net no-op'd.

Three changes:

1. Add `workItemId` (shorthand) to the `agentInput` literal in
   `pr-review-submitted.ts:82` and `pr-comment-mention.ts:143`.

2. Widen the runtime patch in `agent-execution.ts:498` so it triggers on
   `agentInput.workItemId !== resolvedWorkItemId` rather than on
   `result.workItemId !== resolvedWorkItemId`. Now any future trigger that
   forgets the field still gets repaired before it reaches the DB.

3. Add `tests/unit/triggers/trigger-work-item-id-consistency.test.ts` —
   a static-analysis guard modeled on `trigger-event-consistency.test.ts`.
   Walks every TriggerHandler file via the TypeScript AST and asserts that
   any return literal with a top-level `workItemId` and a real `agentType`
   also includes `workItemId` inside `agentInput`. Fires loudly with
   `file:line` when violated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iew-work-item-id-missing

fix(triggers): persist workItemId on respond-to-review and respond-to-pr-comment runs
… level

- JIRA: switch from `projects` to `states` capability (fetches workflow
  status names, which is what backlogListId expects); pass configured
  projectKey as args.containerId; save item.name not numeric state ID
- Trello: disable the dropdown picker entirely — no `lists` discovery
  capability exists in the manifest; users must enter the list ID manually
- Linear: keep `teams` capability (team ID is already the correct value)

Extract ContainerInput helper component to keep AlertingTab complexity
below the project's cognitive-complexity limit. Pass pmConfig from
integration-form.tsx so JIRA states can be scoped to the right project.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three production hardening fixes derived from the 2026-04-29 24h log/webhook
audit. Discovered alongside PR #1220 (workItemId-on-respond-to-* fix).

Failure modes:
  A) Linear PM-ack silently skipped on PM-focused agents (24/day)
  B) Pipeline-capacity gate fails open on every PM status-changed (32/day)
  C) Progress-comment double-delete race produces 404 log spam (72/day)

Decomposed per spec strategic decision #8 into three independent plans
(none blocks another):
  1-pm-ack-coverage  (consolidate dispatch via manifest registry)
  2-capacity-gate-pm-scope  (shared adapter helper, fail-closed semantics)
  3-progress-comment-double-delete  (consumed-flag, gated fallback, 404-DEBUG)

This commit ships scaffolding only. Plan 3's implementation lands in this
branch; plans 1 and 2 follow as their own PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t mid-run cleanup

Plan 017/3 (progress-comment-double-delete). Closes failure mode C from spec
017's 24h log audit on 2026-04-29.

Root cause: deleteProgressCommentOnSuccess (post-agent hook) read
sessionState.initialCommentId, fell back to result.agentInput.ackCommentId
when session state was empty, and issued a DELETE. But "session state cleared
by an in-run gadget" was indistinguishable from "session state never
populated", so the fallback fired and re-deleted comments that gadgets had
already disposed of mid-run. GitHub returned 404 and produced 72 WARN entries
per day on cascade-router (no functional impact — the comment WAS deleted —
but the noise dominated WARN volume and obscured real failures).

Three changes:

1. New `initialCommentIdConsumed: boolean` flag on SessionStateData (default
   false). Both `deleteInitialComment` (gadget-driven) and
   `clearInitialComment` (sidecar-driven) set it to true after disposing of
   the comment. Distinguishes "had a comment, now gone" from "never had one".

2. `deleteProgressCommentOnSuccess` reads the flag first; if true, skips the
   entire deletion path including the legacy `agentInput.ackCommentId`
   fallback. The legacy fallback continues to work for paths that never
   populated session state (consumed stays false).

3. `githubClient.deletePRComment` treats HTTP 404 as success-equivalent under
   RFC-7231 idempotency: returns without throwing, logs at DEBUG with the
   comment id. Other HTTP errors (5xx, 401, network) continue to throw. Defense
   in depth: even if a future regression of similar shape recurs, the WARN
   volume stays clean while a DEBUG breadcrumb persists for audit.

Test coverage: +7 tests in sessionState (new `initialCommentIdConsumed flag`
describe block), +3 in ack-comments (consumed-bypass + headline regression
pin + legacy-fallback preservation), +4 in client (404 → DEBUG + 5xx/401/network
rejection regression pins). Full suite: 470 files / 8649 tests passing
(+14 from baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nt-double-delete

fix(triggers): suppress redundant progress-comment DELETE after gadget cleanup
…re Linear coverage

Plan 017/1 (pm-ack-coverage). Closes failure mode A from spec 017's 24h log
audit on 2026-04-29: PM-focused agents (e.g. backlog-manager) triggered from
a GitHub webhook against Linear-based projects silently skipped their PM-side
ack. The router-adapter's local `postPMAck` helper had `if (pmType === 'trello')`
/ `if (pmType === 'jira')` literal branches but no Linear case, so control fell
through to `WARN: Unknown PM type for PM-focused agent ack, skipping` — 24×
per day on prod cascade-router, all from `ucho`. A near-identical helper at
`src/triggers/shared/pm-ack.ts:postPMAckComment` had the Linear branch, so
this was pure parallel-path drift between two helpers that should have been
one (same shape as PR #1220).

Three changes:

1. New consolidated helper `dispatchPMAck` at `src/router/pm-ack-dispatch.ts`.
   Indexes the manifest registry directly via
   `getPMProvider(pmType).platformClientFactory(projectId).postComment(...)`.
   Zero per-PM-type literal branching anywhere on the dispatch surface.
   Adding a future PM provider lands the dispatch path for free.

2. Both legacy call sites delegate:
   - `src/router/adapters/github.ts:postPMAck` (the buggy one)
   - `src/triggers/shared/pm-ack.ts:postPMAckComment` (had Linear; preserves
     the existing `string | null` return contract via String() normalization)

3. The "Unknown PM type" branch converts from silent WARN+skip to ERROR-level
   log + Sentry capture under stable tag `pm_ack_unknown_pm_type`. Mirrors
   the spec-015 `wedged_lock_canary` precedent. Once the consolidation ships,
   hitting that branch represents a real configuration error (project pinned
   to a deleted provider), not steady-state noise.

Regression nets:

- Per-provider conformance assertion: the existing PM manifest harness gains
  one new `it()` inside its `describe.each` block —
  `dispatchPMAck reaches this provider without throwing`. Adding a future
  provider whose `platformClientFactory` is misconfigured fails CI loudly.

- Static guard against future literal-branching drift:
  `tests/unit/router/pm-ack-dispatch.test.ts` reads each of the three call
  sites' source and asserts no `pmType === 'trello' | 'jira' | 'linear'`
  patterns appear within their bodies. Modeled on
  `trigger-event-consistency.test.ts`. A future maintainer who hand-codes a
  branch fails this guard with a precise file:line citation.

Test coverage: +6 tests on `dispatchPMAck` (Trello/JIRA/Linear happy paths,
null-from-postComment, unknown-pmType Sentry path, undefined-pmType same),
+4 on `postPMAckComment` delegation contract, +1 per-provider assertion in
the conformance harness (4 currently-registered providers including the test
fixture), +1 Linear regression pin on the github router adapter's `postAck`
public surface, plus 2 existing github adapter tests migrated from mocking
`postTrelloAck` to mocking `dispatchPMAck` (the dependency direction
changed). Full suite: 472 files / 8668 tests (+19 from baseline).

Doc updates:
- `src/integrations/README.md` gains a "PM-ack dispatch coverage invariant"
  subsection in the conformance harness section, documenting the no-literal-
  branching rule, the Sentry-tag escalation, and pointing at both regression
  nets.
- `CHANGELOG.md` entry under spec 017.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(router): consolidate PM-ack dispatch via manifest registry (Linear coverage)
… gate enforces maxInFlightItems

Plan 017/2 (capacity-gate-pm-scope). Closes failure mode B from spec 017's
24h log audit on 2026-04-29.

Root cause: the pipeline-capacity gate at
`src/triggers/shared/pipeline-capacity-gate.ts:33-45` calls `getPMProvider()`
to count in-flight items. For every PM `status-changed` trigger this threw
`No PMProvider in scope` because the three PM router adapters
(`src/router/adapters/{linear,trello,jira}.ts`) wrapped trigger dispatch in
their per-PM-type credential AsyncLocalStorage scope (`withLinearCredentials`
/ `withTrelloCredentials` / `withJiraCredentials`) but NOT in PM-provider
scope. The GitHub adapter at `src/router/adapters/github.ts:280` already had
both wrappings — the PM router adapters were the outliers.

The gate caught the throw and fell through to its conservative branch:
`WARN: pipeline-capacity-gate: PM provider unavailable, allowing run` +
`return false` (allow). Net effect: `maxInFlightItems` was silently no-op for
the entire PM-source path. 32 occurrences/day on prod cascade-router. The
gate's whole purpose is preventing the multi-card-into-TODO incident class
documented in the file header — and the protection had been disabled since
the gate was first introduced.

Three changes:

1. New shared helper `withPMScopeForDispatch(project, dispatch)` at
   `src/router/adapters/_shared.ts`. Resolves the PM provider via
   `createPMProvider(project)` and wraps `dispatch` in `withPMProvider`.
   Mirrors the GitHub adapter's correct shape.

2. The three PM router adapters consume the helper inside
   `dispatchWithCredentials`, layering PM-provider scope on top of their
   existing per-PM-type credential scope.

3. The capacity gate's "PM provider unavailable" branch flips from
   `WARN + return false` (allow) to ERROR + Sentry capture under stable tag
   `pipeline_capacity_gate_no_pm_provider` + `return true` (block). Once the
   routine path establishes scope (changes 1+2 above), hitting that branch
   represents a real AsyncLocalStorage scope leak that operators need to
   investigate. Failing closed is preferable to silently failing open and
   re-introducing the original incident class.

Regression nets:

- `tests/unit/integrations/pm-router-adapter-pm-scope.test.ts`: static guard
  reading each PM router adapter source and asserting it contains either
  `withPMScopeForDispatch` or `withPMProvider`. A future PM router adapter
  added without the wrapping fails this guard with a precise file path.
  Modeled on `trigger-event-consistency.test.ts`.

- `tests/unit/router/adapters/with-pm-scope-for-dispatch.test.ts`: 4 tests
  covering happy path (PM provider resolves inside the wrapped callback),
  return-value passthrough, null dispatch result, and error propagation.

- `pipeline-capacity-gate.test.ts`: existing "allows when no PM provider"
  test inverted to "FAILS CLOSED (blocks) when no PM provider scope" (the
  headline behavioral change), plus 2 new positive-path regression pins.

Side effects on existing trigger tests: three trigger-handler test files
(`linear-status-changed.test.ts`, `jira-status-changed.test.ts`,
`status-changed.test.ts`) call `shouldBlockForPipelineCapacity` indirectly
via the trigger handlers. After the fail-closed flip, those handlers were
blocked because the unit tests don't establish PM-provider scope. Each test
file gains a `vi.mock('.../pipeline-capacity-gate.js')` passthrough so
trigger-logic assertions still run — the gate's behavior is tested in its
own dedicated test file. Three PM router adapter test files
(`linear.test.ts`, `trello.test.ts`, `jira.test.ts`) gain a similar
passthrough mock for `_shared.js:withPMScopeForDispatch` so they don't pull
the real PM manifest registry.

Doc updates:
- `CLAUDE.md` Architecture section gains a "Capacity-gate invariant"
  paragraph documenting the wrapping requirement, the fail-closed policy,
  the Sentry tag, and the static guard.
- `CHANGELOG.md` entry under spec 017.

Full suite: 474 files / 8677 tests passing (+9 from Plan 1's baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…esults-container-id

feat(alerting): add configurable PM results list for alerting agent + enable UX
…MProvider scope + plan 2 polish

CI integration-tests caught a regression I missed locally because lefthook's
pre-push only runs `npm run test:fast` (unit-only), not the full integration
suite. Four integration tests asserted `result?.agentType === 'implementation'`
but received `undefined` after plan 2's fail-closed flip, because they call
real trigger `.handle(ctx)` directly without establishing PM-provider scope —
the same class of breakage I fixed in unit tests with vi.mock passthroughs.

Failing tests:
  - tests/integration/trigger-registry.test.ts:323 (TrelloStatusChangedTodoTrigger.handle)
  - tests/integration/trigger-registry.test.ts:591 (registry.dispatch with implementation trigger)
  - tests/integration/pm-provider-switching.test.ts:258 (Trello implementation dispatch)
  - tests/integration/pm-provider-switching.test.ts:293 (JIRA implementation dispatch)

Integration tests don't mock the gate (by design), so the right fix is to
mirror production wrapping: each call site now wraps in
`withPMProvider(createPMProvider(project), () => trigger.handle(ctx))` —
which is exactly what `withPMScopeForDispatch` does in the PM router adapters.

The synthetic in-test trigger at trigger-registry.test.ts:117 doesn't call
the gate (it returns a hard-coded TriggerResult), so no scope wrap needed
there. Same for any non-implementation agent (planning / splitting / etc.)
since `SLOT_CONSUMING_AGENTS` only contains 'implementation'.

Also applies polish from local-review on the spec 017 work:

1. Reworded stale comment in `src/gadgets/sessionState.ts:186` — the comment
   claimed clearing `initialCommentId` made the post-agent callback "see
   null and short-circuit", but post-plan-3 the callback's actual gate is
   the `initialCommentIdConsumed` flag set on the line below. Comment now
   describes both the eager-clear AND the consumed-flag gate together.

2. Trimmed redundant 3-line wrapping comments to zero lines across the
   three PM router adapters (`linear.ts`, `trello.ts`, `jira.ts`). The
   shared helper's name (`withPMScopeForDispatch`) is self-documenting and
   CLAUDE.md's "Capacity-gate invariant" paragraph carries the rationale.
   Per CLAUDE.md "Default to writing no comments" guidance.

3. Tightened conformance harness assertion at `pm-conformance.test.ts`'s
   `dispatchPMAck reaches this provider without throwing` test. Previously
   the only `expect(...)` was conditional (`if (result !== undefined)`), so
   when result was undefined the test passed with zero assertions executed
   — a "tests that would still pass if the code under test were deleted"
   anti-pattern. Now uses `expect.assertions(1)` and asserts the result is
   either undefined OR has the expected AckResult shape, with a clear
   failure message naming the provider.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pm-scope

fix(router): wrap PM-source dispatch in PM-provider scope so capacity gate enforces maxInFlightItems
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 97.08738% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/router/adapters/github.ts 80.00% 1 Missing ⚠️
src/router/adapters/jira.ts 50.00% 1 Missing ⚠️
src/triggers/shared/pipeline-capacity-gate.ts 92.85% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@zbigniewsobiecki zbigniewsobiecki merged commit ec965f7 into main Apr 29, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant