fix(router): wrap PM-source dispatch in PM-provider scope so capacity gate enforces maxInFlightItems#1224
Conversation
… gate enforces maxInFlightItems
Plan 017/2 (capacity-gate-pm-scope). Closes failure mode B from spec 017's
24h log audit on 2026-04-29.
Root cause: the pipeline-capacity gate at
`src/triggers/shared/pipeline-capacity-gate.ts:33-45` calls `getPMProvider()`
to count in-flight items. For every PM `status-changed` trigger this threw
`No PMProvider in scope` because the three PM router adapters
(`src/router/adapters/{linear,trello,jira}.ts`) wrapped trigger dispatch in
their per-PM-type credential AsyncLocalStorage scope (`withLinearCredentials`
/ `withTrelloCredentials` / `withJiraCredentials`) but NOT in PM-provider
scope. The GitHub adapter at `src/router/adapters/github.ts:280` already had
both wrappings — the PM router adapters were the outliers.
The gate caught the throw and fell through to its conservative branch:
`WARN: pipeline-capacity-gate: PM provider unavailable, allowing run` +
`return false` (allow). Net effect: `maxInFlightItems` was silently no-op for
the entire PM-source path. 32 occurrences/day on prod cascade-router. The
gate's whole purpose is preventing the multi-card-into-TODO incident class
documented in the file header — and the protection had been disabled since
the gate was first introduced.
Three changes:
1. New shared helper `withPMScopeForDispatch(project, dispatch)` at
`src/router/adapters/_shared.ts`. Resolves the PM provider via
`createPMProvider(project)` and wraps `dispatch` in `withPMProvider`.
Mirrors the GitHub adapter's correct shape.
2. The three PM router adapters consume the helper inside
`dispatchWithCredentials`, layering PM-provider scope on top of their
existing per-PM-type credential scope.
3. The capacity gate's "PM provider unavailable" branch flips from
`WARN + return false` (allow) to ERROR + Sentry capture under stable tag
`pipeline_capacity_gate_no_pm_provider` + `return true` (block). Once the
routine path establishes scope (changes 1+2 above), hitting that branch
represents a real AsyncLocalStorage scope leak that operators need to
investigate. Failing closed is preferable to silently failing open and
re-introducing the original incident class.
Regression nets:
- `tests/unit/integrations/pm-router-adapter-pm-scope.test.ts`: static guard
reading each PM router adapter source and asserting it contains either
`withPMScopeForDispatch` or `withPMProvider`. A future PM router adapter
added without the wrapping fails this guard with a precise file path.
Modeled on `trigger-event-consistency.test.ts`.
- `tests/unit/router/adapters/with-pm-scope-for-dispatch.test.ts`: 4 tests
covering happy path (PM provider resolves inside the wrapped callback),
return-value passthrough, null dispatch result, and error propagation.
- `pipeline-capacity-gate.test.ts`: existing "allows when no PM provider"
test inverted to "FAILS CLOSED (blocks) when no PM provider scope" (the
headline behavioral change), plus 2 new positive-path regression pins.
Side effects on existing trigger tests: three trigger-handler test files
(`linear-status-changed.test.ts`, `jira-status-changed.test.ts`,
`status-changed.test.ts`) call `shouldBlockForPipelineCapacity` indirectly
via the trigger handlers. After the fail-closed flip, those handlers were
blocked because the unit tests don't establish PM-provider scope. Each test
file gains a `vi.mock('.../pipeline-capacity-gate.js')` passthrough so
trigger-logic assertions still run — the gate's behavior is tested in its
own dedicated test file. Three PM router adapter test files
(`linear.test.ts`, `trello.test.ts`, `jira.test.ts`) gain a similar
passthrough mock for `_shared.js:withPMScopeForDispatch` so they don't pull
the real PM manifest registry.
Doc updates:
- `CLAUDE.md` Architecture section gains a "Capacity-gate invariant"
paragraph documenting the wrapping requirement, the fail-closed policy,
the Sentry tag, and the static guard.
- `CHANGELOG.md` entry under spec 017.
Full suite: 474 files / 8677 tests passing (+9 from Plan 1's baseline).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…failure hardening)
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
nhopeatall
left a comment
There was a problem hiding this comment.
LGTM — targeted, well-evidenced fix for a real production silent failure (32 skip/day on the capacity gate PM-source path). The withPMScopeForDispatch helper correctly mirrors the GitHub adapter existing pattern (pmRegistry.createProvider then withPMProvider), all three PM adapters are updated consistently, the fail-closed flip is justified by the scope guarantee, and the static guard plus dedicated helper tests plus trigger-test passthrough mocks form a thorough regression net. The integration-tests CI failure is infrastructure-level (requires real Postgres) and unrelated to this change.
🕵️ claude-code · claude-opus-4-6 · run details
…MProvider scope + plan 2 polish CI integration-tests caught a regression I missed locally because lefthook's pre-push only runs `npm run test:fast` (unit-only), not the full integration suite. Four integration tests asserted `result?.agentType === 'implementation'` but received `undefined` after plan 2's fail-closed flip, because they call real trigger `.handle(ctx)` directly without establishing PM-provider scope — the same class of breakage I fixed in unit tests with vi.mock passthroughs. Failing tests: - tests/integration/trigger-registry.test.ts:323 (TrelloStatusChangedTodoTrigger.handle) - tests/integration/trigger-registry.test.ts:591 (registry.dispatch with implementation trigger) - tests/integration/pm-provider-switching.test.ts:258 (Trello implementation dispatch) - tests/integration/pm-provider-switching.test.ts:293 (JIRA implementation dispatch) Integration tests don't mock the gate (by design), so the right fix is to mirror production wrapping: each call site now wraps in `withPMProvider(createPMProvider(project), () => trigger.handle(ctx))` — which is exactly what `withPMScopeForDispatch` does in the PM router adapters. The synthetic in-test trigger at trigger-registry.test.ts:117 doesn't call the gate (it returns a hard-coded TriggerResult), so no scope wrap needed there. Same for any non-implementation agent (planning / splitting / etc.) since `SLOT_CONSUMING_AGENTS` only contains 'implementation'. Also applies polish from local-review on the spec 017 work: 1. Reworded stale comment in `src/gadgets/sessionState.ts:186` — the comment claimed clearing `initialCommentId` made the post-agent callback "see null and short-circuit", but post-plan-3 the callback's actual gate is the `initialCommentIdConsumed` flag set on the line below. Comment now describes both the eager-clear AND the consumed-flag gate together. 2. Trimmed redundant 3-line wrapping comments to zero lines across the three PM router adapters (`linear.ts`, `trello.ts`, `jira.ts`). The shared helper's name (`withPMScopeForDispatch`) is self-documenting and CLAUDE.md's "Capacity-gate invariant" paragraph carries the rationale. Per CLAUDE.md "Default to writing no comments" guidance. 3. Tightened conformance harness assertion at `pm-conformance.test.ts`'s `dispatchPMAck reaches this provider without throwing` test. Previously the only `expect(...)` was conditional (`if (result !== undefined)`), so when result was undefined the test passed with zero assertions executed — a "tests that would still pass if the code under test were deleted" anti-pattern. Now uses `expect.assertions(1)` and asserts the result is either undefined OR has the expected AckResult shape, with a clear failure message naming the provider. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
pipeline-capacity-gate: PM provider unavailable, allowing run) caused by the three PM router adapters wrapping trigger dispatch in their per-PM-type credentialAsyncLocalStoragescope but NOT in PM-provider scope. The GitHub adapter atsrc/router/adapters/github.ts:280already had both wrappings — the PM router adapters were the outliers.maxInFlightItemswas silently no-op for the entire PM-source path. The very protection introduced after the prior multi-card-into-TODO incident was never actually enforced for Linear / Trello / JIRAstatus-changedtriggers.withPMScopeForDispatch(project, dispatch)atsrc/router/adapters/_shared.tsmirroring the GitHub adapter's correct shape; (2) all three PM router adapters consume the helper; (3) the gate's "PM provider unavailable" branch flips fromWARN + return false(allow) to ERROR + Sentry capture under stable tagpipeline_capacity_gate_no_pm_provider+return true(block). Once the routine path establishes scope, hitting that branch is a real anomaly, not steady-state noise.docs/specs/017-router-silent-failure-hardening.mdas.donein a trailing commit. All three plans (pm-ack-coverage,capacity-gate-pm-scope,progress-comment-double-delete) merged.Regression nets
tests/unit/integrations/pm-router-adapter-pm-scope.test.ts— reads each PM router adapter source and asserts it contains eitherwithPMScopeForDispatchorwithPMProvider. A future PM router adapter without the wrapping fails CI with a precise file path. Modeled ontrigger-event-consistency.test.ts.tests/unit/router/adapters/with-pm-scope-for-dispatch.test.ts— 4 tests (happy path withgetPMProvider()succeeding inside the wrapped callback, return passthrough, null result, error propagation).pipeline-capacity-gate.test.ts— the existing "allows when no PM provider" test inverted to assertblocked === true+ ERROR log + Sentry capture, plus 2 new positive-path regression pins so the over-capacity / below-capacity branches don't break during the migration.Test plan
withPMScopeForDispatch.linear-status-changed.test.ts,jira-status-changed.test.ts,status-changed.test.ts) gainvi.mock('.../pipeline-capacity-gate.js')passthrough — they were failing on the gate's new fail-closed semantics because they don't establish PM-provider scope. The gate's behavior is tested in its own dedicated test file.linear.test.ts,trello.test.ts,jira.test.ts) gainvi.mock('./_shared.js')passthrough — same reason; the helper is tested in its own dedicated test file.npm test(full unit suite): 474 files / 8677 tests passing.npm run typecheckclean.npm run lintclean.maxInFlightItems: Mproduces exactly M implementation runs (verifies AC Improved performance and stability #4); 24h log volume ofpipeline-capacity-gate: PM provider unavailable, allowing rundrops to 0; new ERROR Sentry events under tagpipeline_capacity_gate_no_pm_providerrepresent real scope leaks worth investigating.🤖 Generated with Claude Code