Skip to content

fix(router): wrap PM-source dispatch in PM-provider scope so capacity gate enforces maxInFlightItems#1224

Merged
zbigniewsobiecki merged 4 commits intodevfrom
fix/capacity-gate-pm-scope
Apr 29, 2026
Merged

fix(router): wrap PM-source dispatch in PM-provider scope so capacity gate enforces maxInFlightItems#1224
zbigniewsobiecki merged 4 commits intodevfrom
fix/capacity-gate-pm-scope

Conversation

@zbigniewsobiecki
Copy link
Copy Markdown
Member

Summary

  • Plan 017/2 of spec 017 (router-side silent-failure hardening). Closes failure mode B from the 2026-04-29 log audit.
  • 32 silent skips/day on cascade-router (pipeline-capacity-gate: PM provider unavailable, allowing run) caused by the three PM router adapters wrapping trigger dispatch in their per-PM-type credential AsyncLocalStorage scope but NOT in PM-provider scope. The GitHub adapter at src/router/adapters/github.ts:280 already had both wrappings — the PM router adapters were the outliers.
  • Net effect before this fix: maxInFlightItems was silently no-op for the entire PM-source path. The very protection introduced after the prior multi-card-into-TODO incident was never actually enforced for Linear / Trello / JIRA status-changed triggers.
  • Three changes: (1) new shared helper withPMScopeForDispatch(project, dispatch) at src/router/adapters/_shared.ts mirroring the GitHub adapter's correct shape; (2) all three PM router adapters consume the helper; (3) the gate's "PM provider unavailable" branch flips from WARN + return false (allow) to ERROR + Sentry capture under stable tag pipeline_capacity_gate_no_pm_provider + return true (block). Once the routine path establishes scope, hitting that branch is a real anomaly, not steady-state noise.
  • Final PR for spec 017 — also marks docs/specs/017-router-silent-failure-hardening.md as .done in a trailing commit. All three plans (pm-ack-coverage, capacity-gate-pm-scope, progress-comment-double-delete) merged.

Regression nets

  • Static guard at tests/unit/integrations/pm-router-adapter-pm-scope.test.ts — reads each PM router adapter source and asserts it contains either withPMScopeForDispatch or withPMProvider. A future PM router adapter without the wrapping fails CI with a precise file path. Modeled on trigger-event-consistency.test.ts.
  • Helper coverage at tests/unit/router/adapters/with-pm-scope-for-dispatch.test.ts — 4 tests (happy path with getPMProvider() succeeding inside the wrapped callback, return passthrough, null result, error propagation).
  • Capacity-gate fail-closed test in pipeline-capacity-gate.test.ts — the existing "allows when no PM provider" test inverted to assert blocked === true + ERROR log + Sentry capture, plus 2 new positive-path regression pins so the over-capacity / below-capacity branches don't break during the migration.

Test plan

  • +4 tests on withPMScopeForDispatch.
  • +3 tests on the capacity gate's fail-closed semantics + positive paths.
  • +3 tests in the static guard (one per registered PM router adapter).
  • Three trigger-handler test files (linear-status-changed.test.ts, jira-status-changed.test.ts, status-changed.test.ts) gain vi.mock('.../pipeline-capacity-gate.js') passthrough — they were failing on the gate's new fail-closed semantics because they don't establish PM-provider scope. The gate's behavior is tested in its own dedicated test file.
  • Three PM router adapter test files (linear.test.ts, trello.test.ts, jira.test.ts) gain vi.mock('./_shared.js') passthrough — same reason; the helper is tested in its own dedicated test file.
  • npm test (full unit suite): 474 files / 8677 tests passing.
  • npm run typecheck clean.
  • npm run lint clean.
  • Pre-push lefthook (unit + integration) passed.
  • Post-deploy: a human moving N work items into TODO simultaneously against a project with maxInFlightItems: M produces exactly M implementation runs (verifies AC Improved performance and stability #4); 24h log volume of pipeline-capacity-gate: PM provider unavailable, allowing run drops to 0; new ERROR Sentry events under tag pipeline_capacity_gate_no_pm_provider represent real scope leaks worth investigating.

🤖 Generated with Claude Code

zbigniewsobiecki and others added 3 commits April 29, 2026 10:49
… gate enforces maxInFlightItems

Plan 017/2 (capacity-gate-pm-scope). Closes failure mode B from spec 017's
24h log audit on 2026-04-29.

Root cause: the pipeline-capacity gate at
`src/triggers/shared/pipeline-capacity-gate.ts:33-45` calls `getPMProvider()`
to count in-flight items. For every PM `status-changed` trigger this threw
`No PMProvider in scope` because the three PM router adapters
(`src/router/adapters/{linear,trello,jira}.ts`) wrapped trigger dispatch in
their per-PM-type credential AsyncLocalStorage scope (`withLinearCredentials`
/ `withTrelloCredentials` / `withJiraCredentials`) but NOT in PM-provider
scope. The GitHub adapter at `src/router/adapters/github.ts:280` already had
both wrappings — the PM router adapters were the outliers.

The gate caught the throw and fell through to its conservative branch:
`WARN: pipeline-capacity-gate: PM provider unavailable, allowing run` +
`return false` (allow). Net effect: `maxInFlightItems` was silently no-op for
the entire PM-source path. 32 occurrences/day on prod cascade-router. The
gate's whole purpose is preventing the multi-card-into-TODO incident class
documented in the file header — and the protection had been disabled since
the gate was first introduced.

Three changes:

1. New shared helper `withPMScopeForDispatch(project, dispatch)` at
   `src/router/adapters/_shared.ts`. Resolves the PM provider via
   `createPMProvider(project)` and wraps `dispatch` in `withPMProvider`.
   Mirrors the GitHub adapter's correct shape.

2. The three PM router adapters consume the helper inside
   `dispatchWithCredentials`, layering PM-provider scope on top of their
   existing per-PM-type credential scope.

3. The capacity gate's "PM provider unavailable" branch flips from
   `WARN + return false` (allow) to ERROR + Sentry capture under stable tag
   `pipeline_capacity_gate_no_pm_provider` + `return true` (block). Once the
   routine path establishes scope (changes 1+2 above), hitting that branch
   represents a real AsyncLocalStorage scope leak that operators need to
   investigate. Failing closed is preferable to silently failing open and
   re-introducing the original incident class.

Regression nets:

- `tests/unit/integrations/pm-router-adapter-pm-scope.test.ts`: static guard
  reading each PM router adapter source and asserting it contains either
  `withPMScopeForDispatch` or `withPMProvider`. A future PM router adapter
  added without the wrapping fails this guard with a precise file path.
  Modeled on `trigger-event-consistency.test.ts`.

- `tests/unit/router/adapters/with-pm-scope-for-dispatch.test.ts`: 4 tests
  covering happy path (PM provider resolves inside the wrapped callback),
  return-value passthrough, null dispatch result, and error propagation.

- `pipeline-capacity-gate.test.ts`: existing "allows when no PM provider"
  test inverted to "FAILS CLOSED (blocks) when no PM provider scope" (the
  headline behavioral change), plus 2 new positive-path regression pins.

Side effects on existing trigger tests: three trigger-handler test files
(`linear-status-changed.test.ts`, `jira-status-changed.test.ts`,
`status-changed.test.ts`) call `shouldBlockForPipelineCapacity` indirectly
via the trigger handlers. After the fail-closed flip, those handlers were
blocked because the unit tests don't establish PM-provider scope. Each test
file gains a `vi.mock('.../pipeline-capacity-gate.js')` passthrough so
trigger-logic assertions still run — the gate's behavior is tested in its
own dedicated test file. Three PM router adapter test files
(`linear.test.ts`, `trello.test.ts`, `jira.test.ts`) gain a similar
passthrough mock for `_shared.js:withPMScopeForDispatch` so they don't pull
the real PM manifest registry.

Doc updates:
- `CLAUDE.md` Architecture section gains a "Capacity-gate invariant"
  paragraph documenting the wrapping requirement, the fail-closed policy,
  the Sentry tag, and the static guard.
- `CHANGELOG.md` entry under spec 017.

Full suite: 474 files / 8677 tests passing (+9 from Plan 1's baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zbigniewsobiecki zbigniewsobiecki requested a review from aaight April 29, 2026 11:05
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 93.54839% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/router/adapters/jira.ts 50.00% 1 Missing ⚠️
src/triggers/shared/pipeline-capacity-gate.ts 92.85% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Collaborator

@nhopeatall nhopeatall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — targeted, well-evidenced fix for a real production silent failure (32 skip/day on the capacity gate PM-source path). The withPMScopeForDispatch helper correctly mirrors the GitHub adapter existing pattern (pmRegistry.createProvider then withPMProvider), all three PM adapters are updated consistently, the fail-closed flip is justified by the scope guarantee, and the static guard plus dedicated helper tests plus trigger-test passthrough mocks form a thorough regression net. The integration-tests CI failure is infrastructure-level (requires real Postgres) and unrelated to this change.

🕵️ claude-code · claude-opus-4-6 · run details

…MProvider scope + plan 2 polish

CI integration-tests caught a regression I missed locally because lefthook's
pre-push only runs `npm run test:fast` (unit-only), not the full integration
suite. Four integration tests asserted `result?.agentType === 'implementation'`
but received `undefined` after plan 2's fail-closed flip, because they call
real trigger `.handle(ctx)` directly without establishing PM-provider scope —
the same class of breakage I fixed in unit tests with vi.mock passthroughs.

Failing tests:
  - tests/integration/trigger-registry.test.ts:323 (TrelloStatusChangedTodoTrigger.handle)
  - tests/integration/trigger-registry.test.ts:591 (registry.dispatch with implementation trigger)
  - tests/integration/pm-provider-switching.test.ts:258 (Trello implementation dispatch)
  - tests/integration/pm-provider-switching.test.ts:293 (JIRA implementation dispatch)

Integration tests don't mock the gate (by design), so the right fix is to
mirror production wrapping: each call site now wraps in
`withPMProvider(createPMProvider(project), () => trigger.handle(ctx))` —
which is exactly what `withPMScopeForDispatch` does in the PM router adapters.

The synthetic in-test trigger at trigger-registry.test.ts:117 doesn't call
the gate (it returns a hard-coded TriggerResult), so no scope wrap needed
there. Same for any non-implementation agent (planning / splitting / etc.)
since `SLOT_CONSUMING_AGENTS` only contains 'implementation'.

Also applies polish from local-review on the spec 017 work:

1. Reworded stale comment in `src/gadgets/sessionState.ts:186` — the comment
   claimed clearing `initialCommentId` made the post-agent callback "see
   null and short-circuit", but post-plan-3 the callback's actual gate is
   the `initialCommentIdConsumed` flag set on the line below. Comment now
   describes both the eager-clear AND the consumed-flag gate together.

2. Trimmed redundant 3-line wrapping comments to zero lines across the
   three PM router adapters (`linear.ts`, `trello.ts`, `jira.ts`). The
   shared helper's name (`withPMScopeForDispatch`) is self-documenting and
   CLAUDE.md's "Capacity-gate invariant" paragraph carries the rationale.
   Per CLAUDE.md "Default to writing no comments" guidance.

3. Tightened conformance harness assertion at `pm-conformance.test.ts`'s
   `dispatchPMAck reaches this provider without throwing` test. Previously
   the only `expect(...)` was conditional (`if (result !== undefined)`), so
   when result was undefined the test passed with zero assertions executed
   — a "tests that would still pass if the code under test were deleted"
   anti-pattern. Now uses `expect.assertions(1)` and asserts the result is
   either undefined OR has the expected AckResult shape, with a clear
   failure message naming the provider.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zbigniewsobiecki zbigniewsobiecki merged commit 80421d4 into dev Apr 29, 2026
9 checks passed
@zbigniewsobiecki zbigniewsobiecki deleted the fix/capacity-gate-pm-scope branch April 29, 2026 11:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants