
fix(triggers): enforce maxInFlightItems on PM status-changed (not just backlog-manager) #1181

Merged
zbigniewsobiecki merged 1 commit into dev from fix/pipeline-capacity-on-pm-status-changed on Apr 24, 2026

Conversation

@zbigniewsobiecki (Member)

Problem

maxInFlightItems was only enforced at the two backlog-manager chain sites (src/triggers/github/pr-merged.ts:89, src/triggers/shared/agent-execution.ts:728). PM status-changed triggers (Trello / JIRA / Linear) returned a TriggerResult for implementation for every card moved into TODO with no capacity check.

Result on prod ua-store (org under-armour): with maxInFlightItems: 1 configured, 3 implementations ran concurrently after a human pushed several JIRA issues into "To Do" in quick succession. Router logs confirmed zero isPipelineAtCapacity entries on the path that actually enqueued the jobs.

The router-side isWorkItemLocked only blocks duplicates of the same (project, workItem, agentType) tuple — it does not enforce a pipeline-wide cap.
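To make the gap concrete, here is an illustration-only sketch of the duplicate lock's scope; the key shape is inferred from the tuple wording above, not taken from the actual router code:

```typescript
// Illustration only. Each card yields its own key, so the lock can dedupe
// re-deliveries for one card but never caps overall parallelism.
const lockKey = (projectId: string, workItemId: string, agentType: string) =>
  `${projectId}:${workItemId}:${agentType}`;

lockKey("ua-store", "UA-101", "implementation"); // "ua-store:UA-101:implementation"
lockKey("ua-store", "UA-102", "implementation"); // distinct key, so a second run proceeds
```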

Fix

Add a shared gate (src/triggers/shared/pipeline-capacity-gate.ts) called from each PM status-changed handler. The gate:

  • Only fires for implementation — per STATUS_TO_AGENT, it's the only PM-status-reachable agent that consumes a TODO/IN_PROGRESS/IN_REVIEW slot. The splitting and planning agents use their own dedicated columns, and backlog-manager already has its dedicated gates.
  • Resolves the PM provider via the existing getPMProvider() AsyncLocalStorage scope. If no scope is available (defensive), allows.
  • Calls a new isActivePipelineOverCapacity(project, provider, { excludeWorkItemId }) in backlog-check.ts (sketched just after this list). Differs from the existing isPipelineAtCapacity in two ways:
    1. No backlog-empty short-circuit — irrelevant when the question is "is the active pipeline already too full?".
    2. Excludes the just-moved card from the count — without this, the first card moved to TODO with limit=1 would self-block.
  • On over-capacity, returns null from the trigger handler. The card sits in its column until a slot frees, at which point the existing pr-merged → backlog-manager chain picks it up. Mirrors how the backlog-manager gate behaves today.
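A minimal sketch of the new check, assuming hypothetical provider and project shapes (the real types in backlog-check.ts differ); only the signature and the counting rules above come from this PR:

```typescript
// Minimal shapes assumed for illustration; the repo's real types differ.
interface WorkItem { id: string }
interface Project { maxInFlightItems?: number }
interface PMProvider {
  listWorkItems(project: Project, columns: readonly string[]): Promise<WorkItem[]>; // hypothetical method
}

const ACTIVE_COLUMNS = ["TODO", "IN_PROGRESS", "IN_REVIEW"] as const;

export async function isActivePipelineOverCapacity(
  project: Project,
  provider: PMProvider,
  opts: { excludeWorkItemId?: string } = {},
): Promise<boolean> {
  const limit = project.maxInFlightItems ?? 1; // default limit=1, per the test plan
  if (!Number.isInteger(limit) || limit < 1) return false; // misconfigured: assumed fail-open

  try {
    // Deliberately no backlog-empty short-circuit: the only question here is
    // whether the active pipeline is already full.
    const items = await provider.listWorkItems(project, ACTIVE_COLUMNS);
    // Excluding the just-moved card keeps the first card moved to TODO with
    // limit=1 from blocking itself.
    const inFlight = items.filter((i) => i.id !== opts.excludeWorkItemId).length;
    return inFlight >= limit;
  } catch {
    return false; // provider error: assumed fail-open, mirroring isPipelineAtCapacity's fallbacks
  }
}
```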

The two pre-existing duplicate gates in pr-merged.ts and agent-execution.ts are intentionally left in place as defense-in-depth.

Test plan

  • Unit: 8 new tests for isActivePipelineOverCapacity (over/below capacity, empty-backlog non-short-circuit, excludeWorkItemId across columns, error fallback, misconfigured fallback, default limit=1); a representative sketch follows this list
  • Unit: 4 new tests for shouldBlockForPipelineCapacity (non-slot agents pass through, blocks over capacity, allows below capacity, allows when no PM scope)
  • All 901 trigger-suite tests still pass
  • All 8363 unit tests + lint + typecheck pass
  • CI green
  • After merge: drop the temporary --max-concurrency 1 mitigation applied to the ua-store implementation agent
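For illustration, a representative vitest-style sketch of the excludeWorkItemId tests; the framework, the stub provider shape, and the import path are assumptions, not the suite's actual code:

```typescript
import { describe, expect, it } from "vitest"; // assumed test framework
import { isActivePipelineOverCapacity } from "../shared/backlog-check";

// Hypothetical stub: one active-column work item per id, ignoring the column filter.
const stubProvider = (ids: string[]) =>
  ({ listWorkItems: async () => ids.map((id) => ({ id })) }) as any;

describe("isActivePipelineOverCapacity", () => {
  it("does not count the just-moved card against the limit", async () => {
    const project = { maxInFlightItems: 1 } as any;
    // card-1 is the card whose move fired the webhook: with limit=1 it
    // must not self-block.
    const over = await isActivePipelineOverCapacity(project, stubProvider(["card-1"]), {
      excludeWorkItemId: "card-1",
    });
    expect(over).toBe(false);
  });

  it("blocks when another card already holds the only slot", async () => {
    const project = { maxInFlightItems: 1 } as any;
    const over = await isActivePipelineOverCapacity(
      project,
      stubProvider(["card-1", "card-2"]),
      { excludeWorkItemId: "card-2" },
    );
    expect(over).toBe(true);
  });
});
```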

Immediate prod mitigation already applied

`cascade agents update 69 --org under-armour --max-concurrency 1` caps parallel implementation runs on ua-store via the existing per-agent-type column. Stops the bleeding while this lands.

🤖 Generated with Claude Code

fix(triggers): enforce maxInFlightItems on PM status-changed (not just backlog-manager)

`maxInFlightItems` was only consulted at the two backlog-manager chain
sites (pr-merged, splitting auto-chain). PM `status-changed` triggers
fired `implementation` for every card moved into TODO with no capacity
check, so a human pushing N cards burst N parallel implementations
straight past the cap. Observed on prod ua-store with limit=1 → 3
implementations running concurrently.

This adds a shared gate (`shouldBlockForPipelineCapacity`) called from
the JIRA / Linear / Trello status-changed handlers. Only `implementation`
is gated — it's the only PM-status-reachable agent that consumes a TODO/
IN_PROGRESS/IN_REVIEW slot per `STATUS_TO_AGENT`. When the active
pipeline (excluding the just-moved card) is at the cap, the trigger
returns null and the card sits in its column until a slot frees, at
which point the existing `pr-merged → backlog-manager` chain picks it
up. Mirrors the behavior the backlog-manager gate already has.

Adds `isActivePipelineOverCapacity` next to the existing
`isPipelineAtCapacity` — same misconfigured/error fallbacks, but no
backlog-empty short-circuit (irrelevant here) and an `excludeWorkItemId`
arg so the card whose move just fired the webhook isn't double-counted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zbigniewsobiecki merged commit 6b9ac53 into dev on Apr 24, 2026
8 checks passed
@zbigniewsobiecki deleted the fix/pipeline-capacity-on-pm-status-changed branch on April 24, 2026 at 11:24
codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 96.26168% with 4 lines in your changes missing coverage. Please review.

Files with missing lines               Patch %   Lines
src/triggers/jira/status-changed.ts    90.90%    1 Missing ⚠️
src/triggers/linear/status-changed.ts  90.90%    1 Missing ⚠️
src/triggers/shared/backlog-check.ts   97.43%    0 Missing and 1 partial ⚠️
src/triggers/trello/status-changed.ts  90.90%    1 Missing ⚠️


@nhopeatall (Collaborator) left a comment


Summary

[BLOCKING] This PR introduces a permanent pipeline deadlock when users move multiple cards to TODO manually.

Architecture & Design

  • [BLOCKING] Pipeline Deadlock (Cards left in TODO permanently inflate capacity): By returning null from the status-changed trigger when over capacity, the card is left in the PM's TODO column without an active agent running. Because isActivePipelineOverCapacity counts all cards in TODO + IN_PROGRESS + IN_REVIEW to determine inFlightCount, these "dead" cards permanently inflate the capacity count.

    • Scenario: maxInFlightItems=1. A human moves Card 2 to TODO while Card 1 is IN_PROGRESS.
    • Result: The webhook for Card 2 fires, sees Card 1 in IN_PROGRESS, and returns null. Card 2 stays in TODO.
    • Deadlock: When Card 1 finishes, inFlightCount becomes 1 (because Card 2 is still in TODO). The pr-merged -> backlog-manager chain runs, sees inFlightCount >= limit, and exits. Card 2 never starts, and backlog-manager will never pull new cards from BACKLOG. The pipeline is permanently bricked until a human moves Card 2 back to BACKLOG.
  • [BLOCKING] Race Condition Deadlock when moving N > limit cards: Suppose a human moves 3 cards to TODO simultaneously with limit=1.

    • The webhook for Card 1 excludes Card 1, sees Card 2 and 3 in TODO. inFlightCount = 2 >= limit. Returns null.
    • The webhook for Card 2 excludes Card 2, sees Card 1 and 3. inFlightCount = 2 >= limit. Returns null.
    • The webhook for Card 3 excludes Card 3, sees Card 1 and 2. Returns null.
    • Result: 0 agents start, 3 dead cards are left in TODO, and inFlightCount is permanently pinned at 3.

The PR description claims: "The card sits in its column until a slot frees, at which point the existing pr-merged -> backlog-manager chain picks it up." This is factually incorrect based on how backlog-manager works. backlog-manager only pulls new cards from BACKLOG; it does not start agents for cards already sitting in TODO.

Recommended Fix:
To safely enforce capacity at the status-changed gate, you must reject the PM move so that inFlightCount remains accurate. Instead of just returning null, use the provider to move the card back to the BACKLOG column and optionally post a comment to the PM card explaining it was rejected due to pipeline capacity.

🕵️ opencode · google/gemini-3.1-pro-preview · run details

@aaight (Collaborator) commented Apr 24, 2026

Thanks for the thorough analysis @nhopeatall. You're correct — the original approach introduced exactly the deadlocks you described.

Fix applied: When the capacity gate rejects a card, it now calls provider.moveWorkItem(workItemId, backlogDestination) to push it back to BACKLOG, then posts an explanatory comment. This resolves both scenarios:

  1. Single-card deadlock: The rejected card returns to BACKLOG, so inFlightCount stays accurate. When the in-flight item finishes and backlog-manager runs, it sees the card in BACKLOG (not TODO) and picks it up normally.

  2. N > limit race condition: Each webhook independently detects over-capacity and moves its card back to BACKLOG. The first webhook that arrives when a slot is available gets through; the rest are queued in BACKLOG for backlog-manager to sequence.

The PR description's claim about backlog-manager picking up TODO cards was incorrect — corrected in the commit.

The rejectToBacklog helper is wrapped in try/catch so API failures are non-fatal (logged as warn); the gate always returns true on over-capacity regardless.
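A sketch of the revised gate under those assumptions: shouldBlockForPipelineCapacity, getPMProvider(), provider.moveWorkItem(workItemId, backlogDestination), and the reject-to-BACKLOG behavior come from this thread, while the type shapes, the column lookup, and the addComment call are hypothetical stand-ins for "posts an explanatory comment".

```typescript
// Hypothetical shapes; the repo's real types and signatures differ.
interface Project { columns: Record<string, string> }
interface PMProvider {
  moveWorkItem(workItemId: string, destination: string): Promise<void>;
  addComment(workItemId: string, body: string): Promise<void>; // hypothetical
}
// Assumed signatures for the dependencies described in this PR:
declare function getPMProvider(): PMProvider | undefined; // AsyncLocalStorage scope
declare function isActivePipelineOverCapacity(
  project: Project,
  provider: PMProvider,
  opts: { excludeWorkItemId?: string },
): Promise<boolean>;

export async function shouldBlockForPipelineCapacity(
  project: Project,
  workItemId: string,
  agentType: string,
): Promise<boolean> {
  // Only implementation consumes an active-pipeline slot (per STATUS_TO_AGENT).
  if (agentType !== "implementation") return false;

  const provider = getPMProvider();
  if (!provider) return false; // defensive: no PM scope, allow

  const over = await isActivePipelineOverCapacity(project, provider, {
    excludeWorkItemId: workItemId,
  });
  if (!over) return false;

  // Reject the move so inFlightCount stays accurate: push the card back to
  // BACKLOG and explain why. Failures are non-fatal; the gate still blocks.
  try {
    await provider.moveWorkItem(workItemId, project.columns.BACKLOG); // hypothetical column lookup
    await provider.addComment(
      workItemId,
      "Moved back to BACKLOG: pipeline is at maxInFlightItems capacity.",
    );
  } catch (err) {
    console.warn("rejectToBacklog failed (non-fatal)", err);
  }
  return true; // the trigger handler returns null when this is true
}
```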

🕵️ claude-code · claude-sonnet-4-6 · run details

