Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ All notable user-visible changes to CASCADE are documented here. The format is l

### Changed

- **Pipeline-capacity gate now enforces `maxInFlightItems` for PM `status-changed` triggers** (spec 017, plan 2 of 3). The gate at `src/triggers/shared/pipeline-capacity-gate.ts` is the hard cap on the active pipeline (TODO + IN_PROGRESS + IN_REVIEW work items) introduced after a prior incident where a human moved three cards into TODO simultaneously and three concurrent implementation runs fired against a project pinned to `maxInFlightItems: 1`. The gate calls `getPMProvider()` to count in-flight items, but for every PM `status-changed` trigger the call threw `No PMProvider in scope` because the three PM router adapters (`src/router/adapters/{linear,trello,jira}.ts`) wrapped trigger dispatch in their per-PM-type credential `AsyncLocalStorage` scope but NOT in PM-provider scope (the GitHub adapter at `src/router/adapters/github.ts:280` already had both wrappings). The gate fell through to its conservative branch (`WARN: pipeline-capacity-gate: PM provider unavailable, allowing run` and `return false`) — silently no-op for the only triggers that actually need it. 32 occurrences/day on cascade-router (verified 2026-04-29). The fix introduces a shared helper `withPMScopeForDispatch(project, dispatch)` at `src/router/adapters/_shared.ts` that the three PM router adapters consume, mirroring the GitHub adapter's correct shape. The gate's "PM provider unavailable" branch is converted from `WARN + return false` (allow) to ERROR-level + Sentry capture under stable tag `pipeline_capacity_gate_no_pm_provider` + `return true` (block) — once the routine path establishes scope, hitting that branch is a real `AsyncLocalStorage` scope leak operators need to investigate. A static-guard test at `tests/unit/integrations/pm-router-adapter-pm-scope.test.ts` enforces the wrapping invariant per adapter; CLAUDE.md gains a "Capacity-gate invariant" passage in the Architecture section. See [spec 017](docs/specs/017-router-silent-failure-hardening.md).
- **PM-ack dispatch consolidation: Linear-based PM-focused agents now post their PM-side ack comment** (spec 017, plan 1 of 3). PM-focused agents (e.g. `backlog-manager`) triggered from a GitHub webhook used to silently skip their PM-side ack on Linear projects: the router-adapter's local `postPMAck` helper had `if (pmType === 'trello')` / `if (pmType === 'jira')` branches but no Linear branch, so Linear-based projects fell through to a `WARN: Unknown PM type for PM-focused agent ack, skipping` and never saw the "🔧 On it" comment that Trello/JIRA projects got (24 silent skips per day on cascade-router, all from `ucho`, verified 2026-04-29). A near-identical helper at `src/triggers/shared/pm-ack.ts` already had the Linear branch — pure parallel-path drift. The fix introduces a single consolidated helper `dispatchPMAck` at `src/router/pm-ack-dispatch.ts` that indexes the manifest registry directly and invokes `manifest.platformClientFactory(projectId).postComment(...)` — no per-PM-type literal branching anywhere on the dispatch surface. Both legacy call sites delegate. The PM manifest conformance harness gains a per-provider `dispatchPMAck reaches this provider without throwing` assertion, and a static-guard test pins "no `pmType === '<literal>'` branching" against all three call sites; adding a future PM provider to the registry lands the dispatch path for free. Genuinely-unknown PM types (configuration error: project pinned to a deleted provider) now log at ERROR + capture to Sentry under stable tag `pm_ack_unknown_pm_type` instead of a silent WARN. See [spec 017](docs/specs/017-router-silent-failure-hardening.md).
- **Progress-comment lifecycle: post-agent cleanup hook now skips when an in-run gadget already deleted the comment** (spec 017, plan 3 of 3). The post-agent `deleteProgressCommentOnSuccess` hook used to read `sessionState.initialCommentId`, fall back to `result.agentInput.ackCommentId` when session state was empty, and issue a redundant DELETE — but "session state cleared by a gadget" was indistinguishable from "session state never populated", so the fallback fired and re-deleted comments that were already gone. GitHub returned 404 and `WARN: Failed to delete progress comment after agent success` was logged 72 times per day on cascade-router (live audit on 2026-04-29). Adds an explicit `initialCommentIdConsumed: boolean` flag on `SessionStateData`. Both `deleteInitialComment` (gadget-driven) and `clearInitialComment` (sidecar-driven) now set the flag to `true` after disposing of the comment. The post-agent hook checks the flag first and skips the entire deletion path — including the legacy `agentInput.ackCommentId` fallback — when consumed. As defense in depth, `githubClient.deletePRComment` now treats HTTP 404 as success (RFC-7231 idempotency) and logs at DEBUG instead of letting the error bubble as a WARN; other HTTP errors (5xx, 401, network) continue to throw. The legacy fallback to `agentInput.ackCommentId` continues to work for code paths that never populate session state. See [spec 017](docs/specs/017-router-silent-failure-hardening.md).
- **PM image delivery: Linear GraphQL fixture + extraction-coverage regression test** (spec 016, plan 3 of 3). Captures a reconstructed Linear `Issue` GraphQL payload at `tests/fixtures/linear-issue-with-screenshot.json` containing extension-less and extensioned inline-pasted images (description + comment bodies) plus formal Attachment records (Slack/GitHub/Sentry link previews) that must NOT be mistaken for inline images. The unit test at `tests/unit/pm/linear/extraction-coverage.test.ts` pins the contract and fails loudly with a specific URL-missing message if Linear ever changes its payload shape in a way that loses inline images. Documents the conclusion in `src/integrations/README.md`: `Issue.description` markdown is canonical for Linear inline images; `Issue.attachments` is the wrong surface (formal Attachment records, not pastes). No production code change — this plan ships the regression net for the contract Plans 1+2 established. See [spec 016](docs/specs/016-pm-image-delivery-reliability.md).
Expand Down
2 changes: 2 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,8 @@ Three separate services, **no monolithic server mode**:

Flow: `PM/SCM/alerting webhook → Router → Redis → Worker → TriggerRegistry → Agent → Code → PR`.

**Capacity-gate invariant.** Every PM router adapter (`src/router/adapters/{linear,trello,jira}.ts`) must wrap `triggerRegistry.dispatch(ctx)` in PM-provider `AsyncLocalStorage` scope via the shared `withPMScopeForDispatch(fullProject, dispatch)` helper at `src/router/adapters/_shared.ts` — in addition to the per-PM-type credential scope (`withLinearCredentials` / `withTrelloCredentials` / `withJiraCredentials`). Without the PM-provider wrapping, the pipeline-capacity gate at `src/triggers/shared/pipeline-capacity-gate.ts` cannot resolve `getPMProvider()`, **fails closed** under the spec-017 fail-closed policy (blocks the run + ERROR + Sentry capture under tag `pipeline_capacity_gate_no_pm_provider`), and `maxInFlightItems` is silently disabled for the PM-source path. Mirror the GitHub adapter's existing correct shape at `src/router/adapters/github.ts:dispatchWithCredentials`. The static guard at `tests/unit/integrations/pm-router-adapter-pm-scope.test.ts` enforces this at CI time — adding a new PM router adapter without the wrapping fails CI with a precise file path.

Integration abstraction lives in `src/integrations/`. For **adding a new PM provider**, see @src/integrations/README.md — PM providers (Trello, JIRA, Linear) use the `PMProviderManifest` registry with a **behavioral conformance harness** (spec 009 — config round-trip, discovery shape, full lifecycle scenario, auth-header provenance, single-entrypoint invariant). Each provider owns its Zod config schema (`src/integrations/pm/<provider>/config-schema.ts`) as the single source of truth — the central `src/config/schema.ts` imports it. PM adapter method signatures use branded `StateId` / `LabelId` / `ContainerId` from `src/pm/ids.ts` to make state-name-vs-ID confusion a compile error at direct-adapter call sites. All runtime surfaces (router, worker, CLI, dashboard) register integrations through a single entrypoint at `src/integrations/entrypoint.ts`. **Spec 010 follow-ups** added generic `pm.discovery.createLabel` / `createCustomField` mutation endpoints + `currentUser` discovery capability + real shared React components for every `StandardStepKind` under `web/src/components/projects/pm-providers/steps/`. **Spec 011** migrated all three production providers (Trello, JIRA, Linear) onto those shared components, added a 7th `StandardStepKind: custom-field-mapping`, widened `container-pick` / `project-scope` / `webhook-url-display` with optional props, and deleted the three legacy `pm-wizard-{trello,jira,linear}-steps.tsx` files. **Spec 012** migrated each provider's webhook UX (programmatic create for Trello/JIRA, signing-secret + instructions for Linear) into per-provider manifest webhook adapters (Fragment compositions around the shared `WebhookUrlDisplayStep`); deleted the legacy `WebhookStep` + `LinearWebhookInfoPanel` + `useWebhookManagement` + `useLinearWebhookInfo`. Every PM wizard step now renders via the manifest path without exception. A new PM provider writes zero edits to shared orchestration (`pm-wizard.tsx`, `pm-wizard-common-steps.tsx`, `pm-wizard-hooks.ts`); provider-specific UI ships either as `kind: 'custom'` steps or as Fragment compositions inside the provider folder's wizard adapters. SCM (GitHub) and alerting (Sentry) still use the legacy `IntegrationModule` pattern via self-registration in `src/github/register.ts` + `src/sentry/register.ts`. Don't improvise; the README covers both patterns.

## PR checkout (worker) — gotcha
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ plan_slug: capacity-gate-pm-scope
level: plan
parent_spec: docs/specs/017-router-silent-failure-hardening.md
depends_on: []
status: pending
status: done
---

# 017/2: Pipeline-capacity-gate PM-provider scope
Expand Down Expand Up @@ -210,14 +210,14 @@ n/a — all ACs auto-tested.
## Progress

<!-- /implement updates these as it works. Do not edit manually. -->
- [ ] AC #1
- [ ] AC #2
- [ ] AC #3
- [ ] AC #4
- [ ] AC #5
- [ ] AC #6
- [ ] AC #7
- [ ] AC #8
- [ ] AC #9
- [ ] AC #10
- [ ] AC #11
- [x] AC #1 — `withPMScopeForDispatch(project, dispatch)` exists at `src/router/adapters/_shared.ts`, resolves PM provider via `createPMProvider`, wraps in `withPMProvider` (verified by `tests/unit/router/adapters/with-pm-scope-for-dispatch.test.ts`'s 4 tests covering happy path, return passthrough, null result, error propagation).
- [x] AC #2 — Linear, Trello, JIRA router adapters each consume the shared helper inside `dispatchWithCredentials` (verified by per-adapter regression tests + the static guard).
- [x] AC #3 — `getPMProvider()` succeeds inside trigger handlers dispatched from any of the three PM router adapters (verified by the helper's positive-path test, which exercises `getPMProvider()` from inside the wrapped callback).
- [x] AC #4 — Capacity-gate fail-closed branch: ERROR log + Sentry tag `pipeline_capacity_gate_no_pm_provider` + `return true` (verified by `pipeline-capacity-gate.test.ts > FAILS CLOSED (blocks) when no PM provider scope is available`).
- [x] AC #5 — Positive path still returns `isActivePipelineOverCapacity`'s result; non-slot-consuming early-return preserved (verified by 2 new positive-path tests + the existing non-slot-consuming test).
- [x] AC #6 — Static guard `tests/unit/integrations/pm-router-adapter-pm-scope.test.ts` asserts each PM router adapter source contains either `withPMScopeForDispatch` or `withPMProvider`; CI fails when missing.
- [x] AC #7 — CLAUDE.md gains "Capacity-gate invariant" paragraph in the Architecture section.
- [x] AC #8 — All new/modified code has tests (+4 helper, +3 capacity-gate, +3 static guard, plus passthrough mocks added to 3 PM-source trigger tests + 3 PM router adapter tests).
- [x] AC #9 — `npm run typecheck` passes.
- [x] AC #10 — `npm test` (full unit suite) passes (474 files / 8677 tests, +9 from baseline after Plan 1).
- [x] AC #11 — `npm run lint` passes.
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ slug: router-silent-failure-hardening
level: spec
title: Router-side silent-failure hardening
created: 2026-04-29
status: draft
status: done
---

# 017: Router-side silent-failure hardening
Expand Down
12 changes: 7 additions & 5 deletions src/gadgets/sessionState.ts
Original file line number Diff line number Diff line change
Expand Up @@ -183,16 +183,18 @@ export class SessionState {
const commentId = this.state.initialCommentId;
if (!commentId) return;

// Clear state first so the post-agent callback sees null and short-circuits
// Clear the id eagerly so concurrent reads can't observe a stale value.
// The post-agent callback's actual gate is the `initialCommentIdConsumed`
// flag set below — once that's true, the callback's legacy fallback to
// `agentInput.ackCommentId` is also short-circuited.
this.state.initialCommentId = null;

try {
const { githubClient } = await import('../github/client.js');
await githubClient.deletePRComment(owner, repo, commentId);
// Mark consumed so the post-agent callback's legacy fallback to
// `agentInput.ackCommentId` does not re-issue a DELETE for the
// same id. `deletePRComment` swallows 404 internally, so reaching
// here without throwing covers both 200/204 and 404 outcomes.
// `deletePRComment` swallows 404 internally, so reaching here without
// throwing covers both 200/204 (we deleted) and 404 (someone else
// already did) outcomes — both mean the comment is gone.
this.state.initialCommentIdConsumed = true;
} catch {
// Best-effort: restore the id so post-agent callback can retry.
Expand Down
42 changes: 42 additions & 0 deletions src/router/adapters/_shared.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
/**
* Shared helpers for router platform adapters.
*
* Spec 017 / plan 2: PM router adapters (Linear/Trello/JIRA) wrap their
* `triggerRegistry.dispatch(ctx)` invocation in `withPMProvider` scope using
* the helper below. Without it, `getPMProvider()` calls inside trigger
* handlers — notably the pipeline-capacity gate at
* `src/triggers/shared/pipeline-capacity-gate.ts` — throw, the gate fails
* closed under spec 017's fail-closed policy, and Sentry captures under
* tag `pipeline_capacity_gate_no_pm_provider`.
*
* Mirrors the GitHub router adapter's existing correct shape at
* `src/router/adapters/github.ts:dispatchWithCredentials` which has wrapped
* dispatch in PM-provider scope since spec 006.
*
* The helper does NOT establish credential scope — that's each adapter's
* concern. PM-provider scope layers on TOP of the credential scope.
*/

import { withPMProvider } from '../../pm/context.js';
import { createPMProvider } from '../../pm/index.js';
import type { ProjectConfig } from '../../types/index.js';

/**
* Wrap a dispatch callback in PM-provider AsyncLocalStorage scope so that
* `getPMProvider()` succeeds inside trigger handlers downstream.
*
* Resolves the PMProvider via `createPMProvider(project)` (the legacy
* compatibility adapter that delegates to the manifest registry's
* `pmIntegration.createProvider(project)`) and runs `dispatch` inside
* `withPMProvider(provider, dispatch)`.
*
* Returns whatever `dispatch` returns. Errors thrown by `dispatch`
* propagate unchanged.
*/
export function withPMScopeForDispatch<T>(
project: ProjectConfig,
dispatch: () => Promise<T>,
): Promise<T> {
const provider = createPMProvider(project);
return withPMProvider(provider, dispatch);
}
3 changes: 2 additions & 1 deletion src/router/adapters/jira.ts
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ import type { AckResult, ParsedWebhookEvent, RouterPlatformAdapter } from '../pl
import { resolveJiraCredentials } from '../platformClients/index.js';
import type { CascadeJob, JiraJob } from '../queue.js';
import { sendAcknowledgeReaction } from '../reactions.js';
import { withPMScopeForDispatch } from './_shared.js';

const PROCESSABLE_EVENTS = [
'jira:issue_updated',
Expand Down Expand Up @@ -127,7 +128,7 @@ export class JiraRouterAdapter implements RouterPlatformAdapter {
const ctx: TriggerContext = { project: fullProject, source: 'jira', payload };
return withJiraCredentials(
{ email: jiraCreds.email, apiToken: jiraCreds.apiToken, baseUrl: jiraCreds.baseUrl },
() => triggerRegistry.dispatch(ctx),
() => withPMScopeForDispatch(fullProject, () => triggerRegistry.dispatch(ctx)),
);
}

Expand Down
3 changes: 2 additions & 1 deletion src/router/adapters/linear.ts
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ import { loadProjectConfig, type RouterProjectConfig } from '../config.js';
import type { AckResult, ParsedWebhookEvent, RouterPlatformAdapter } from '../platform-adapter.js';
import { resolveLinearCredentials } from '../platformClients/index.js';
import type { CascadeJob, LinearJob } from '../queue.js';
import { withPMScopeForDispatch } from './_shared.js';

// ============================================================================
// Processable event combinations (action/type)
Expand Down Expand Up @@ -237,7 +238,7 @@ export class LinearRouterAdapter implements RouterPlatformAdapter {

const ctx: TriggerContext = { project: fullProject, source: 'linear', payload };
return withLinearCredentials({ apiKey: linearCreds.apiKey }, () =>
triggerRegistry.dispatch(ctx),
withPMScopeForDispatch(fullProject, () => triggerRegistry.dispatch(ctx)),
);
}

Expand Down
5 changes: 4 additions & 1 deletion src/router/adapters/trello.ts
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ import {
isReadyToProcessLabelAdded,
isSelfAuthoredTrelloComment,
} from '../trello.js';
import { withPMScopeForDispatch } from './_shared.js';

export class TrelloRouterAdapter implements RouterPlatformAdapter {
readonly type = 'trello' as const;
Expand Down Expand Up @@ -125,7 +126,9 @@ export class TrelloRouterAdapter implements RouterPlatformAdapter {
}

const ctx: TriggerContext = { project: fullProject, source: 'trello', payload };
return withTrelloCredentials(trelloCreds, () => triggerRegistry.dispatch(ctx));
return withTrelloCredentials(trelloCreds, () =>
withPMScopeForDispatch(fullProject, () => triggerRegistry.dispatch(ctx)),
);
}

async postAck(
Expand Down
29 changes: 26 additions & 3 deletions src/triggers/shared/pipeline-capacity-gate.ts
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@

import { getPMProvider } from '../../pm/context.js';
import type { PMProvider } from '../../pm/types.js';
import { captureException } from '../../sentry.js';
import type { ProjectConfig } from '../../types/index.js';
import { logger } from '../../utils/logging.js';
import { isActivePipelineOverCapacity } from './backlog-check.js';
Expand All @@ -34,14 +35,36 @@ export async function shouldBlockForPipelineCapacity(args: {
try {
provider = getPMProvider();
} catch (err) {
// No credential scope — conservative: allow.
logger.warn('pipeline-capacity-gate: PM provider unavailable, allowing run', {
// Spec 017 / plan 2: fail closed.
//
// Before plan 2, this branch logged WARN and returned `false` (allow)
// because the PM router adapters dispatched outside PM-provider scope
// — hitting this branch was the routine path for every PM
// `status-changed` trigger. After plan 2 wraps every PM router adapter
// in `withPMScopeForDispatch`, hitting this branch represents a real
// AsyncLocalStorage scope leak that operators need to investigate.
// Failing closed (block + error + Sentry) is preferable to silently
// failing open and re-introducing the original incident class
// (multiple concurrent implementation runs against a project pinned
// to `maxInFlightItems: 1`).
const error = err instanceof Error ? err : new Error(String(err));
logger.error('pipeline-capacity-gate: PM provider unavailable, blocking run', {
source: args.source,
projectId: args.project.id,
workItemId: args.workItemId,
agentType: args.agentType,
error: String(err),
});
return false;
captureException(error, {
tags: { source: 'pipeline_capacity_gate_no_pm_provider' },
extra: {
projectId: args.project.id,
workItemId: args.workItemId,
agentType: args.agentType,
triggerSource: args.source,
},
});
return true;
}

const result = await isActivePipelineOverCapacity(args.project, provider, {
Expand Down
Loading
Loading