
feat(agents): agent opt-in enforcement + unified run timeout system #909

Closed
zbigniewsobiecki wants to merge 6 commits into dev from feature/agent-opt-in-enforcement

Conversation

@zbigniewsobiecki
Member

Summary

Two related improvements that complete the agent opt-in enforcement feature and fix a class of DB consistency bugs in the router's timeout handling.

Agent opt-in enforcement (completing existing branch work)

  • Require explicit agent config: agents only run for a project when an agent_configs row exists — no more "everything enabled by default"
  • Expose YAML default task prompt: getDefaultTaskPrompt() reads the factory-default task prompt from YAML definitions and surfaces it through the getPrompts API as a fourth inheritance layer (defaultTaskPrompt)
  • Dashboard prompt editor: falls back to the YAML default task prompt when no project/global override exists; "Load default" button now restores the YAML default rather than the global prompt
  • Manual runs respect watchdog timeout: triggerManualRun now calls startWatchdog(project.watchdogTimeoutMs) so manual runs are subject to the same per-project timeout as webhook-triggered runs
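The enforcement rule above can be sketched as follows. This is a minimal in-memory illustration, not the actual repository code: the `agent_configs` table is modeled as an array, and the guard name `isAgentEnabledForProject` is taken from the commit messages below.

```typescript
type AgentConfigRow = { projectId: string; agentType: string };

// In-memory stand-in for the agent_configs table.
const agentConfigs: AgentConfigRow[] = [
  { projectId: "proj-1", agentType: "reviewer" },
];

function isAgentEnabledForProject(projectId: string, agentType: string): boolean {
  // No matching row means the agent is disabled: there is no
  // "enabled by default" path anymore.
  return agentConfigs.some(
    (row) => row.projectId === projectId && row.agentType === agentType,
  );
}
```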

Unified agent run timeout system

Replaces two independent, uncoordinated timeout mechanisms with a single coherent flow where watchdogTimeoutMs is the source of truth.

Before: two timeouts with no knowledge of each other:

  1. In-container watchdog (startWatchdog) — per-project, updates DB to timed_out then exits
  2. Router-level kill (setTimeout → killWorker) — global env var, killed the Docker container with no DB update

This meant GitHub PR runs (no workItemId) stayed running in the DB forever after a router kill, and orphaned containers (after router restart) were stopped but their runs were never updated.

After — one coherent flow:

  • Router reads watchdogTimeoutMs from project config and sets its kill timer to watchdogTimeoutMs + 2 min (the ROUTER_KILL_BUFFER_MS backstop), so the watchdog always fires first
  • killWorker now marks the run timed_out in the DB after stopping the container, covering both workItemId runs and GitHub PR runs (via new failOrphanedRunFallback)
  • Fixed race condition: killWorker now calls cleanupWorker without an exit code, so cleanupWorker skips its own DB update instead of overwriting the status with the wrong failed value
  • cascade.agent.type label added to containers so orphan cleanup can narrow its fallback query to the correct agent type
  • All fail paths now record durationMs so dashboard users see actual run time instead of null
  • BullMQ lockDuration changed from workerTimeoutMs + 60s to a fixed 8-hour constant to prevent lock expiry on long-running projects
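The timer layering in the first bullet reduces to a simple invariant, sketched here with the 2-minute buffer the PR states (the function name is ours, not from the codebase):

```typescript
// Router backstop buffer: 2 minutes, per the PR description.
const ROUTER_KILL_BUFFER_MS = 2 * 60 * 1000;

function routerKillTimerMs(watchdogTimeoutMs: number): number {
  // The in-container watchdog fires at watchdogTimeoutMs; the router kill
  // is strictly later, so the router only acts when the watchdog failed.
  return watchdogTimeoutMs + ROUTER_KILL_BUFFER_MS;
}
```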

Test plan

  • npm run typecheck — clean
  • npm run lint — clean
  • npm test — 5411/5411 tests passing (all previously-failing agentConfigs and manual-runner tests now fixed)
  • New unit tests for failOrphanedRunFallback in active-workers.test.ts, container-manager.test.ts, and orphan-cleanup.test.ts
  • toHaveBeenCalledTimes(1) assertions on killWorker DB calls to lock out the double-update regression
  • Tests for cascade.agent.type label passthrough in orphan cleanup
  • Manual: trigger a run, verify it times out at watchdogTimeoutMs with timed_out status in dashboard
  • Manual: simulate watchdog failure, verify router kill at +2 min shows timed_out (not running) in dashboard

🤖 Generated with Claude Code

Cascade Bot and others added 6 commits March 16, 2026 13:14
The new `isAgentEnabledForProject()` guard requires an explicit
`agent_configs` row before any trigger can fire. Integration tests
that called `handle()` and expected a non-null result were failing
because no agent config was seeded.

- Seed `agent_configs` + `agent_trigger_configs` rows in tests that
  expect triggers to fire (trigger-registry, pm-provider-switching,
  github-personas)
- Export `clearAgentEnabledCache()` from agentConfigsRepository for
  test isolation (the 5s TTL cache was causing false negatives when
  tests ran within the TTL window of a prior test that had no config)
- Call `clearAgentEnabledCache()` in `truncateAll()` so every
  `beforeEach` starts with a clean cache alongside a clean DB
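The cache-leak problem this commit fixes can be sketched like so. The cache shape is an assumption (the real one lives in agentConfigsRepository); only the 5s TTL and the `clearAgentEnabledCache` hook come from the commit message.

```typescript
type CacheEntry = { enabled: boolean; expiresAt: number };

const TTL_MS = 5_000; // 5s TTL stated in the commit message
const enabledCache = new Map<string, CacheEntry>();

function getCachedEnabled(key: string, now: number): boolean | undefined {
  const entry = enabledCache.get(key);
  if (!entry || now >= entry.expiresAt) return undefined; // miss or expired
  return entry.enabled;
}

function setCachedEnabled(key: string, enabled: boolean, now: number): void {
  enabledCache.set(key, { enabled, expiresAt: now + TTL_MS });
}

// Test-isolation hook: without it, a "disabled" result cached by one test
// leaks into the next test's 5s window even after the DB is truncated.
function clearAgentEnabledCache(): void {
  enabledCache.clear();
}
```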

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add `getDefaultTaskPrompt(agentType)` to `src/agents/prompts/index.ts`
  Reads the factory-default task prompt directly from the YAML definition
  without requiring `initPrompts()`. Returns null for unknown agent types.

- Wire it into the `agentConfigs.getPrompts` tRPC endpoint as a fourth
  prompt layer (`defaultTaskPrompt`), completing the inheritance chain:
  project override → global override → default system (disk template) →
  default task (YAML definition).

- Update `agent-prompt-overrides.tsx` to use `defaultTaskPrompt` as the
  final fallback when initialising the task prompt editor and as the
  target of the "Load default" button.

- Add `startWatchdog(project.watchdogTimeoutMs)` to `triggerManualRun`
  so manual runs respect the per-project timeout the same way webhook-
  triggered runs do.

- Fix unit tests: add `getDefaultTaskPrompt` to the `prompts/index.js`
  mock in agentConfigs.test.ts; mock `lifecycle.js` in
  manual-runner.test.ts to prevent the watchdog timer from calling
  process.exit during tests.
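The editor fallback described above amounts to a nullish-coalescing chain. A minimal sketch, assuming the three layers arrive as nullable strings from `getPrompts` (the function name here is illustrative, not the dashboard's actual helper):

```typescript
// Fallback order per the commit: project override, then global override,
// then the factory-default task prompt from the YAML definition.
function initialTaskPromptValue(
  projectOverride: string | null,
  globalOverride: string | null,
  defaultTaskPrompt: string | null,
): string | null {
  return projectOverride ?? globalOverride ?? defaultTaskPrompt;
}
```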

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces two independent, uncoordinated timeout mechanisms with a single
coherent flow where `watchdogTimeoutMs` is the source of truth.

## Problem

Two timeouts existed with no knowledge of each other:
1. In-container watchdog (`startWatchdog(project.watchdogTimeoutMs)`) —
   per-project, updates DB to `timed_out` then exits.
2. Router-level kill (`setTimeout → killWorker`) — global env var,
   killed the Docker container with no DB update.

This caused three bugs:
- If `WORKER_TIMEOUT_MS` < `watchdogTimeoutMs` the router killed the
  container before the watchdog could set the correct DB status.
- GitHub-triggered runs (no `workItemId`) were never marked in the DB
  after a router kill — they stayed `running` forever.
- Orphaned containers (after router restart) were stopped but their DB
  runs were never updated.

## Solution

**Per-project timeout in `spawnWorker`**: router now reads
`watchdogTimeoutMs` from project config and uses it + 2-minute buffer
(`ROUTER_KILL_BUFFER_MS`) for the container kill timer, so the watchdog
always fires first and the router is purely a backstop.

**DB update on router kill (`killWorker`)**: after stopping the
container, marks the run `timed_out` via `failOrphanedRun` (workItemId
path) or `failOrphanedRunFallback` (GitHub PR runs without workItemId).
The call to `cleanupWorker` no longer passes an exit code so it skips
its own DB write, eliminating the race that could set the wrong status
(`failed` instead of `timed_out`).
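The exit-code convention that closes this race can be sketched as follows, with the DB modeled as an in-memory map and all names treated as assumptions based on the description above:

```typescript
type RunStatus = "running" | "failed" | "timed_out";
const runStatuses = new Map<string, RunStatus>();

function cleanupWorker(runId: string, exitCode?: number): void {
  // No exit code means the kill path already wrote the status, so skip the
  // DB write entirely instead of racing it with a "failed" update.
  if (exitCode === undefined) return;
  if (runStatuses.get(runId) === "running") {
    runStatuses.set(runId, "failed");
  }
}

function killWorker(runId: string): void {
  runStatuses.set(runId, "timed_out"); // stop container, then mark the run
  cleanupWorker(runId); // deliberately no exit code: no competing write
}
```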

**Fallback for GitHub PR runs (`failOrphanedRunFallback`)**: new
repository function that finds the most recent running run by
`projectId + agentType + startedAt ≥ containerStart` and marks it,
guarded by an optimistic `WHERE status='running'` check so it is
always safe to call even if the watchdog already acted.
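The matching rule and the optimistic guard can be sketched in memory like this (the real function is a repository query; the record shape here is an assumption):

```typescript
type Run = {
  id: string;
  projectId: string;
  agentType: string;
  startedAt: number;
  status: "running" | "timed_out" | "failed" | "completed";
};

function failOrphanedRunFallback(
  runs: Run[],
  projectId: string,
  agentType: string,
  containerStart: number,
): Run | undefined {
  // Most recent run for this project + agent type that started at or after
  // the container's start time.
  const candidate = runs
    .filter(
      (r) =>
        r.projectId === projectId &&
        r.agentType === agentType &&
        r.startedAt >= containerStart,
    )
    .sort((a, b) => b.startedAt - a.startedAt)[0];
  // Optimistic "WHERE status='running'" guard: a no-op when the watchdog
  // already wrote a terminal status, so the call is always safe.
  if (candidate?.status !== "running") return undefined;
  candidate.status = "timed_out";
  return candidate;
}
```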

**DB update in `cleanupWorker`**: extended to also handle the
workItemId-absent case via `failOrphanedRunFallback`, covering crashes
of GitHub PR runs that the watchdog didn't catch.

**`cascade.agent.type` container label**: added at spawn time so orphan
cleanup can pass `agentType` to `failOrphanedRunFallback`, avoiding
matching the wrong run when multiple agent types run concurrently.
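The narrowing the label enables can be sketched as a simple filter. The label key `cascade.agent.type` is from the commit; the container shape and helper name are illustrative:

```typescript
type ContainerInfo = { id: string; labels: Record<string, string> };

function containersForAgentType(
  containers: ContainerInfo[],
  agentType: string,
): ContainerInfo[] {
  // Only containers carrying the matching cascade.agent.type label are
  // considered, so a "pm" cleanup never touches a "reviewer" run.
  return containers.filter(
    (c) => c.labels["cascade.agent.type"] === agentType,
  );
}
```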

**`durationMs` on orphaned runs**: all three fail paths now compute and
persist the elapsed duration so dashboard users see actual run time
instead of null.

**Fixed BullMQ `lockDuration`**: replaced `workerTimeoutMs + 60s` with
a fixed 8-hour constant (`BULLMQ_LOCK_DURATION_MS`) — `guardedSpawn`
resolves immediately after container start so the lock is held for
seconds, and tying it to `workerTimeoutMs` risked lock expiry for
long-running project configs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zbigniewsobiecki
Member Author

Closing — recreating PR from a clean branch against dev to avoid duplicate commit history from PR #897
