feat: unified run timeout system + YAML default task prompt #910
Merged — zbigniewsobiecki merged 2 commits into dev, Mar 16, 2026
Conversation
- Add `getDefaultTaskPrompt(agentType)` to `src/agents/prompts/index.ts`. Reads the factory-default task prompt directly from the YAML definition without requiring `initPrompts()`; returns null for unknown agent types.
- Wire it into the `agentConfigs.getPrompts` tRPC endpoint as a fourth prompt layer (`defaultTaskPrompt`), completing the inheritance chain: project override → global override → default system (disk template) → default task (YAML definition).
- Update `agent-prompt-overrides.tsx` to use `defaultTaskPrompt` as the final fallback when initialising the task prompt editor and as the target of the "Load default" button.
- Add `startWatchdog(project.watchdogTimeoutMs)` to `triggerManualRun` so manual runs respect the per-project timeout the same way webhook-triggered runs do.
- Fix unit tests: add `getDefaultTaskPrompt` to the `prompts/index.js` mock in agentConfigs.test.ts; mock `lifecycle.js` in manual-runner.test.ts to prevent the watchdog timer from calling `process.exit` during tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
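The four-layer inheritance chain above can be sketched as a nullish-coalescing fallback. This is a minimal illustration, not the actual endpoint code: only `defaultTaskPrompt` is named in the PR, and the other field names here are assumptions.

```typescript
// Hypothetical shape of the four prompt layers returned by agentConfigs.getPrompts.
// Only `defaultTaskPrompt` is named in the PR; the other field names are illustrative.
interface PromptLayers {
  projectOverride: string | null;     // per-project override (highest priority)
  globalOverride: string | null;      // global override
  defaultSystemPrompt: string | null; // default system prompt (disk template)
  defaultTaskPrompt: string | null;   // factory default from the YAML definition (final fallback)
}

// Resolve the effective task prompt: the first non-null layer wins.
function resolveTaskPrompt(layers: PromptLayers): string | null {
  return (
    layers.projectOverride ??
    layers.globalOverride ??
    layers.defaultSystemPrompt ??
    layers.defaultTaskPrompt
  );
}
```

The "Load default" button then corresponds to writing `layers.defaultTaskPrompt` back into the editor, rather than the global override.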
Replaces two independent, uncoordinated timeout mechanisms with a single coherent flow where `watchdogTimeoutMs` is the source of truth.

## Problem

Two timeouts existed with no knowledge of each other:

1. In-container watchdog (`startWatchdog(project.watchdogTimeoutMs)`) — per-project; updates the DB to `timed_out`, then exits.
2. Router-level kill (`setTimeout → killWorker`) — global env var; killed the Docker container with no DB update.

This caused three bugs:

- If `WORKER_TIMEOUT_MS` < `watchdogTimeoutMs`, the router killed the container before the watchdog could set the correct DB status.
- GitHub-triggered runs (no `workItemId`) were never marked in the DB after a router kill — they stayed `running` forever.
- Orphaned containers (after a router restart) were stopped, but their DB runs were never updated.

## Solution

**Per-project timeout in `spawnWorker`**: the router now reads `watchdogTimeoutMs` from project config and uses it plus a 2-minute buffer (`ROUTER_KILL_BUFFER_MS`) for the container kill timer, so the watchdog always fires first and the router is purely a backstop.

**DB update on router kill (`killWorker`)**: after stopping the container, marks the run `timed_out` via `failOrphanedRun` (workItemId path) or `failOrphanedRunFallback` (GitHub PR runs without a workItemId). The call to `cleanupWorker` no longer passes an exit code, so it skips its own DB write, eliminating the race that could set the wrong status (`failed` instead of `timed_out`).

**Fallback for GitHub PR runs (`failOrphanedRunFallback`)**: a new repository function that finds the most recent running run by `projectId + agentType + startedAt ≥ containerStart` and marks it, guarded by an optimistic `WHERE status='running'` check so it is always safe to call even if the watchdog already acted.

**DB update in `cleanupWorker`**: extended to also handle the workItemId-absent case via `failOrphanedRunFallback`, covering crashes of GitHub PR runs that the watchdog didn't catch.
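The kill-timer arithmetic behind "watchdog always fires first" is simple to sketch. This is an illustration assuming the 2-minute buffer stated in the PR; the helper function name is hypothetical.

```typescript
// Assumed 2-minute buffer from the PR (ROUTER_KILL_BUFFER_MS).
const ROUTER_KILL_BUFFER_MS = 2 * 60 * 1000;

// Hypothetical helper: the router's kill timer is derived from the same
// per-project watchdogTimeoutMs, plus a buffer, so it always fires strictly
// after the in-container watchdog and acts purely as a backstop.
function routerKillTimeoutMs(watchdogTimeoutMs: number): number {
  return watchdogTimeoutMs + ROUTER_KILL_BUFFER_MS;
}
```

Deriving the router timer from the project's own `watchdogTimeoutMs` is what removes the old failure mode where a global `WORKER_TIMEOUT_MS` could undercut a longer per-project watchdog.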
**`cascade.agent.type` container label**: added at spawn time so orphan cleanup can pass `agentType` to `failOrphanedRunFallback`, avoiding matching the wrong run when multiple agent types run concurrently.

**`durationMs` on orphaned runs**: all three fail paths now compute and persist the elapsed duration, so dashboard users see the actual run time instead of null.

**Fixed BullMQ `lockDuration`**: replaced `workerTimeoutMs + 60s` with a fixed 8-hour constant (`BULLMQ_LOCK_DURATION_MS`). `guardedSpawn` resolves immediately after container start, so the lock is held for only seconds, and tying it to `workerTimeoutMs` risked lock expiry for long-running project configs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
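The selection and optimistic-guard behaviour of `failOrphanedRunFallback` can be sketched against an in-memory run list. This is an illustration only: the real function is a repository query whose `WHERE status='running'` clause provides the guard, and the `Run` shape here is assumed.

```typescript
type RunStatus = "running" | "timed_out" | "failed" | "completed";

// Assumed minimal run shape for illustration.
interface Run {
  id: number;
  projectId: string;
  agentType: string;
  startedAt: number; // epoch ms
  status: RunStatus;
  durationMs: number | null;
}

// Hypothetical in-memory version of failOrphanedRunFallback: pick the most
// recent run matching projectId + agentType + startedAt >= containerStart that
// is still "running". The status check doubles as the optimistic guard — if
// the watchdog already marked the run, nothing matches and this is a safe no-op.
function failOrphanedRunFallback(
  runs: Run[],
  projectId: string,
  agentType: string,
  containerStart: number,
  now: number,
): Run | undefined {
  const candidate = runs
    .filter(
      (r) =>
        r.projectId === projectId &&
        r.agentType === agentType &&
        r.startedAt >= containerStart &&
        r.status === "running",
    )
    .sort((a, b) => b.startedAt - a.startedAt)[0];
  if (candidate) {
    candidate.status = "timed_out";
    candidate.durationMs = now - candidate.startedAt; // persist elapsed duration
  }
  return candidate;
}
```

Filtering on `agentType` is why the `cascade.agent.type` container label matters: without it, orphan cleanup could mark a concurrent run of a different agent type.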
## Summary
Two improvements on top of the agent opt-in enforcement work already merged in #897.
### 1. YAML default task prompt in the prompt editor

- `getDefaultTaskPrompt(agentType)` — new function in `src/agents/prompts/index.ts` that reads the factory-default task prompt directly from YAML definitions without requiring `initPrompts()`. Returns null for unknown agent types.
- The `agentConfigs.getPrompts` endpoint gains a fourth `defaultTaskPrompt` field, completing the four-layer inheritance chain: project override → global override → default system (disk template) → default task (YAML definition).
- The editor in `agent-prompt-overrides.tsx` falls back to `defaultTaskPrompt` when initialising, and the "Load default" button restores the YAML definition rather than the global prompt.
- `triggerManualRun` now calls `startWatchdog(project.watchdogTimeoutMs)` — manual runs now respect the same per-project timeout as webhook-triggered runs.

### 2. Unified agent run timeout system
Replaces two independent, uncoordinated timeout mechanisms with a single coherent flow where `watchdogTimeoutMs` is the authoritative timeout.

Before: two timeouts that didn't know about each other:

- In-container watchdog (`startWatchdog`) — per-project; updates the DB to `timed_out`, then exits.
- Router-level kill (`setTimeout → killWorker`) — global env var; killed the Docker container with no DB update.

This caused GitHub PR runs (no `workItemId`) to stay `running` in the DB forever after a router kill, and orphaned containers (after a router restart) were stopped but their runs were never updated.

After — one coherent flow:

- `spawnWorker` reads `watchdogTimeoutMs` from project config and applies a `ROUTER_KILL_BUFFER_MS` (2 min) backstop, so the in-container watchdog always fires first.
- `killWorker` marks the run `timed_out` after stopping the container, via `failOrphanedRun` (workItemId path) or the new `failOrphanedRunFallback` (GitHub PR runs).
- `cleanupWorker` is called without an exit code from `killWorker`, preventing a concurrent write with the wrong `failed` status.
- `cascade.agent.type` container label: orphan cleanup now reads the agent type from container labels to narrow its fallback query when multiple agent types run concurrently.
- `durationMs` on all fail paths: timed-out and crashed runs now record elapsed duration instead of null.
- `lockDuration`: changed from `workerTimeoutMs + 60s` to a fixed 8-hour constant (`BULLMQ_LOCK_DURATION_MS`) to prevent lock expiry for long-running projects.

### Test plan
- `npm run typecheck` — clean
- `npm run lint` — clean
- `npm test` — 5411/5411 passing (includes all previously-failing `agentConfigs` and `manual-runner` tests, now fixed)
- `toHaveBeenCalledTimes(1)` assertions on `killWorker` DB calls lock out the double-update regression
- `cascade.agent.type` label passthrough in orphan cleanup
- `failOrphanedRunFallback` across active-workers, container-manager, and orphan-cleanup

🤖 Generated with Claude Code