
feat: unified run timeout system + YAML default task prompt #910

Merged: zbigniewsobiecki merged 2 commits into dev from feat/unified-timeout-and-default-task-prompt on Mar 16, 2026

@zbigniewsobiecki

Summary

Two improvements on top of the agent opt-in enforcement work already merged in #897.

1. YAML default task prompt in the prompt editor

  • getDefaultTaskPrompt(agentType) — new function in src/agents/prompts/index.ts that reads the factory-default task prompt directly from YAML definitions without requiring initPrompts(). Returns null for unknown agent types.
  • agentConfigs.getPrompts endpoint gains a fourth defaultTaskPrompt field, completing the four-layer inheritance chain: project override → global override → default system (disk template) → default task (YAML definition).
  • Dashboard prompt editor now falls back to defaultTaskPrompt when initialising and the "Load default" button restores the YAML definition rather than the global prompt.
  • triggerManualRun now calls startWatchdog(project.watchdogTimeoutMs), so manual runs respect the same per-project timeout as webhook-triggered runs.
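
The editor fallback described above can be sketched roughly as follows. This is an illustration only, not the code in agent-prompt-overrides.tsx: the `TaskPromptLayers` interface, the `projectTaskPrompt` and `globalTaskPrompt` field names, and both helper functions are hypothetical. Only `defaultTaskPrompt` is named in this PR.

```typescript
// Hypothetical shape of the data the editor receives. Only `defaultTaskPrompt`
// is named in the PR; the other two field names are assumptions.
interface TaskPromptLayers {
  projectTaskPrompt: string | null; // project-level override, if any
  globalTaskPrompt: string | null; // global override, if any
  defaultTaskPrompt: string | null; // YAML definition; null for unknown agent types
}

// Initial editor value: the first layer that is set, falling back to the YAML default.
function initialTaskPrompt(layers: TaskPromptLayers): string {
  return (
    layers.projectTaskPrompt ??
    layers.globalTaskPrompt ??
    layers.defaultTaskPrompt ??
    ""
  );
}

// "Load default" restores the YAML definition, not the global override.
function loadDefaultTaskPrompt(layers: TaskPromptLayers): string {
  return layers.defaultTaskPrompt ?? "";
}

const example: TaskPromptLayers = {
  projectTaskPrompt: null,
  globalTaskPrompt: "global task prompt",
  defaultTaskPrompt: "yaml task prompt",
};

console.log(initialTaskPrompt(example)); // "global task prompt"
console.log(loadDefaultTaskPrompt(example)); // "yaml task prompt"
```

Using `??` rather than `||` means an override that was deliberately saved as an empty string is still treated as set. Whether the real editor makes that distinction is not stated in the PR.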

2. Unified agent run timeout system

Replaces two independent, uncoordinated timeout mechanisms with a single coherent flow where watchdogTimeoutMs is the authoritative timeout.

Before, two timeouts ran with no knowledge of each other:

  1. In-container watchdog (startWatchdog) — per-project, updates DB to timed_out then exits
  2. Router-level kill (setTimeout → killWorker) — global env var, killed the Docker container with no DB update

As a result, GitHub PR runs (which have no workItemId) stayed in the running state in the DB forever after a router kill, and orphaned containers (left behind by a router restart) were stopped without their runs ever being updated.

After — one coherent flow:

  • Per-project router timeout: reads watchdogTimeoutMs from project config and adds ROUTER_KILL_BUFFER_MS (2 min) for the router's kill timer, so the in-container watchdog always fires first and the router acts only as a backstop
  • DB update on router kill: killWorker marks the run timed_out after stopping the container, via failOrphanedRun (workItemId path) or new failOrphanedRunFallback (GitHub PR runs)
  • Race condition fixed: killWorker now calls cleanupWorker without an exit code, so cleanupWorker skips its own DB write and can no longer race to set the wrong status (failed instead of timed_out)
  • cascade.agent.type container label: orphan cleanup now reads agent type from container labels to narrow its fallback query when multiple agent types run concurrently
  • durationMs on all fail paths: timed-out and crashed runs now record elapsed duration instead of null
  • Fixed BullMQ lockDuration: changed from workerTimeoutMs + 60s to a fixed 8-hour constant (BULLMQ_LOCK_DURATION_MS) to prevent lock expiry for long-running projects
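
The ordering guarantee between the two timers comes down to simple arithmetic, sketched below. `ROUTER_KILL_BUFFER_MS` and `watchdogTimeoutMs` are names from this PR. The helper `routerKillTimeoutMs` and the 30-minute example value are assumptions made for illustration.

```typescript
// 2-minute buffer, as described in this PR.
const ROUTER_KILL_BUFFER_MS = 2 * 60 * 1000;

// Hypothetical helper: the router's kill timer is the project's watchdog
// timeout plus the buffer, so the in-container watchdog always fires first.
function routerKillTimeoutMs(watchdogTimeoutMs: number): number {
  return watchdogTimeoutMs + ROUTER_KILL_BUFFER_MS;
}

// Example project config value (assumed): a 30-minute watchdog.
const watchdogTimeoutMs = 30 * 60 * 1000;
const killAfterMs = routerKillTimeoutMs(watchdogTimeoutMs);

console.log(watchdogTimeoutMs); // 1800000
console.log(killAfterMs); // 1920000
```

Because the router deadline is derived from the per-project value instead of a global env var, it can no longer be shorter than the watchdog deadline.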

Test plan

  • npm run typecheck — clean
  • npm run lint — clean
  • npm test — 5411/5411 passing (includes all previously-failing agentConfigs and manual-runner tests, now fixed)
  • toHaveBeenCalledTimes(1) assertions on killWorker DB calls lock out the double-update regression
  • New tests for cascade.agent.type label passthrough in orphan cleanup
  • New tests for failOrphanedRunFallback across active-workers, container-manager, and orphan-cleanup

🤖 Generated with Claude Code

zbigniewsobiecki and others added 2 commits March 16, 2026 16:33
- Add `getDefaultTaskPrompt(agentType)` to `src/agents/prompts/index.ts`
  Reads the factory-default task prompt directly from the YAML definition
  without requiring `initPrompts()`. Returns null for unknown agent types.

- Wire it into the `agentConfigs.getPrompts` tRPC endpoint as a fourth
  prompt layer (`defaultTaskPrompt`), completing the inheritance chain:
  project override → global override → default system (disk template) →
  default task (YAML definition).

- Update `agent-prompt-overrides.tsx` to use `defaultTaskPrompt` as the
  final fallback when initialising the task prompt editor and as the
  target of the "Load default" button.

- Add `startWatchdog(project.watchdogTimeoutMs)` to `triggerManualRun`
  so manual runs respect the per-project timeout the same way webhook-
  triggered runs do.

- Fix unit tests: add `getDefaultTaskPrompt` to the `prompts/index.js`
  mock in agentConfigs.test.ts; mock `lifecycle.js` in
  manual-runner.test.ts to prevent the watchdog timer from calling
  process.exit during tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces two independent, uncoordinated timeout mechanisms with a single
coherent flow where `watchdogTimeoutMs` is the source of truth.

## Problem

Two timeouts existed with no knowledge of each other:
1. In-container watchdog (`startWatchdog(project.watchdogTimeoutMs)`) —
   per-project, updates DB to `timed_out` then exits.
2. Router-level kill (`setTimeout → killWorker`) — global env var,
   killed the Docker container with no DB update.

This caused three bugs:
- If `WORKER_TIMEOUT_MS` < `watchdogTimeoutMs` the router killed the
  container before the watchdog could set the correct DB status.
- GitHub-triggered runs (no `workItemId`) were never marked in the DB
  after a router kill — they stayed `running` forever.
- Orphaned containers (after router restart) were stopped but their DB
  runs were never updated.

## Solution

**Per-project timeout in `spawnWorker`**: the router now reads
`watchdogTimeoutMs` from project config and uses it plus a 2-minute
buffer (`ROUTER_KILL_BUFFER_MS`) for the container kill timer, so the
watchdog always fires first and the router is purely a backstop.

**DB update on router kill (`killWorker`)**: after stopping the
container, marks the run `timed_out` via `failOrphanedRun` (workItemId
path) or `failOrphanedRunFallback` (GitHub PR runs without workItemId).
The call to `cleanupWorker` no longer passes an exit code so it skips
its own DB write, eliminating the race that could set the wrong status
(`failed` instead of `timed_out`).

**Fallback for GitHub PR runs (`failOrphanedRunFallback`)**: new
repository function that finds the most recent running run by
`projectId + agentType + startedAt ≥ containerStart` and marks it,
guarded by an optimistic `WHERE status='running'` check so it is
always safe to call even if the watchdog already acted.
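
A minimal in-memory sketch of that guarded update follows. It is not the repository function itself: the `Run` shape, the parameter list, and the array standing in for the runs table are all assumptions. Here the `status === "running"` filter plays the role of the `WHERE status='running'` guard.

```typescript
type RunStatus = "running" | "timed_out" | "failed" | "completed";

// Hypothetical row shape; the field names follow the PR text where it names them.
interface Run {
  id: number;
  projectId: string;
  agentType: string;
  startedAt: Date;
  status: RunStatus;
  durationMs: number | null;
}

// In-memory stand-in for the repository query described above.
function failOrphanedRunFallback(
  runs: Run[],
  projectId: string,
  agentType: string,
  containerStart: Date,
  status: "timed_out" | "failed",
  now: Date,
): Run | null {
  const candidates = runs
    .filter(
      (r) =>
        r.projectId === projectId &&
        r.agentType === agentType &&
        r.startedAt.getTime() >= containerStart.getTime() &&
        r.status === "running", // guard: never overwrite a run that was already finalised
    )
    .sort((a, b) => b.startedAt.getTime() - a.startedAt.getTime());

  const run = candidates[0]; // most recent match, if any
  if (!run) return null;

  run.status = status;
  run.durationMs = now.getTime() - run.startedAt.getTime();
  return run;
}

const containerStart = new Date("2026-03-16T10:00:00Z");
const runs: Run[] = [
  {
    id: 1,
    projectId: "p1",
    agentType: "review",
    startedAt: new Date("2026-03-16T10:00:01Z"),
    status: "running",
    durationMs: null,
  },
];
const now = new Date("2026-03-16T10:00:06Z");

const first = failOrphanedRunFallback(runs, "p1", "review", containerStart, "timed_out", now);
const second = failOrphanedRunFallback(runs, "p1", "review", containerStart, "failed", now);
console.log(first?.status, first?.durationMs); // timed_out 5000
console.log(second); // null
```

The second call finds nothing to update because the run is no longer `running`, which is what makes it safe to call after the watchdog has already acted.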

**DB update in `cleanupWorker`**: extended to also handle the
workItemId-absent case via `failOrphanedRunFallback`, covering crashes
of GitHub PR runs that the watchdog didn't catch.

**`cascade.agent.type` container label**: added at spawn time so orphan
cleanup can pass `agentType` to `failOrphanedRunFallback`, avoiding
matching the wrong run when multiple agent types run concurrently.

**`durationMs` on orphaned runs**: all three fail paths now compute and
persist the elapsed duration so dashboard users see actual run time
instead of null.

**Fixed BullMQ `lockDuration`**: replaced `workerTimeoutMs + 60s` with
a fixed 8-hour constant (`BULLMQ_LOCK_DURATION_MS`) — `guardedSpawn`
resolves immediately after container start so the lock is held for
seconds, and tying it to `workerTimeoutMs` risked lock expiry for
long-running project configs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zbigniewsobiecki zbigniewsobiecki merged commit ae7c7df into dev Mar 16, 2026
6 checks passed
@zbigniewsobiecki zbigniewsobiecki deleted the feat/unified-timeout-and-default-task-prompt branch March 16, 2026 16:39