Skip to content

fix(daemon): startup grace period to prevent false-positive stall alerts#12

Merged
sethvoltz merged 4 commits intomainfrom
feature/stall-detection-fix
May 3, 2026
Merged

fix(daemon): startup grace period to prevent false-positive stall alerts#12
sethvoltz merged 4 commits intomainfrom
feature/stall-detection-fix

Conversation

@sethvoltz
Copy link
Copy Markdown
Owner

Summary

  • Root cause: runHealthCheck stall detection checked if (lastTs && ...), which silently skipped 0-turn agents. Builders spend 60–120s in a silent planning phase (reading the epic, calling Linear, creating tasks) before producing any IPC heartbeats, so the 30s stall threshold fires false positives.
  • Fix: Added startupGracePeriodMs (default: 2 minutes) to HealthCheckConfig. During the grace period, 0-turn agents are explicitly skipped. After the grace period, entry.createdAt is used as the baseline — so genuine hangs are still caught.
  • Messages: Stall alerts now distinguish "no turns since spawn" (0-turn post-grace) from "no stream progress for Xs" (mid-work stall).
  • Workspace fetch: workspace.ts now calls git fetch origin for local-repo worktrees before branching, and getDefaultBranch prefers origin/<branch> so builder branches always start from latest upstream main.

Changes

  • services/friday/src/monitor/agent-health.tsstartupGracePeriodMs config field, refactored stall block to handle grace period + IPC + legacy + 0-turn paths
  • services/friday/src/monitor/agent-health.test.ts — 5 new tests in runHealthCheck — startup grace period describe block (18 total, all passing)
  • services/friday/src/agent/workspace.tsfetchOriginSilently helper, called for local repos; getDefaultBranch prefers remote tracking ref
  • docs/architecture.md — updated agent-health row to document new behavior

Test plan

  • pnpm --filter @friday/daemon exec vitest run src/monitor/agent-health.test.ts — 18/18 pass
  • pnpm test — 353 tests across 25 daemon files pass; all other packages pass
  • pnpm --filter @friday/daemon exec tsc --noEmit — no type errors
  • pnpm --filter @friday/cli exec tsc --noEmit — no type errors

Fixes FRI-48.

🤖 Generated with Claude Code

sethvoltz and others added 4 commits May 2, 2026 21:32
… alerts

Builders in a silent planning/thinking phase produce no output and
no tool calls, making them appear stalled when they are actually
working. Previously, 0-turn agents were invisible to stall detection
(lastTs undefined → condition short-circuits to false).

Fix: introduce startupGracePeriodMs in HealthCheckConfig. During the
grace period, 0-turn agents are explicitly skipped. After the grace
period, entry.createdAt is used as the baseline so genuine hangs
(agent spawned but never started) are still caught. The stall message
now distinguishes "no turns since spawn" from "no progress since last
turn" for clearer triage.

Also: fetch origin before creating a worktree from a local repo so
builder branches always start from the latest upstream main rather
than a stale local ref. getDefaultBranch now prefers origin/<branch>
when the remote tracking ref exists.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The observed evidence (builders firing false-positive stall alerts at
31–102s after spawn) calls for a default that reflects reality: a
builder's silent planning phase reliably takes 60–120s. Using
stallThresholdMs (30s) as the default would still fire false positives.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The SDK emits SDKStatusMessage{status:'requesting'} when an API call
starts, before any text chunks arrive. This is the exact window where
the health monitor was firing false-positive stall alerts — the model
is actively thinking, not stalled.

Wire the signal through: worker.ts listens for type='system' /
subtype='status' / status='requesting' and emits api-active via IPC.
lifecycle.ts handles api-active the same as chunk-received (resets
lastChunkAt, clears toolCallActive + waitingForMail), so the stall
timer resets on every API invocation even in the silent planning phase.

The startup grace period remains as a safety net for the very first
invocation before the SDK has emitted any status messages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ives

The stall check's 3-condition IPC path (no chunk + no tool + not waiting
for mail) still fires when the model is silently thinking between turns:
the query() generator has been entered but no SDK messages have arrived yet.

Fix: emit query-started from worker.ts immediately before the for-await
loop. lifecycle.ts sets queryInFlight=true on this event and clears it on
turn-complete (or error). agent-health.ts adds !stallState.queryInFlight as
a 4th condition, so an active API request can never be misread as a stall.

The startup grace period (startupGracePeriodMs) is retained as a safety net
for the very first invocation, before the worker has had a chance to emit
query-started.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sethvoltz sethvoltz merged commit fd3f50f into main May 3, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant