fix(daemon): startup grace period to prevent false-positive stall alerts #12
Merged
… alerts

Builders in a silent planning/thinking phase produce no output and no tool calls, making them appear stalled when they are actually working. Previously, 0-turn agents were invisible to stall detection (lastTs undefined → condition short-circuits to false).

Fix: introduce startupGracePeriodMs in HealthCheckConfig. During the grace period, 0-turn agents are explicitly skipped. After the grace period, entry.createdAt is used as the baseline so genuine hangs (agent spawned but never started) are still caught. The stall message now distinguishes "no turns since spawn" from "no progress since last turn" for clearer triage.

Also: fetch origin before creating a worktree from a local repo so builder branches always start from the latest upstream main rather than a stale local ref. getDefaultBranch now prefers origin/<branch> when the remote tracking ref exists.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
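The grace-period logic in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from agent-health.ts: the field names startupGracePeriodMs, stallThresholdMs, createdAt and lastTs come from the text, while the interfaces, the detectStall function, and the choice to compare time since spawn against stallThresholdMs once the grace period has passed are assumptions.

```typescript
// Field names come from the PR text; the shapes and detectStall are hypothetical.
interface HealthCheckConfig {
  stallThresholdMs: number;
  startupGracePeriodMs: number;
}

interface AgentEntry {
  createdAt: number; // spawn time, epoch ms
  lastTs?: number;   // time of the last turn; undefined for a 0-turn agent
}

// Returns a stall message, or null when the agent is not considered stalled.
function detectStall(
  entry: AgentEntry,
  now: number,
  config: HealthCheckConfig,
): string | null {
  if (entry.lastTs === undefined) {
    const sinceSpawnMs = now - entry.createdAt;
    // Inside the grace period, 0-turn agents are skipped explicitly.
    if (sinceSpawnMs < config.startupGracePeriodMs) return null;
    // Past the grace period, createdAt is the baseline (assumed comparison).
    if (sinceSpawnMs > config.stallThresholdMs) {
      return `no turns since spawn (${Math.round(sinceSpawnMs / 1000)}s)`;
    }
    return null;
  }
  // Agents with at least one turn are measured from their last turn.
  const sinceTurnMs = now - entry.lastTs;
  if (sinceTurnMs > config.stallThresholdMs) {
    return `no progress since last turn (${Math.round(sinceTurnMs / 1000)}s)`;
  }
  return null;
}
```

With a 30s threshold and a 2-minute grace period, a 0-turn agent that is 60s old is skipped, while one that is 130s old is reported with the "no turns since spawn" wording.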
The observed evidence (builders firing false-positive stall alerts at 31–102s after spawn) calls for a default that reflects reality: a builder's silent planning phase reliably takes 60–120s. Using stallThresholdMs (30s) as the default would still fire false positives.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
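The arithmetic behind the choice of default can be checked with a small sketch. The 31s and 102s values are the endpoints of the observed range above and 120000 ms is the 2-minute default stated in the PR summary; the firesForZeroTurnAgent helper and its rule (alert once the grace period has elapsed and the time since spawn exceeds the stall threshold) are assumptions made for illustration.

```typescript
// Hypothetical rule: a 0-turn agent is flagged once the grace period has elapsed
// and the time since spawn exceeds the stall threshold.
function firesForZeroTurnAgent(
  elapsedMs: number,
  graceMs: number,
  stallThresholdMs: number,
): boolean {
  return elapsedMs >= graceMs && elapsedMs > stallThresholdMs;
}

const STALL_THRESHOLD_MS = 30000;

// Endpoints of the observed 31–102s false-positive range.
const observedMs = [31000, 102000];

// Grace period equal to stallThresholdMs (30s): both observations still alert.
const withThresholdAsGrace = observedMs.map((t) =>
  firesForZeroTurnAgent(t, STALL_THRESHOLD_MS, STALL_THRESHOLD_MS),
);

// Grace period of 2 minutes: neither observation alerts.
const withTwoMinuteGrace = observedMs.map((t) =>
  firesForZeroTurnAgent(t, 120000, STALL_THRESHOLD_MS),
);
```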
The SDK emits SDKStatusMessage{status:'requesting'} when an API call
starts, before any text chunks arrive. This is the exact window where
the health monitor was firing false-positive stall alerts — the model
is actively thinking, not stalled.
Wire the signal through: worker.ts listens for type='system' /
subtype='status' / status='requesting' and emits api-active via IPC.
lifecycle.ts handles api-active the same as chunk-received (resets
lastChunkAt, clears toolCallActive + waitingForMail), so the stall
timer resets on every API invocation even in the silent planning phase.
The startup grace period remains as a safety net for the very first
invocation before the SDK has emitted any status messages.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
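The wiring described in this commit might look roughly like the sketch below. The message fields (type, subtype, status), the event names api-active and chunk-received, and the state fields lastChunkAt, toolCallActive and waitingForMail are taken from the text; the type definitions and the toIpcEvent and applyEvent functions are hypothetical stand-ins for the logic in worker.ts and lifecycle.ts.

```typescript
// Only the fields mentioned in the commit message are modelled here.
type SdkMessage = { type: string; subtype?: string; status?: string };
type IpcEvent = "api-active" | "chunk-received";

// Worker side (hypothetical helper): map an SDK status message to an IPC event.
function toIpcEvent(msg: SdkMessage): IpcEvent | null {
  if (msg.type === "system" && msg.subtype === "status" && msg.status === "requesting") {
    return "api-active";
  }
  return null; // other message kinds are outside the scope of this sketch
}

interface StallState {
  lastChunkAt: number;
  toolCallActive: boolean;
  waitingForMail: boolean;
}

// Lifecycle side (hypothetical helper): api-active is handled the same way as
// chunk-received, so the stall timer restarts on every API invocation.
function applyEvent(state: StallState, event: IpcEvent, now: number): StallState {
  switch (event) {
    case "api-active":
    case "chunk-received":
      return { ...state, lastChunkAt: now, toolCallActive: false, waitingForMail: false };
  }
}
```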
…ives

The stall check's 3-condition IPC path (no chunk + no tool + not waiting for mail) still fires when the model is silently thinking between turns: the query() generator has been entered but no SDK messages have arrived yet.

Fix: emit query-started from worker.ts immediately before the for-await loop. lifecycle.ts sets queryInFlight=true on this event and clears it on turn-complete (or error). agent-health.ts adds !stallState.queryInFlight as a 4th condition, so an active API request can never be misread as a stall.

The startup grace period (startupGracePeriodMs) is retained as a safety net for the very first invocation, before the worker has had a chance to emit query-started.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
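A minimal sketch of the four-condition check follows. The event names (query-started, turn-complete, error) and the queryInFlight, toolCallActive, waitingForMail and lastChunkAt fields come from the commit message; the StallState shape, the function names, and the interpretation of "no chunk" as "lastChunkAt is older than the stall threshold" are assumptions.

```typescript
interface StallState {
  lastChunkAt: number;
  toolCallActive: boolean;
  waitingForMail: boolean;
  queryInFlight: boolean;
}

type LifecycleEvent = "query-started" | "turn-complete" | "error";

// Hypothetical reducer: set the flag when a query starts, clear it when the
// turn completes or fails.
function onLifecycleEvent(state: StallState, event: LifecycleEvent): StallState {
  switch (event) {
    case "query-started":
      return { ...state, queryInFlight: true };
    case "turn-complete":
    case "error":
      return { ...state, queryInFlight: false };
  }
}

// The IPC stall check with queryInFlight as the 4th condition.
function isStalled(state: StallState, now: number, stallThresholdMs: number): boolean {
  const noChunk = now - state.lastChunkAt > stallThresholdMs; // assumed meaning of "no chunk"
  return noChunk && !state.toolCallActive && !state.waitingForMail && !state.queryInFlight;
}
```

In this sketch an agent that has been silent for longer than the threshold is reported only when no query is in flight, which is the behavior the commit message describes.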
Summary
- runHealthCheck stall detection checked if (lastTs && ...), which silently skipped 0-turn agents. Builders spend 60–120s in a silent planning phase (reading the epic, calling Linear, creating tasks) before producing any IPC heartbeats, so the 30s stall threshold fires false positives.
- Adds startupGracePeriodMs (default: 2 minutes) to HealthCheckConfig. During the grace period, 0-turn agents are explicitly skipped. After the grace period, entry.createdAt is used as the baseline, so genuine hangs are still caught.
- workspace.ts now calls git fetch origin for local-repo worktrees before branching, and getDefaultBranch prefers origin/<branch> so builder branches always start from latest upstream main.

Changes
- services/friday/src/monitor/agent-health.ts: startupGracePeriodMs config field, refactored stall block to handle grace period + IPC + legacy + 0-turn paths
- services/friday/src/monitor/agent-health.test.ts: 5 new tests in the "runHealthCheck — startup grace period" describe block (18 total, all passing)
- services/friday/src/agent/workspace.ts: fetchOriginSilently helper, called for local repos; getDefaultBranch prefers remote tracking ref
- docs/architecture.md: updated agent-health row to document new behavior

Test plan
- pnpm --filter @friday/daemon exec vitest run src/monitor/agent-health.test.ts: 18/18 pass
- pnpm test: 353 tests across 25 daemon files pass; all other packages pass
- pnpm --filter @friday/daemon exec tsc --noEmit: no type errors
- pnpm --filter @friday/cli exec tsc --noEmit: no type errors

Fixes FRI-48.
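The branch-selection rule from the workspace.ts change can be illustrated with a short sketch. It is not the actual getDefaultBranch implementation: pickBaseRef is a hypothetical name, and the ref lookup is injected as a function so the example runs without a repository.

```typescript
// Hypothetical helper: prefer the remote tracking ref when it exists, else
// fall back to the local branch. In a real repository, refExists might wrap
// something like `git show-ref --verify --quiet <ref>`.
function pickBaseRef(branch: string, refExists: (ref: string) => boolean): string {
  const remoteRef = `refs/remotes/origin/${branch}`;
  return refExists(remoteRef) ? `origin/${branch}` : branch;
}

// Example lookups against a fixed set of known refs.
const knownRefs = new Set(["refs/heads/main", "refs/remotes/origin/main"]);
const withRemote = pickBaseRef("main", (ref) => knownRefs.has(ref));
const withoutRemote = pickBaseRef("main", () => false);
```

Here withRemote resolves to origin/main and withoutRemote falls back to main, which matches the described preference for the remote tracking ref when it exists.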
🤖 Generated with Claude Code