Skip to content

feat(subprocess): observable subprocess helper — streaming + heartbeat + dual timeouts (spec 013)#1177

Merged
zbigniewsobiecki merged 4 commits intodevfrom
feat/013-observable-subprocess-helper
Apr 24, 2026
Merged

feat(subprocess): observable subprocess helper — streaming + heartbeat + dual timeouts (spec 013)#1177
zbigniewsobiecki merged 4 commits intodevfrom
feat/013-observable-subprocess-helper

Conversation

@zbigniewsobiecki
Copy link
Copy Markdown
Member

Why

CASCADE agent runs on ucho and elsewhere have been timing out silently when cascade-tools scm create-pr triggers slow pre-push hooks. Root cause traced in two failed runs (MNG-287 timed out at 29m59s, MNG-290 hung 7+ min on README PR):

runCommand() in src/utils/repo.ts uses spawn() with stdio:['pipe','pipe','pipe'] and buffers all child output into in-memory strings, emitting nothing until the subprocess exits. When the target repo's pre-push hook runs pnpm typecheck + pnpm test:run (~60s), the LLM-driven agent sees an empty output file, assumes hang, and burns 5–10 minutes retrying. Bonus issue: on successful push, captured output was discarded — agents never saw what the hook did. And no per-subprocess timeout (gadget-level 240s doesn't kill the child).

What ships

Replaces the hand-rolled spawn() wrapper with an execa + tree-kill implementation (industry default, ~140M dl/mo). New behavior:

  • Live streaming of child stdout/stderr to parent stderr line-by-line as it arrives
  • Heartbeat line [label] still running (Ns) on parent stderr every 30s of child silence (configurable, re-armed on each child output chunk)
  • Idle-silence timeout (default 120s) — kills the child and its descendants if no output for N ms
  • Wall-clock timeout (default 600s) — kills the child on total runtime, not reset by output
  • SIGTERM→SIGKILL ladder with a 5s grace window; process-group kill via tree-kill so hook grandchildren (test runners, subshells) die too
  • Captured output preserved in the result on success AND failure (previously discarded on success)
  • command bootstrap not found warning silenced — oclif command-loader glob now excludes bootstrap.js (it's a side-effect import, not a command)

Backward-compatible: existing { stdout, stderr, exitCode } result shape preserved; new optional reason: 'idle-timeout' | 'wall-timeout' surfaces when the helper killed the child.

Callers of runCommand inherit the new defaults for free. The two slow callsites in createPR (git push, git commit) pass explicit tighter timeouts (push: wall=230s idle=90s — just under the gadget's 240s ceiling; commit: wall=120s idle=60s) and now return captured hook output via new optional CreatePRResult.pushOutput and CreatePRResult.commitOutput fields.

Artifacts

  • Spec: docs/specs/013-subprocess-output-streaming.md.done
  • Plan: docs/plans/013-subprocess-output-streaming/1-observable-subprocess-helper.md.done
  • Coverage map: docs/plans/013-subprocess-output-streaming/_coverage.md

Tests

16 new tests (12 in tests/unit/utils/repo.test.ts, 4 in tests/unit/gadgets/github/core/createPR.test.ts):

  • streaming stdout / stderr line-by-line
  • heartbeat emission on silence (with elapsed-time + label assertion)
  • heartbeat resets on child output
  • no heartbeat for short commands (early exit)
  • idle-timeout SIGTERM + SIGKILL escalation
  • wall-timeout SIGTERM
  • captured output preserved on success / non-zero exit / silent mode
  • backward-compat 3-field signature
  • pushBranch passes explicit wallTimeoutMs ≤ 230_000 + finite idleTimeoutMs
  • pushBranch / stageAndCommit return captured hook output on success

4 existing tests aligned to the new call-shape (github.test.ts assertions on push/commit invocations).

Total: 8338/8338 unit tests pass (up from 8333, +5 net; 11 buffered-spawn tests retired, 16 new).

Verification

Gate Result
npm run build
npm run typecheck ✅ clean
npm run lint ✅ clean (cognitive-complexity refactor done on green)
npm test ✅ 8338/8338 + 23 pre-existing skipped

Pre-existing warnings / errors found: none. Lint + typecheck were already clean pre-change.

Spec AC #9[manual], deferred

Spec AC #9 ("a CASCADE agent sees hook progress within ~30s and does not retry") requires a live agent run. Cannot be exercised inside the unit/integration suite. The verification protocol lives in the plan's Manual Verification section — run it after this lands and a worker picks up the updated image.

Scope

All changes in src/utils/repo.ts, src/gadgets/github/core/createPR.ts, bin/cascade-tools.js, the two test files, + docs (README.md, docs/cascade-directory.md, CHANGELOG.md). No schema changes, no transport changes, no UI. Dashboard / router / worker untouched.

🤖 Generated with Claude Code

zbigniewsobiecki and others added 4 commits April 24, 2026 11:27
Captures the /specify + /plan artifacts for the observable subprocess
helper that fixes cascade-tools' silent hang during slow pre-push hooks
(seen in ucho MNG-287 and MNG-290). Single-plan decomposition —
implementation lands in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the spawn-based runCommand() in src/utils/repo.ts with an
execa + tree-kill implementation that:
- forwards child stdout/stderr to parent stderr line-by-line as it arrives
- emits a [label] still running (Ns) heartbeat on stderr every 30s of
  child silence (configurable)
- enforces idle-silence timeout (default 120s) and wall-clock timeout
  (default 600s) with SIGTERM→SIGKILL escalation via tree-kill
- preserves captured stdout/stderr in the result on success AND failure
- keeps the { stdout, stderr, exitCode } contract for all existing
  callers (new optional `reason` field for timeout-caused kills)

createPR gadget now passes tighter explicit timeouts on the two calls
that trigger user hooks: git push gets wallTimeoutMs=230_000 (just under
the gadget's 240s ceiling) and idleTimeoutMs=90_000; git commit gets
120_000/60_000. Both return captured hook output to the caller via new
optional CreatePRResult.pushOutput and CreatePRResult.commitOutput
fields; previously the successful-push branch discarded the output.

bin/cascade-tools.js globPatterns now excludes bootstrap.js, silencing
the oclif 'command bootstrap not found' warning that appeared on every
invocation.

8338/8338 unit tests pass; 0 lint errors; 0 type errors. Spec AC #9
(end-to-end agent verification) is tagged [manual] and deferred to a
live agent run post-merge — see plan's Manual Verification section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CLAUDE.md untouched by this spec — nothing to audit.
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 91.44737% with 13 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/utils/repo.ts 88.49% 11 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

@zbigniewsobiecki zbigniewsobiecki merged commit 045d9db into dev Apr 24, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant