chore: promote dev → main (worker exit diagnostics + agent docs dedupe + JIRA / Linear / PM-wizard fixes)#1194
Merged
zbigniewsobiecki merged 6 commits intomainfrom Apr 25, 2026
Merged
Conversation
…cation (#1183) * refactor(pm-wizard): extract generic hooks — eliminate 58% code duplication * fix(types): resolve type checking errors in pm-wizard-hooks.ts * fix(pm-wizard): address review feedback — remove dead code and fix Trello CF validation - Delete broken `useProviderDiscovery`, `DiscoveryConfig`, and `hasCredentials` (the hook constructed invalid capability strings like 'boardDetails' that do not exist in DISCOVERY_CAPABILITIES, causing guaranteed Zod 400 errors; per-provider discovery hooks are kept as intentionally correct) - Restore Trello custom field validation: add `containerError?` to `CustomFieldCreationConfig` and throw it in `useProviderCustomFieldCreation` when `containerId` is empty, mirroring `LabelCreationConfig`; pass the error message from `useTrelloCustomFieldCreation` so the silent `|| 'global'` fallback can no longer bypass the board-required check - Delete the deprecated one-liner `buildVerifyAuthArg` wrapper; call `buildProviderAuthArg` directly in `useVerification` - Update file-level docstring, imports (remove unused `DiscoveryCapability`), and bottom re-exports to reflect deleted symbols Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Cascade Bot <bot@cascade.dev> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…on (#1188) * fix(linear): resolve UUID from parent labels for auto-label propagation * fix(agent-execution): scope non-UUID label warning to Linear only The UUID validation warning in propagateAutoLabelAfterSplitting was firing for Trello and JIRA projects in happy paths, causing log pollution. Trello uses 24-character MongoDB Object IDs and JIRA uses name strings — both are valid non-UUID formats for those providers. Scope the warn() call to `project.pm.type === 'linear'` only, and update the corresponding test comment to explain why no warning fires for Trello. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Cascade Bot <bot@cascade.dev> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…1189) Co-authored-by: Cascade Bot <bot@cascade.dev>
…tructured envelope, --comment alias) (#1190) * docs(014): spec + plans for cascade-tools agent ergonomics Adds docs/specs/014-cascade-tools-agent-ergonomics.md plus two plans covering shared-infra and create-pr-review adoption. Prompted by prod run 5d993b04-6e05-4ae1-b7de-8c274cf3496b. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(plan-014): lock plan 1 (shared-infra) * feat(cascade-tools): plan 014/1 shared-infra — truthful prompts + envelope Ships the root-cause fix for prod run 5d993b04-6e05-4ae1-b7de-8c274cf3496b plus the shared infrastructure every future gadget inherits: - System-prompt renderer (src/backends/shared/nativeToolPrompts.ts) stops stripping trailing 's' from array param names and claiming '<string> (repeatable)' for every array. Array-of-object params now render as `--<flag> '<json>'` with aliases appended via `|` and a one-line runnable example from the tool definition. - Factory (src/gadgets/shared/cliCommandFactory.ts) gains oclif flag aliases, JSON parsing for array-of-object flags, file-input JSON parsing, `examples` wired into oclif `--help`, and Levenshtein-based 'did you mean' suggestions for mistyped flags (via fastest-levenshtein). - New shared error envelope (src/gadgets/shared/errorEnvelope.ts) — every CLI failure emits `{"success":false,"error":{type,flag?,message,got?, expected?,hint?,example?}}` on stdout plus a one-line prose summary on stderr. All prior `this.error()` / flat `{success:false,error:"<string>"}` call sites migrated. - Contracts widened: ParameterDefinition gains `cliAliases`, FileInput- Alternative gains `parseAs`, ToolManifest parameters carry `items`, `aliases`, `example`. - Manifest generator threads the new fields through. - bin/cascade-tools.js wraps `run()` to swallow oclif ExitError cleanly so the envelope isn't obscured by Node's default stack dump. Plan-1 ACs #1–#17 all delivered. 8438/8438 unit tests passing. Test surface delta: 57 new unit tests across errorEnvelope.test.ts, shared-nativeToolPrompts.test.ts, and factories.test.ts. Seven legacy assertions encoding the pre-014 error surface updated in cli/cli-command- factory, cli/file-input-flags, cli/scm/create-pr-sidecar, cli/scm/create- pr-review-sidecar, backends/claude-code. Plan 2 adopts the pattern on createPRReviewDef — zero shared-file edits — proving the declarative-metadata invariant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(plan-014): lock plan 2 (createprreview-adopt) * feat(cascade-tools): plan 014/2 createprreview-adopt + spec done Applies the spec-014 declarative-metadata pattern to createPRReviewDef: - --comment alias for --comments (the exact muscle-memory mistake from prod run 5d993b04-6e05-4ae1-b7de-8c274cf3496b). - --comments-file <path> (and - for stdin) JSON-parsed escape hatch for long payloads that don't survive shell quoting. - Two declarative fields on createPRReviewDef.parameters.comments.cliAliases + createPRReviewDef.cli.fileInputAlternatives. Zero edits to shared infrastructure (cliCommandFactory, manifestGenerator, nativeToolPrompts, errorEnvelope) — proves spec 014's single-entrypoint invariant. Per-plan ACs #1, #2, #3, #5, #6, #7, #8, #9, #11, #12 auto-verified (unit tests + build + lint + typecheck). AC #4 (binary-level smoke) tagged [manual] because vitest fork-pool workers fail to capture stdout/stderr from spawned binaries that do top-level await import(); the six scenarios were verified manually against the built binary and the trace is recorded in the plan. AC #10 n/a — integration test path abandoned for the same reason. All plans done. Spec 014 marked .done (docs/specs/014-*.md → .done). CHANGELOG Unreleased updated with a per-plan entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a worker container exits non-zero, the router previously logged
only `statusCode` and stamped the run record's `error` field with the
generic `Worker crashed with exit code N`. That collapsed every kind
of failure — cgroup OOM, internal SIGKILL, codex CLI crash, runtime
abort — into the same opaque message. Investigating ucho's recent
exit-137 runs required ssh + syslog grep on the bauer host because
cascade itself had discarded the diagnostic signal.
This commit captures the signal at the source: `container.inspect()`
runs immediately after `wait()` (before AutoRemove or our manual
`removeContainer` reaps the container) and pulls `State.OOMKilled`,
`State.Error`, and the actual `StartedAt → FinishedAt` duration from
Docker's own clocks. The crash reason on the run record is now a
structured, grep-stable string:
Worker crashed with exit code 137 · OOMKilled=true · reason="Out of memory"
`OOMKilled=true` is the *definitive* cgroup-OOM signal; a 137 exit
*without* it means the kill came from inside the container or from a
non-cgroup signal, not memory. Future post-mortems get the answer
from `cascade runs show <id> --json` instead of the host's syslog.
Also: `[WorkerManager] Resolved spawn settings` is now emitted at
every spawn with both `projectWatchdogTimeoutMs` and
`globalWorkerTimeoutMs` so the "did the per-project override actually
win?" question is one log query away. This was a real load-bearing
unknown during the ucho exit-137 investigation — the project config
said 45 min but production behavior matched the global 30 min env
default.
Surfaces:
* `src/router/active-workers.ts` — new `ExitDetails` type +
`formatCrashReason(exitCode, details?)` helper. `cleanupWorker()`
takes an optional third `details` arg and packs the diagnostics
into `failOrphanedRun` / `failOrphanedRunFallback`'s reason string.
Existing callers that don't pass details fall back to the bare
message.
* `src/router/container-manager.ts` — `inspectExitedContainer()`
reads Docker's `State` immediately post-wait. `logWorkerTail()` and
`onWorkerExit()` extracted from the wait callback for complexity.
`resolveSpawnSettings()` now emits the structured `Resolved spawn
settings` log. Defensive: malformed/sentinel timestamps yield
`durationMs: undefined` rather than `NaN` leaking into Sentry.
* `CLAUDE.md` — new "Worker exit diagnostics" paragraph documenting
the format and the load-bearing logs.
* `AGENTS.md` — symlinked to `CLAUDE.md`. The previous untracked
copy was stale (broken `Codex setup-token` / `~/.Codex.json`
artefacts from a botched search-replace, missing the work-item
concurrency lock + post-completion review dispatch sections, and
outdated integration text). One source of truth.
Tests:
* `tests/unit/router/active-workers.test.ts` — extended with 6 new
cases pinning the OOMKilled / exitReason fields onto the failOrphan
reason string (workItem path + fallback path).
* `tests/unit/router/container-manager-diagnostics.test.ts` (new) —
24 direct tests across three suites:
* `formatCrashReason` (8): bare/oom/reason permutations + grep-
stability regression for the `· ` separator and `OOMKilled=…`
marker. The format is now de-facto API for any future dashboard
parser; bumping it silently fails CI.
* `inspectExitedContainer` (12): OOMKilled true/false, exitReason
extraction, durationMs from real timestamps + Docker's
`0001-01-01` sentinel + malformed strings + missing-half +
inverted (negative-span) timestamps + inspect-rejection path.
Diagnostics are best-effort: rejection logs a warn and returns
all-undefined, never throws.
* `resolveSpawnSettings` (4): per-project watchdogTimeoutMs override
applied → 45+2 = 47 min; no project override → falls through to
global; null projectId → no log + no `loadProjectConfig` call;
Math.round on non-integer minutes.
* `tests/unit/router/snapshot-integration.test.ts` — `setupMockContainer`
now stubs `inspect()` so the post-exit pipeline runs through cleanly
in the snapshot tests (production code no longer needs to defend
against a test-mock gap).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iagnostics feat(router): capture OOMKilled + exit reason on worker exits
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Promotes 5 dev commits to main:
feat(router): capture OOMKilled + exit reason on worker exits #1193 —
feat(router): capture OOMKilled + exit reason on worker exitsReplaces opaque
Worker crashed with exit code Nwith structuredWorker crashed with exit code 137 · OOMKilled=true · reason="…",sourced from Docker's
container.inspect()post-wait(). Adds[WorkerManager] Resolved spawn settingslog so per-projectwatchdogTimeoutMsoverrides are observably distinguishable fromthe global default. 30 new tests pin the format and the spawn-
settings diagnostic. Unblocks future exit-137 post-mortems without
ssh+syslog.
feat(cascade-tools): spec 014 — agent ergonomics (truthful prompts, structured envelope, --comment alias) #1190 —
feat(cascade-tools): spec 014 — agent ergonomicsTruthful prompts, structured envelope,
--commentalias.feat(agents): deduplicate CLAUDE.md and AGENTS.md in agent context #1189 —
feat(agents): deduplicate CLAUDE.md and AGENTS.md in agent contextfix(linear): resolve UUID from parent labels for auto-label propagation #1188 —
fix(linear): resolve UUID from parent labels for auto-label propagationrefactor(pm-wizard): extract generic hooks — eliminate 58% code duplication #1183 —
refactor(pm-wizard): extract generic hooks — eliminate 58% code duplicationTest plan
🤖 Generated with Claude Code