Skip to content

chore: promote dev → main (worker exit diagnostics + agent docs dedupe + JIRA / Linear / PM-wizard fixes)#1194

Merged
zbigniewsobiecki merged 6 commits intomainfrom
dev
Apr 25, 2026
Merged

chore: promote dev → main (worker exit diagnostics + agent docs dedupe + JIRA / Linear / PM-wizard fixes)#1194
zbigniewsobiecki merged 6 commits intomainfrom
dev

Conversation

@zbigniewsobiecki
Copy link
Copy Markdown
Member

Summary

Promotes 5 dev commits to main:

Test plan

  • CI green on each underlying PR before merge to dev.
  • CI green on dev post-merge (CI 4m, Build and Deploy (Dev) 2m48s, CodeQL 2m32s).
  • Build and Deploy (Prod) green after this merges to main.

🤖 Generated with Claude Code

aaight and others added 6 commits April 24, 2026 18:02
…cation (#1183)

* refactor(pm-wizard): extract generic hooks — eliminate 58% code duplication

* fix(types): resolve type checking errors in pm-wizard-hooks.ts

* fix(pm-wizard): address review feedback — remove dead code and fix Trello CF validation

- Delete broken `useProviderDiscovery`, `DiscoveryConfig`, and `hasCredentials`
  (the hook constructed invalid capability strings like 'boardDetails' that do not
  exist in DISCOVERY_CAPABILITIES, causing guaranteed Zod 400 errors; per-provider
  discovery hooks are kept as intentionally correct)
- Restore Trello custom field validation: add `containerError?` to
  `CustomFieldCreationConfig` and throw it in `useProviderCustomFieldCreation` when
  `containerId` is empty, mirroring `LabelCreationConfig`; pass the error message
  from `useTrelloCustomFieldCreation` so the silent `|| 'global'` fallback can no
  longer bypass the board-required check
- Delete the deprecated one-liner `buildVerifyAuthArg` wrapper; call
  `buildProviderAuthArg` directly in `useVerification`
- Update file-level docstring, imports (remove unused `DiscoveryCapability`), and
  bottom re-exports to reflect deleted symbols

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Cascade Bot <bot@cascade.dev>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…on (#1188)

* fix(linear): resolve UUID from parent labels for auto-label propagation

* fix(agent-execution): scope non-UUID label warning to Linear only

The UUID validation warning in propagateAutoLabelAfterSplitting was firing
for Trello and JIRA projects in happy paths, causing log pollution. Trello
uses 24-character MongoDB Object IDs and JIRA uses name strings — both are
valid non-UUID formats for those providers.

Scope the warn() call to `project.pm.type === 'linear'` only, and update
the corresponding test comment to explain why no warning fires for Trello.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Cascade Bot <bot@cascade.dev>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…tructured envelope, --comment alias) (#1190)

* docs(014): spec + plans for cascade-tools agent ergonomics

Adds docs/specs/014-cascade-tools-agent-ergonomics.md plus two plans
covering shared-infra and create-pr-review adoption. Prompted by prod
run 5d993b04-6e05-4ae1-b7de-8c274cf3496b.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan-014): lock plan 1 (shared-infra)

* feat(cascade-tools): plan 014/1 shared-infra — truthful prompts + envelope

Ships the root-cause fix for prod run 5d993b04-6e05-4ae1-b7de-8c274cf3496b
plus the shared infrastructure every future gadget inherits:

- System-prompt renderer (src/backends/shared/nativeToolPrompts.ts) stops
  stripping trailing 's' from array param names and claiming '<string>
  (repeatable)' for every array. Array-of-object params now render as
  `--<flag> '<json>'` with aliases appended via `|` and a one-line runnable
  example from the tool definition.
- Factory (src/gadgets/shared/cliCommandFactory.ts) gains oclif flag aliases,
  JSON parsing for array-of-object flags, file-input JSON parsing, `examples`
  wired into oclif `--help`, and Levenshtein-based 'did you mean' suggestions
  for mistyped flags (via fastest-levenshtein).
- New shared error envelope (src/gadgets/shared/errorEnvelope.ts) — every
  CLI failure emits `{"success":false,"error":{type,flag?,message,got?,
  expected?,hint?,example?}}` on stdout plus a one-line prose summary on
  stderr. All prior `this.error()` / flat `{success:false,error:"<string>"}`
  call sites migrated.
- Contracts widened: ParameterDefinition gains `cliAliases`, FileInput-
  Alternative gains `parseAs`, ToolManifest parameters carry `items`,
  `aliases`, `example`.
- Manifest generator threads the new fields through.
- bin/cascade-tools.js wraps `run()` to swallow oclif ExitError cleanly so
  the envelope isn't obscured by Node's default stack dump.

Plan-1 ACs #1#17 all delivered. 8438/8438 unit tests passing.

Test surface delta: 57 new unit tests across errorEnvelope.test.ts,
shared-nativeToolPrompts.test.ts, and factories.test.ts. Seven legacy
assertions encoding the pre-014 error surface updated in cli/cli-command-
factory, cli/file-input-flags, cli/scm/create-pr-sidecar, cli/scm/create-
pr-review-sidecar, backends/claude-code.

Plan 2 adopts the pattern on createPRReviewDef — zero shared-file edits —
proving the declarative-metadata invariant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan-014): lock plan 2 (createprreview-adopt)

* feat(cascade-tools): plan 014/2 createprreview-adopt + spec done

Applies the spec-014 declarative-metadata pattern to createPRReviewDef:

- --comment alias for --comments (the exact muscle-memory mistake from
  prod run 5d993b04-6e05-4ae1-b7de-8c274cf3496b).
- --comments-file <path> (and - for stdin) JSON-parsed escape hatch for
  long payloads that don't survive shell quoting.
- Two declarative fields on createPRReviewDef.parameters.comments.cliAliases
  + createPRReviewDef.cli.fileInputAlternatives. Zero edits to shared
  infrastructure (cliCommandFactory, manifestGenerator, nativeToolPrompts,
  errorEnvelope) — proves spec 014's single-entrypoint invariant.

Per-plan ACs #1, #2, #3, #5, #6, #7, #8, #9, #11, #12 auto-verified
(unit tests + build + lint + typecheck). AC #4 (binary-level smoke)
tagged [manual] because vitest fork-pool workers fail to capture
stdout/stderr from spawned binaries that do top-level await import();
the six scenarios were verified manually against the built binary and
the trace is recorded in the plan. AC #10 n/a — integration test path
abandoned for the same reason.

All plans done. Spec 014 marked .done (docs/specs/014-*.md → .done).
CHANGELOG Unreleased updated with a per-plan entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a worker container exits non-zero, the router previously logged
only `statusCode` and stamped the run record's `error` field with the
generic `Worker crashed with exit code N`. That collapsed every kind
of failure — cgroup OOM, internal SIGKILL, codex CLI crash, runtime
abort — into the same opaque message. Investigating ucho's recent
exit-137 runs required ssh + syslog grep on the bauer host because
cascade itself had discarded the diagnostic signal.

This commit captures the signal at the source: `container.inspect()`
runs immediately after `wait()` (before AutoRemove or our manual
`removeContainer` reaps the container) and pulls `State.OOMKilled`,
`State.Error`, and the actual `StartedAt → FinishedAt` duration from
Docker's own clocks. The crash reason on the run record is now a
structured, grep-stable string:

    Worker crashed with exit code 137 · OOMKilled=true · reason="Out of memory"

`OOMKilled=true` is the *definitive* cgroup-OOM signal; a 137 exit
*without* it means the kill came from inside the container or from a
non-cgroup signal, not memory. Future post-mortems get the answer
from `cascade runs show <id> --json` instead of the host's syslog.

Also: `[WorkerManager] Resolved spawn settings` is now emitted at
every spawn with both `projectWatchdogTimeoutMs` and
`globalWorkerTimeoutMs` so the "did the per-project override actually
win?" question is one log query away. This was a real load-bearing
unknown during the ucho exit-137 investigation — the project config
said 45 min but production behavior matched the global 30 min env
default.

Surfaces:

* `src/router/active-workers.ts` — new `ExitDetails` type +
  `formatCrashReason(exitCode, details?)` helper. `cleanupWorker()`
  takes an optional third `details` arg and packs the diagnostics
  into `failOrphanedRun` / `failOrphanedRunFallback`'s reason string.
  Existing callers that don't pass details fall back to the bare
  message.
* `src/router/container-manager.ts` — `inspectExitedContainer()`
  reads Docker's `State` immediately post-wait. `logWorkerTail()` and
  `onWorkerExit()` extracted from the wait callback for complexity.
  `resolveSpawnSettings()` now emits the structured `Resolved spawn
  settings` log. Defensive: malformed/sentinel timestamps yield
  `durationMs: undefined` rather than `NaN` leaking into Sentry.
* `CLAUDE.md` — new "Worker exit diagnostics" paragraph documenting
  the format and the load-bearing logs.
* `AGENTS.md` — symlinked to `CLAUDE.md`. The previous untracked
  copy was stale (broken `Codex setup-token` / `~/.Codex.json`
  artefacts from a botched search-replace, missing the work-item
  concurrency lock + post-completion review dispatch sections, and
  outdated integration text). One source of truth.

Tests:

* `tests/unit/router/active-workers.test.ts` — extended with 6 new
  cases pinning the OOMKilled / exitReason fields onto the failOrphan
  reason string (workItem path + fallback path).
* `tests/unit/router/container-manager-diagnostics.test.ts` (new) —
  24 direct tests across three suites:
  * `formatCrashReason` (8): bare/oom/reason permutations + grep-
    stability regression for the `· ` separator and `OOMKilled=…`
    marker. The format is now de-facto API for any future dashboard
    parser; bumping it silently fails CI.
  * `inspectExitedContainer` (12): OOMKilled true/false, exitReason
    extraction, durationMs from real timestamps + Docker's
    `0001-01-01` sentinel + malformed strings + missing-half +
    inverted (negative-span) timestamps + inspect-rejection path.
    Diagnostics are best-effort: rejection logs a warn and returns
    all-undefined, never throws.
  * `resolveSpawnSettings` (4): per-project watchdogTimeoutMs override
    applied → 45+2 = 47 min; no project override → falls through to
    global; null projectId → no log + no `loadProjectConfig` call;
    Math.round on non-integer minutes.
* `tests/unit/router/snapshot-integration.test.ts` — `setupMockContainer`
  now stubs `inspect()` so the post-exit pipeline runs through cleanly
  in the snapshot tests (production code no longer needs to defend
  against a test-mock gap).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iagnostics

feat(router): capture OOMKilled + exit reason on worker exits
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 25, 2026

@zbigniewsobiecki zbigniewsobiecki merged commit 022bae8 into main Apr 25, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants