Problem
Live desks can leave an OpenClaw container stuck in a bad state where the process remains up, Docker does not restart it, and scheduled invocations begin accumulating failures.
Current failure chain:
- OpenClaw startup can be slow enough that the scheduler reaches it before the runner is actually healthy.
- claw-api currently treats wake execution as an immediate blocking action and records wake failures when the target is not yet healthy.
- The OpenClaw driver's compose healthcheck and claw health probe shell into openclaw health --json, but a persistently unhealthy gateway does not automatically exit the container, so Docker restart policy never fires.
- PostApply only checks that the container is running, not that the gateway has actually become healthy.
Scope
- Make claw-api health-aware before waking a managed service so startup lag does not count as a schedule failure.
- Improve OpenClaw startup verification so claw up -d waits for real liveness, not just a running PID.
- Add a liveness mechanism that turns persistent OpenClaw health failure into container exit so Docker restart policy can recover the service.
- Clarify ownership of runner-native cron state so agents do not treat OpenClaw-native cron as durable infrastructure state.
Likely files
- cmd/claw-api/scheduler.go
- cmd/claw-api/main.go or adjacent runtime loop code
- internal/driver/openclaw/driver.go
- internal/driver/openclaw/baseimage.go
- internal/driver/shared/clawdapus_md.go
Verification
- targeted Go tests for scheduler health-gating and OpenClaw post-apply/health behavior
- integration/spike coverage showing startup lag is skipped rather than counted as failure
- verification that a persistently unhealthy OpenClaw gateway causes container restart under the existing restart policy
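For the last verification item, the behavior under test is "persistent health failure leads to process exit". One possible shape is a consecutive-failure counter driving an exit callback. failureTracker, run, and the threshold value are hypothetical; in a container entrypoint the probe would presumably wrap openclaw health --json and the exit callback would terminate the process so the restart policy applies.

```go
package main

import (
	"fmt"
	"time"
)

// failureTracker (hypothetical) counts consecutive failed probes.
type failureTracker struct {
	threshold   int
	consecutive int
}

// observe records one probe result and reports whether the number of
// consecutive failures has reached the threshold. Any healthy probe
// resets the count, so only persistent failure trips it.
func (t *failureTracker) observe(healthy bool) bool {
	if healthy {
		t.consecutive = 0
		return false
	}
	t.consecutive++
	return t.consecutive >= t.threshold
}

// run probes until the tracker trips, then calls exit once and returns.
// In a real entrypoint, exit might be func() { os.Exit(1) } (assumption).
func run(t *failureTracker, probe func() bool, exit func(), interval time.Duration) {
	for {
		if t.observe(probe()) {
			exit()
			return
		}
		time.Sleep(interval)
	}
}

func main() {
	// Simulated probe results: a transient blip, then persistent failure.
	results := []bool{true, false, true, false, false, false}
	i := 0
	probe := func() bool { r := results[i]; i++; return r }

	exited := false
	run(&failureTracker{threshold: 3}, probe, func() { exited = true }, 0)
	fmt.Println("exited:", exited, "after probes:", i)
}
```

Injecting the probe and exit functions lets a targeted Go test assert the trip point without killing the test process.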