Skip to content

fix(cloud-agent-next): destroy sandbox on prep 500#3081

Merged
eshurakov merged 1 commit intomainfrom
session/agent_8c28e523-5f39-4d55-a2dd-c5184978ed83
May 7, 2026
Merged

fix(cloud-agent-next): destroy sandbox on prep 500#3081
eshurakov merged 1 commit intomainfrom
session/agent_8c28e523-5f39-4d55-a2dd-c5184978ed83

Conversation

@kilo-code-bot
Copy link
Copy Markdown
Contributor

@kilo-code-bot kilo-code-bot Bot commented May 6, 2026

Summary

  • Destroy cloud-agent-next sandboxes when Sandbox SDK/internal 500s occur during workspace preparation.
  • Preserve sandbox 500 errors through disk-check and clone paths so recovery can identify unhealthy containers.
  • Tighten 500 message regexes with \b to avoid matching 5000/50012/etc.
  • Skip cleanupWorkspace on cold-start resume failure when the sandbox was already destroyed (exec against a destroyed sandbox would emit misleading warnings).
  • Drop the redundant WRAPPER_START_FAILED classification branch; the trailing `cause` recursion already covers it.
  • Add sandbox_500_detected / sandbox_500_destroyed / sandbox_500_destroy_failed logTags so we can measure hit rate, destroy success, and per-phase distribution in Axiom.
  • Add regression coverage for sandbox 500 classification and async preparation cleanup.

Verification

  • Manually verified the new logTags appear in test stdout via the existing async-preparation regression test.
  • Sample Axiom queries (run after deploy):
    • Hit rate per phase: `where logTag in ("sandbox_500_detected","sandbox_500_destroyed","sandbox_500_destroy_failed") | summarize count() by logTag, phase, bin(_time, 1h)`
    • Unique sessions affected: `where logTag == "sandbox_500_detected" | summarize dcount(sessionId) by bin(_time, 1d), phase`
    • Destroy success rate: `countif(logTag=="sandbox_500_destroyed") / (destroyed+failed)`

Visual Changes

N/A

Reviewer Notes

  • Cold-start resume in `session-service.ts` is the one call site that uses `destroySandboxAfterInternalServerError` directly (not via `withSandboxInternalServerErrorRecovery`); the new `sandboxDestroyed` gate keeps its retry semantics intact.
  • We deliberately did not add a single retry before destroy: same-sandbox retries have low success for control-plane 500s where the container is wedged, and destroy + fresh container is more reliable.
  • LogTags are the cheapest path to measurement (zero new infra). If we want proper time-series dashboards later, we can layer Cloudflare Analytics Engine on top without changing the recovery code.

@eshurakov eshurakov marked this pull request as ready for review May 6, 2026 20:42
Comment thread services/cloud-agent-next/src/sandbox-recovery.ts
@kilo-code-bot
Copy link
Copy Markdown
Contributor Author

kilo-code-bot Bot commented May 6, 2026

Code Review Summary

Status: 1 Issues Found | Recommendation: Address before merge

Overview

Severity Count
CRITICAL 0
WARNING 1
SUGGESTION 0
Issue Details (click to expand)

WARNING

File Line Issue
services/cloud-agent-next/src/workspace.ts 138 Stale workspace cleanup can still swallow sandbox 500s, preventing the new recovery path from destroying an unhealthy sandbox during low-disk cleanup.
Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

File Line Issue
N/A N/A N/A
Files Reviewed (9 files)
  • services/cloud-agent-next/src/execution/orchestrator.ts - 0 issues
  • services/cloud-agent-next/src/persistence/async-preparation.test.ts - 0 issues
  • services/cloud-agent-next/src/persistence/async-preparation.ts - 0 issues
  • services/cloud-agent-next/src/router/handlers/session-prepare.ts - 0 issues
  • services/cloud-agent-next/src/sandbox-recovery.test.ts - 0 issues
  • services/cloud-agent-next/src/sandbox-recovery.ts - 0 new issues
  • services/cloud-agent-next/src/session-service.ts - 0 issues
  • services/cloud-agent-next/src/workspace.test.ts - 0 issues
  • services/cloud-agent-next/src/workspace.ts - 1 issue

Fix these issues in Kilo Cloud


Reviewed by gpt-5.5-20260423 · 1,153,037 tokens

@eshurakov eshurakov force-pushed the session/agent_8c28e523-5f39-4d55-a2dd-c5184978ed83 branch from 2b20bf4 to b489a20 Compare May 7, 2026 07:36
- Tighten 500 regexes with \b to avoid matching 5000/50012/etc.
- Walk cause chain in recovery classifier so wrapped SandboxError 500s
  are still detected (mkdir error wrappers previously stripped identity,
  preventing destroy on control-plane 500s)
- Skip cleanupWorkspace on cold-start resume when sandbox was destroyed
  (exec against a destroyed sandbox is a no-op that emits misleading warnings)
- Drop redundant WRAPPER_START_FAILED branch; trailing cause recursion
  already covers it
- Add sandbox_500_detected/destroyed/destroy_failed logTags to measure
  hit rate and destroy success per phase in Axiom
@eshurakov eshurakov force-pushed the session/agent_8c28e523-5f39-4d55-a2dd-c5184978ed83 branch from b489a20 to 164cf55 Compare May 7, 2026 07:41
Comment thread services/cloud-agent-next/src/workspace.ts
@eshurakov eshurakov merged commit 1ba0436 into main May 7, 2026
13 checks passed
@eshurakov eshurakov deleted the session/agent_8c28e523-5f39-4d55-a2dd-c5184978ed83 branch May 7, 2026 07:46
try {
const result = await WrapperClient.ensureWrapper(sandbox, prepared.session, {
agentSessionId: sessionId,
userId,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could sandbox 500s from image attachment downloads be swallowed here? downloadImagesToSandbox catches per-file session.exec failures and this rethrows WORKSPACE_SETUP_FAILED without the original cause, so the recovery wrapper may not destroy an unhealthy sandbox.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants