Skip to content

perf: optimise sandbox boot time with parallel startup and symlinked deps#506

Closed
ColeMurray wants to merge 2 commits intomainfrom
perf/boot-time-optimizations
Closed

perf: optimise sandbox boot time with parallel startup and symlinked deps#506
ColeMurray wants to merge 2 commits intomainfrom
perf/boot-time-optimizations

Conversation

@ColeMurray
Copy link
Copy Markdown
Owner

@ColeMurray ColeMurray commented Apr 19, 2026

Summary

Optimises container boot time for sandbox environments, validated in production with measured results.

Pre-stage .opencode/ in Docker image — assembles tools, skills, deps, and node_modules into /app/opencode-stage/ at image build time. At boot, node_modules is symlinked (instant) instead of shutil.copytree (~2.3s), and the remaining small files are shallow-copied.

Parallel startup for repo_image boots — OpenCode and sidecars (code-server, ttyd) are started concurrently with git sync and hooks via asyncio.create_task, overlapping ~2.8s of Node.js cold start with work that's already happening. The bridge still starts last, so prompts can't arrive before git sync and hooks complete.

Health check poll interval reduced from 500ms to 100ms, saving ~200ms on average.

Measured results (production repo_image boot)

Metric Before After
opencode.start → file install complete 2,271ms 27ms
Health check poll overhead up to 500ms ~100ms
Total opencode.startopencode.ready ~4,500ms 2,877ms

The parallel startup change (OpenCode + sidecars concurrent with git sync) will further reduce end-to-end boot time by hiding OpenCode's ~2.8s Node.js cold start behind the git fetch.

Also included

  • Extract _start_sidecars() method for parallel execution
  • Legacy fallback (_install_tools_legacy) for pre-v47 images
  • Add .claude/worktrees/ to .gitignore and ESLint ignores
  • Add packages/sandbox-runtime/**/*.py to lint-staged ruff checks

Test plan

  • 269 sandbox-runtime tests pass (including updated test_start_script_failure_is_fatal)
  • 120 modal-infra tests pass
  • Ruff lint + format clean (from package directory)
  • ESLint + Prettier clean
  • Validated in production — opencode.stage_installed event confirms fast path active

Summary by CodeRabbit

Release Notes

  • Performance & Infrastructure
    • Optimized sandbox runtime startup process with improved file staging efficiency
    • Enhanced health monitoring responsiveness through faster polling intervals
    • Improved development tool integration and service availability within sandbox environments

…deps

- Pre-stage .opencode/ layout in Docker image (tools, skills, deps,
  node_modules) so boot avoids per-file copy from scattered sources
- Symlink node_modules instead of shutil.copytree (~2.3s → 27ms)
- Start OpenCode and sidecars concurrently with git sync for
  repo_image boots, overlapping ~2.8s of Node.js cold start
- Reduce health check poll interval from 500ms to 100ms
- Extract _start_sidecars() for parallel execution
- Add .claude/worktrees/ to .gitignore and ESLint ignores
- Add sandbox-runtime to lint-staged ruff checks
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 19, 2026

📝 Walkthrough

Walkthrough

Infrastructure updates to optimize sandbox image builds and runtime startup. Pre-stages OpenCode dependencies into the Docker image layer, introduces a fast-path installation flow in the runtime with legacy fallback, implements concurrent sidecar startup, and adds Python linting for the sandbox-runtime package.

Changes

Cohort / File(s) Summary
Configuration & Ignores
.gitignore, eslint.config.js
Excludes .claude/worktrees/ directory from Git tracking and ESLint processing.
Lint Staging
package.json
Adds Python linting (ruff check --fix and ruff format) for packages/sandbox-runtime/**/*.py in staged-file pipeline.
Image Build Optimization
packages/modal-infra/src/images/base.py
Updates CACHE_BUSTER to v47-prestage-opencode. Embeds sandbox runtime into image and pre-stages .opencode-compatible directory structure (/app/opencode-stage/) containing tool/skill files, manifests, and node_modules symlink, enabling fast-path runtime installation.
Runtime Startup Refactoring
packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py
Introduces fast-path installation (_install_from_stage()) when pre-staged directory exists; falls back to legacy tool/deps installation (_install_tools_legacy()) otherwise. Adds _start_sidecars() to launch code_server and ttyd concurrently, with code_server and OpenCode started concurrently in from_repo_image mode. Reduces health-check polling interval from 0.5s to 0.1s.

Sequence Diagram

sequenceDiagram
    participant DockerBuild as Docker Build
    participant ImageLayer as Image Layer
    participant RuntimeStart as Runtime Startup
    participant FastPath as Fast Install (Stage)
    participant LegacyPath as Legacy Install
    participant Sidecars as Sidecars<br/>(code_server, ttyd)

    DockerBuild->>ImageLayer: Pre-stage /app/opencode-stage/<br/>(tools, skills, node_modules)
    ImageLayer->>ImageLayer: Cache v47-prestage-opencode

    RuntimeStart->>RuntimeStart: Check for /app/opencode-stage/
    alt Stage Directory Exists
        RuntimeStart->>FastPath: Call _install_from_stage()
        FastPath->>FastPath: Copy manifests + symlink node_modules
        FastPath->>RuntimeStart: ✓ Fast path complete
    else Stage Directory Missing
        RuntimeStart->>LegacyPath: Call _install_tools_legacy()
        LegacyPath->>LegacyPath: Copy tools + deps per-file
        LegacyPath->>RuntimeStart: ✓ Legacy path complete
    end

    par Concurrent Startup
        RuntimeStart->>Sidecars: _start_sidecars() (concurrent)
        RuntimeStart->>RuntimeStart: start_opencode() (concurrent)
    end

    Sidecars->>Sidecars: code_server + ttyd running
    RuntimeStart->>RuntimeStart: OpenCode initialized
Loading

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly Related PRs

Poem

🐰 A rabbit's refrain:

Pre-staged deps now baked in the layer,
Fast paths emerge, no need to despair!
Sidecars run wild, concurrent and spry,
Cache-busting dreams reach toward the sky. 🚀
Fallback paths whisper, "We're here if you need,"
Sandbox now sprints with optimized speed!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically identifies the main optimization focus: parallel startup and symlinked dependencies for improved sandbox boot performance.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/boot-time-optimizations

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

Terraform Validation Results

Step Status
Format
Init
Validate

Note: Terraform plan was skipped because secrets are not configured. This is expected for external contributors. See docs/GETTING_STARTED.md for setup instructions.

Pushed by: @ColeMurray, Action: pull_request

opencode_task: asyncio.Task | None = None
sidecar_task: asyncio.Task | None = None
if from_repo_image:
opencode_task = asyncio.create_task(self.start_opencode())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting start_opencode() before _update_existing_repo() now creates .opencode/ inside the repo before git checkout -B runs. That can make checkout fail with "untracked working tree files would be overwritten" for any repository that tracks .opencode/* (or adds it on the target branch), which is a regression from the previous ordering. Could we avoid writing into the repo until after checkout succeeds, or move the staged OpenCode files outside the repo root?

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. In repo_image mode _install_from_stage() writes into workdir/.opencode/ which could race with git checkout -B. Moved opencode_task to after git sync + hooks complete — we now only overlap OpenCode startup with the sidecar wait. Fixed in 9a2a313.

opencode_task: asyncio.Task | None = None
sidecar_task: asyncio.Task | None = None
if from_repo_image:
opencode_task = asyncio.create_task(self.start_opencode())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reorders the runtime contract: run_start_script() used to complete before OpenCode booted, but in the new repo-image path OpenCode now starts concurrently and may read the repo or .opencode state before the start hook has finished preparing it. That means hook side effects like generating config, writing files, or starting local services are no longer guaranteed to be visible during OpenCode startup. Can we keep the overlap to git/sidecars while preserving the start.sh -> OpenCode ordering?

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — the start hook contract should be preserved. Same fix: OpenCode now starts after hooks complete, overlapping only with sidecars. Fixed in 9a2a313.

Copy link
Copy Markdown
Contributor

@open-inspect open-inspect Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

This PR improves sandbox boot performance by pre-staging .opencode/, reducing health-check latency, and overlapping startup work. The image-stage changes look reasonable, but the new repo-image parallel startup path introduces two blocking correctness regressions around git checkout safety and the established start-hook ordering.

  • PR Title and number: perf: optimise sandbox boot time with parallel startup and symlinked deps (#506)
  • Author: @ColeMurray
  • Files changed count, additions/deletions: 6 files, +131/-36

Critical Issues

  • [Functionality & Correctness] packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py:1246 - start_opencode() now writes .opencode/ into the repo before _update_existing_repo() runs git checkout -B. That can cause checkout failures when the target branch contains .opencode/*, because git will refuse to overwrite the newly-created untracked files.
  • [Functionality & Correctness] packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py:1246 - The repo-image path now starts OpenCode before run_start_script() completes. That changes the previous contract where the runtime hook finished preparing the repo before OpenCode booted, so hook-created files, generated config, or local services are no longer guaranteed to be available during OpenCode startup.

Suggestions

  • None beyond the blocking issues above.

Nitpicks

  • None.

Positive Feedback

  • The staged .opencode layout is a nice minimal optimization, especially keeping the legacy fallback for pre-v47 images.
  • Extracting _start_sidecars() makes the lifecycle easier to follow and reduces duplication.
  • The lint-staged and ignore-list updates are small but helpful hygiene improvements alongside the runtime work.

Questions

  • None.

Verdict

Request Changes: Critical correctness issues in the new parallel repo-image startup path should be addressed before merge.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py (1)

1240-1299: ⚠️ Potential issue | 🔴 Critical

Cancel and drain background tasks on error to prevent event loop leaks and unhandled exceptions.

When run_start_script() fails or any earlier exception is raised, control jumps to the except block and then finally: await self.shutdown(). The eagerly created opencode_task and sidecar_task are never cancelled or awaited, causing:

  • Python to emit Task exception was never retrieved for any exception inside start_opencode() (e.g., _wait_for_health timing out).
  • If start_opencode() is still running, shutdown() terminates self.opencode_process while the task concurrently awaits the health endpoint — the task continues until timeout/connection-refused, leaving the event loop alive after run() returns.
  • opencode.ready may never get set but the error report path doesn't include state about whether OpenCode came up.

Implement the suggested cleanup in the finally block to cancel and drain both tasks before shutdown.

Additionally, in the from_repo_image=True boot path, start_opencode() calls _install_tools() which writes to .opencode/tool/, .opencode/skills/, etc., while _update_existing_repo() concurrently executes git checkout -B on the same repository. If any tracked file under .opencode/ exists on the target branch (unlikely but possible), this creates a race condition. Verify that .opencode/ is guaranteed gitignored in all repositories using this sandbox.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py` around lines 1240
- 1299, The created background tasks opencode_task and sidecar_task must be
cancelled and drained in the run() finally path to avoid leaked tasks/unhandled
exceptions: in the finally block, if opencode_task or sidecar_task is not None
and not done(), call cancel(), then await them (e.g., with asyncio.gather(...,
return_exceptions=True)) to retrieve and swallow expected CancelledError/other
exceptions before calling await self.shutdown(); also ensure any exceptions from
those tasks are logged or attached to the error report (include opencode.ready
state). For the from_repo_image path, either ensure .opencode/ is guaranteed
gitignored in all repos or change the startup ordering (do not call
start_opencode() until after _update_existing_repo()/git sync completes) to
avoid concurrent writes vs git checkout races.
🧹 Nitpick comments (1)
packages/sandbox-runtime/tests/test_entrypoint_build_mode.py (1)

255-277: Optional: pin the new eager-start contract with a positive assertion.

The updated docstring states start_opencode is now fired eagerly, but the test no longer asserts this. Adding supervisor.start_opencode.assert_called_once() (or .assert_called()) would lock in the concurrent-start invariant so a future refactor that accidentally reverts to sequential startup is caught here rather than only in production. Not blocking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/sandbox-runtime/tests/test_entrypoint_build_mode.py` around lines
255 - 277, Update the test_start_script_failure_is_fatal to assert that
start_opencode was invoked eagerly: after awaiting supervisor.run() add
supervisor.start_opencode.assert_called_once() (or .assert_called()) alongside
the existing fatal-error and start_bridge assertions so the test verifies the
concurrent-start contract for the supervisor object and its start_opencode
method.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/modal-infra/src/images/base.py`:
- Around line 183-195: The shell fallbacks "|| true" in the .run_commands(...)
block mask missing glob matches for "/app/sandbox_runtime/tools/*.js" and
"/app/sandbox_runtime/skills/*" so staging can appear present while empty;
remove the "|| true" from those cp commands so the build fails loudly when
expected inputs are missing, or alternatively add explicit existence checks
before copying (checking the glob expands or directory is non-empty) so missing
tools/skills cause an error; also consider adding a runtime check in
_install_from_stage() to verify stage_dir/tool and stage_dir/skills are
non-empty and call _install_tools_legacy() if they are empty.

In `@packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py`:
- Around line 311-341: The fast-path installer _install_from_stage currently
performs symlink_to, shutil.copy2 and shutil.copytree calls without error
handling and unconditionally logs skills_path; wrap the sequence of filesystem
ops in a try/except around the sections that create
local_modules/local_tools/local_skills (references: _install_from_stage,
staged_modules/local_modules, shutil.copy2, shutil.copytree, symlink_to) so that
on any exception you (a) log the exception with details, (b) clean up partial
state in opencode_dir (remove the symlink and any partially created files/dirs
under opencode_dir) or call the legacy fallback _install_tools_legacy() to
ensure a consistent state, and (c) only emit
self.log.info("opencode.stage_installed", skills_path=...) when
local_skills.exists() (or otherwise include a boolean flag indicating absence)
so telemetry doesn’t claim a skills path that wasn’t created.

---

Outside diff comments:
In `@packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py`:
- Around line 1240-1299: The created background tasks opencode_task and
sidecar_task must be cancelled and drained in the run() finally path to avoid
leaked tasks/unhandled exceptions: in the finally block, if opencode_task or
sidecar_task is not None and not done(), call cancel(), then await them (e.g.,
with asyncio.gather(..., return_exceptions=True)) to retrieve and swallow
expected CancelledError/other exceptions before calling await self.shutdown();
also ensure any exceptions from those tasks are logged or attached to the error
report (include opencode.ready state). For the from_repo_image path, either
ensure .opencode/ is guaranteed gitignored in all repos or change the startup
ordering (do not call start_opencode() until after _update_existing_repo()/git
sync completes) to avoid concurrent writes vs git checkout races.

---

Nitpick comments:
In `@packages/sandbox-runtime/tests/test_entrypoint_build_mode.py`:
- Around line 255-277: Update the test_start_script_failure_is_fatal to assert
that start_opencode was invoked eagerly: after awaiting supervisor.run() add
supervisor.start_opencode.assert_called_once() (or .assert_called()) alongside
the existing fatal-error and start_bridge assertions so the test verifies the
concurrent-start contract for the supervisor object and its start_opencode
method.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: daa54a62-283e-4f56-9cc0-dfe37319fad5

📥 Commits

Reviewing files that changed from the base of the PR and between c6f0ea4 and 3341cc5.

📒 Files selected for processing (6)
  • .gitignore
  • eslint.config.js
  • package.json
  • packages/modal-infra/src/images/base.py
  • packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py
  • packages/sandbox-runtime/tests/test_entrypoint_build_mode.py

Comment on lines +183 to 195
.run_commands(
"mkdir -p /app/opencode-stage/tool /app/opencode-stage/skills",
# Tools: legacy plugin + new tools
"cp /app/sandbox_runtime/plugins/inspect-plugin.js"
" /app/opencode-stage/tool/create-pull-request.js || true",
"cp /app/sandbox_runtime/tools/*.js /app/opencode-stage/tool/ || true",
# Skills: copy each skill subdirectory
"cp -r /app/sandbox_runtime/skills/* /app/opencode-stage/skills/ || true",
# Deps: package.json, lockfile, and node_modules from pre-built cache
"cp /app/opencode-deps/package.json /app/opencode-deps/package-lock.json"
" /app/opencode-stage/",
"cp -a /app/opencode-deps/node_modules /app/opencode-stage/node_modules",
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Silent || true fallbacks mask incomplete staging and defeat the legacy fallback.

If /app/sandbox_runtime/tools/*.js or /app/sandbox_runtime/skills/* produces no glob matches (shell passes the literal pattern to cp), the copy fails but || true swallows the error. The stage directory still gets created, so at boot _install_tools() sees stage_dir.is_dir() and takes the fast path — _install_tools_legacy() will never trigger even when tools/skills are missing. Users will silently boot with an empty .opencode/tool/ or .opencode/skills/.

Two options:

  1. Remove || true and let the build fail loudly if required inputs are missing (preferred if these globs are expected to be non-empty in every image).
  2. Guard the globs explicitly so “no match” is distinguishable from “cp error”, e.g.:
🛡️ Proposed fix — explicit existence checks + loud failure on real errors
     .run_commands(
         "mkdir -p /app/opencode-stage/tool /app/opencode-stage/skills",
-        # Tools: legacy plugin + new tools
-        "cp /app/sandbox_runtime/plugins/inspect-plugin.js"
-        " /app/opencode-stage/tool/create-pull-request.js || true",
-        "cp /app/sandbox_runtime/tools/*.js /app/opencode-stage/tool/ || true",
-        # Skills: copy each skill subdirectory
-        "cp -r /app/sandbox_runtime/skills/* /app/opencode-stage/skills/ || true",
+        # Tools: legacy plugin + new tools (only copy if present; fail loudly on real errors)
+        "if [ -f /app/sandbox_runtime/plugins/inspect-plugin.js ]; then"
+        "   cp /app/sandbox_runtime/plugins/inspect-plugin.js"
+        "      /app/opencode-stage/tool/create-pull-request.js;"
+        " fi",
+        "shopt -s nullglob;"
+        " files=(/app/sandbox_runtime/tools/*.js);"
+        ' if [ ${`#files`[@]} -gt 0 ]; then cp "${files[@]}" /app/opencode-stage/tool/; fi',
+        # Skills: copy each skill subdirectory
+        "if [ -d /app/sandbox_runtime/skills ]; then"
+        "   cp -r /app/sandbox_runtime/skills/. /app/opencode-stage/skills/;"
+        " fi",
         # Deps: package.json, lockfile, and node_modules from pre-built cache
         "cp /app/opencode-deps/package.json /app/opencode-deps/package-lock.json"
         " /app/opencode-stage/",
         "cp -a /app/opencode-deps/node_modules /app/opencode-stage/node_modules",
     )

Alternatively, have _install_from_stage() validate that tool/ and skills/ are non-empty and fall back to _install_tools_legacy() otherwise.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/modal-infra/src/images/base.py` around lines 183 - 195, The shell
fallbacks "|| true" in the .run_commands(...) block mask missing glob matches
for "/app/sandbox_runtime/tools/*.js" and "/app/sandbox_runtime/skills/*" so
staging can appear present while empty; remove the "|| true" from those cp
commands so the build fails loudly when expected inputs are missing, or
alternatively add explicit existence checks before copying (checking the glob
expands or directory is non-empty) so missing tools/skills cause an error; also
consider adding a runtime check in _install_from_stage() to verify
stage_dir/tool and stage_dir/skills are non-empty and call
_install_tools_legacy() if they are empty.

Comment on lines +311 to +341
def _install_from_stage(self, stage_dir: Path, opencode_dir: Path) -> None:
"""Fast path: populate .opencode/ from the pre-staged image directory."""
opencode_dir.mkdir(parents=True, exist_ok=True)

# Symlink node_modules — OpenCode's Npm.install() sees the lockfile
# in sync and skips arborist reify(), so the tree is never mutated.
staged_modules = stage_dir / "node_modules"
local_modules = opencode_dir / "node_modules"
if staged_modules.is_dir() and not local_modules.exists():
local_modules.symlink_to(staged_modules)

# Copy package.json + package-lock.json (small files, needed alongside symlink)
for name in ("package.json", "package-lock.json"):
src = stage_dir / name
dest = opencode_dir / name
if src.exists() and not dest.exists():
shutil.copy2(src, dest)

# Copy tool/ directory
staged_tools = stage_dir / "tool"
local_tools = opencode_dir / "tool"
if staged_tools.is_dir() and not local_tools.exists():
shutil.copytree(staged_tools, local_tools)

# Copy skills/ directory
staged_skills = stage_dir / "skills"
local_skills = opencode_dir / "skills"
if staged_skills.is_dir() and not local_skills.exists():
shutil.copytree(staged_skills, local_skills)

self.log.info("opencode.stage_installed", skills_path=str(local_skills))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Partial staging leaves .opencode/ in an inconsistent state with no recovery, and the success log is misleading.

Two issues:

  1. No error handling on copy/symlink operations. If shutil.copytree(staged_tools, local_tools) fails (e.g. permission issue, disk full, partial stage dir), .opencode/ is left with a symlinked node_modules plus possibly tool/ but no skills/. The next boot won't re-stage (all the not X.exists() guards short-circuit), and the legacy fallback is unreachable because stage_dir.is_dir() is still true. Consider wrapping in try/except and either cleaning up partial state or falling back to _install_tools_legacy().

  2. Line 341 always logs skills_path=str(local_skills) regardless of whether staged_skills actually existed or local_skills was populated. If the stage lacks skills/ (see concern in base.py), operators will see opencode.stage_installed pointing at a path that was never created — misleading telemetry that hides the very failure mode this fast path can produce.

🛡️ Suggested hardening
     def _install_from_stage(self, stage_dir: Path, opencode_dir: Path) -> None:
         """Fast path: populate .opencode/ from the pre-staged image directory."""
         opencode_dir.mkdir(parents=True, exist_ok=True)
 
-        # Symlink node_modules — OpenCode's Npm.install() sees the lockfile
-        # in sync and skips arborist reify(), so the tree is never mutated.
-        staged_modules = stage_dir / "node_modules"
-        local_modules = opencode_dir / "node_modules"
-        if staged_modules.is_dir() and not local_modules.exists():
-            local_modules.symlink_to(staged_modules)
-
-        # Copy package.json + package-lock.json (small files, needed alongside symlink)
-        for name in ("package.json", "package-lock.json"):
-            src = stage_dir / name
-            dest = opencode_dir / name
-            if src.exists() and not dest.exists():
-                shutil.copy2(src, dest)
-
-        # Copy tool/ directory
-        staged_tools = stage_dir / "tool"
-        local_tools = opencode_dir / "tool"
-        if staged_tools.is_dir() and not local_tools.exists():
-            shutil.copytree(staged_tools, local_tools)
-
-        # Copy skills/ directory
-        staged_skills = stage_dir / "skills"
-        local_skills = opencode_dir / "skills"
-        if staged_skills.is_dir() and not local_skills.exists():
-            shutil.copytree(staged_skills, local_skills)
-
-        self.log.info("opencode.stage_installed", skills_path=str(local_skills))
+        installed = {"node_modules": False, "tool": False, "skills": False}
+        try:
+            staged_modules = stage_dir / "node_modules"
+            local_modules = opencode_dir / "node_modules"
+            if staged_modules.is_dir() and not local_modules.exists():
+                local_modules.symlink_to(staged_modules)
+                installed["node_modules"] = True
+
+            for name in ("package.json", "package-lock.json"):
+                src = stage_dir / name
+                dest = opencode_dir / name
+                if src.exists() and not dest.exists():
+                    shutil.copy2(src, dest)
+
+            for subdir in ("tool", "skills"):
+                staged = stage_dir / subdir
+                local = opencode_dir / subdir
+                if staged.is_dir() and not local.exists():
+                    shutil.copytree(staged, local)
+                    installed[subdir] = True
+        except Exception as e:
+            self.log.warn("opencode.stage_install_failed", exc=e, installed=installed)
+            self._install_tools_legacy(opencode_dir.parent)
+            return
+
+        self.log.info("opencode.stage_installed", **{f"{k}_installed": v for k, v in installed.items()})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py` around lines 311
- 341, The fast-path installer _install_from_stage currently performs
symlink_to, shutil.copy2 and shutil.copytree calls without error handling and
unconditionally logs skills_path; wrap the sequence of filesystem ops in a
try/except around the sections that create
local_modules/local_tools/local_skills (references: _install_from_stage,
staged_modules/local_modules, shutil.copy2, shutil.copytree, symlink_to) so that
on any exception you (a) log the exception with details, (b) clean up partial
state in opencode_dir (remove the symlink and any partially created files/dirs
under opencode_dir) or call the legacy fallback _install_tools_legacy() to
ensure a consistent state, and (c) only emit
self.log.info("opencode.stage_installed", skills_path=...) when
local_skills.exists() (or otherwise include a boolean flag indicating absence)
so telemetry doesn’t claim a skills path that wasn’t created.

Move OpenCode startup to after git sync + hooks complete to avoid
racing with git checkout (.opencode/ writes) and to preserve the
contract that start.sh side effects are available during OpenCode
boot. Sidecars still start concurrently with git sync since they
don't write to .opencode/ or depend on hooks.
@github-actions
Copy link
Copy Markdown

Terraform Validation Results

Step Status
Format
Init
Validate

Note: Terraform plan was skipped because secrets are not configured. This is expected for external contributors. See docs/GETTING_STARTED.md for setup instructions.

Pushed by: @ColeMurray, Action: pull_request

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py (1)

1239-1302: ⚠️ Potential issue | 🟡 Minor

sidecar_task can be orphaned on the failure path.

sidecar_task is created at line 1245 but only awaited inside the happy path at line 1292. If _update_existing_repo() raises, run_start_script() returns False (→ raise RuntimeError at 1273), or start_opencode() raises before line 1299, control jumps to the except at 1327 without ever awaiting or cancelling sidecar_task. The task will be garbage-collected while still pending, producing Task was destroyed but it is pending! warnings, and any exception from it is silently dropped. shutdown() also doesn't wait on it before terminating the sidecar processes it spawned, which can race process startup with SIGTERM.

Consider cancelling/awaiting it in finally, e.g.:

         finally:
+            if sidecar_task and not sidecar_task.done():
+                sidecar_task.cancel()
+                try:
+                    await sidecar_task
+                except (asyncio.CancelledError, Exception):
+                    pass
             await self.shutdown()

Also nit: if from_repo_image and sidecar_task: at line 1290 is redundant — sidecar_task is non-None iff from_repo_imageif sidecar_task: is sufficient and keeps the two branches symmetric.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py` around lines 1239
- 1302, The sidecar_task created by calling self._start_sidecars() can be left
running if later operations raise, so ensure it is always cleaned up: wrap the
try block so that in the finally you check if sidecar_task is not None and if
still pending cancel it and await it (catch asyncio.CancelledError and log any
exceptions), and also ensure shutdown() waits for any outstanding sidecar_task;
replace the redundant "if from_repo_image and sidecar_task:" check with "if
sidecar_task:" to keep branches symmetric and to reliably await the
already-created task when present (references: sidecar_task,
self._start_sidecars, self._update_existing_repo, self.perform_git_sync,
self.run_start_script, self.start_opencode, shutdown).
♻️ Duplicate comments (1)
packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py (1)

311-341: ⚠️ Potential issue | 🟠 Major

Fast-path install still lacks error handling and emits misleading telemetry.

Per the earlier review thread, the author agreed to add try/except + fallback to _install_tools_legacy() and to make opencode.stage_installed reflect what was actually installed, but the current code on 311–341 still performs unguarded symlink_to/copy2/copytree and unconditionally logs skills_path=str(local_skills) even when staged_skills was absent or copytree failed mid-way. A partial failure leaves .opencode/ inconsistent, subsequent boots short-circuit on the not X.exists() guards, and the legacy fallback is unreachable because stage_dir.is_dir() is still true.

Please land the agreed hardening (wrap the ops in try/except, fall back to legacy or clean up partial state on failure, and log per-subdir installed booleans instead of a potentially-nonexistent local_skills).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py` around lines 311
- 341, The _install_from_stage fast-path performs unguarded filesystem ops
(symlink_to, shutil.copy2, shutil.copytree) and always logs skills_path even on
failure; wrap the sequence in a try/except inside _install_from_stage so any
exception cleans up partially-created artifacts (remove local_modules symlink,
remove partially-copied local_tools/local_skills, remove copied package files)
and then call the fallback method _install_tools_legacy(stage_dir, opencode_dir)
before re-raising or returning; change the telemetry call
(self.log.info("opencode.stage_installed", ...)) to report booleans/flags (e.g.,
modules_installed, package_json_copied, package_lock_copied, tools_installed,
skills_installed) based on existence of local_modules/local_tools/local_skills
and copied files so it reflects what actually succeeded rather than
unconditionally emitting skills_path.
🧹 Nitpick comments (1)
packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py (1)

815-815: Extract the health-poll interval into a named constant.

The new 0.1 poll literal is a magic number on a timing-sensitive hot path and sits alongside existing named constants (HEALTH_CHECK_TIMEOUT, SIDECAR_TIMEOUT_SECONDS, etc.). Per the repo's timing conventions, define it once with a unit-suffixed name and reuse it (the same value is also hard-coded at line 704 in _wait_for_port).

     HEALTH_CHECK_TIMEOUT = 30.0
+    HEALTH_CHECK_POLL_INTERVAL_SECONDS = 0.1
...
-                await asyncio.sleep(0.1)
+                await asyncio.sleep(self.HEALTH_CHECK_POLL_INTERVAL_SECONDS)

As per coding guidelines: "Define each default timeout value exactly once as a named constant and import it everywhere rather than restating literal values" and "always encode the unit (seconds for Python…) in the variable name".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py` at line 815,
Introduce a named constant for the 0.1 second poll interval (e.g.,
HEALTH_POLL_INTERVAL_SECONDS = 0.1) in the module and replace the literal 0.1 in
the await asyncio.sleep call inside the health loop and the duplicate literal in
_wait_for_port with that constant; update any imports/usages in this file so
timing defaults are defined once, use the unit-suffixed name (SECONDS) to match
existing conventions, and ensure both the health check loop and _wait_for_port
reference the new HEALTH_POLL_INTERVAL_SECONDS constant.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py`:
- Around line 1239-1302: The sidecar_task created by calling
self._start_sidecars() can be left running if later operations raise, so ensure
it is always cleaned up: wrap the try block so that in the finally you check if
sidecar_task is not None and if still pending cancel it and await it (catch
asyncio.CancelledError and log any exceptions), and also ensure shutdown() waits
for any outstanding sidecar_task; replace the redundant "if from_repo_image and
sidecar_task:" check with "if sidecar_task:" to keep branches symmetric and to
reliably await the already-created task when present (references: sidecar_task,
self._start_sidecars, self._update_existing_repo, self.perform_git_sync,
self.run_start_script, self.start_opencode, shutdown).

---

Duplicate comments:
In `@packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py`:
- Around line 311-341: The _install_from_stage fast-path performs unguarded
filesystem ops (symlink_to, shutil.copy2, shutil.copytree) and always logs
skills_path even on failure; wrap the sequence in a try/except inside
_install_from_stage so any exception cleans up partially-created artifacts
(remove local_modules symlink, remove partially-copied local_tools/local_skills,
remove copied package files) and then call the fallback method
_install_tools_legacy(stage_dir, opencode_dir) before re-raising or returning;
change the telemetry call (self.log.info("opencode.stage_installed", ...)) to
report booleans/flags (e.g., modules_installed, package_json_copied,
package_lock_copied, tools_installed, skills_installed) based on existence of
local_modules/local_tools/local_skills and copied files so it reflects what
actually succeeded rather than unconditionally emitting skills_path.

---

Nitpick comments:
In `@packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py`:
- Line 815: Introduce a named constant for the 0.1 second poll interval (e.g.,
HEALTH_POLL_INTERVAL_SECONDS = 0.1) in the module and replace the literal 0.1 in
the await asyncio.sleep call inside the health loop and the duplicate literal in
_wait_for_port with that constant; update any imports/usages in this file so
timing defaults are defined once, use the unit-suffixed name (SECONDS) to match
existing conventions, and ensure both the health check loop and _wait_for_port
reference the new HEALTH_POLL_INTERVAL_SECONDS constant.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3c35231a-05ab-44f7-b041-6d63a60f3342

📥 Commits

Reviewing files that changed from the base of the PR and between 3341cc5 and 9a2a313.

📒 Files selected for processing (1)
  • packages/sandbox-runtime/src/sandbox_runtime/entrypoint.py

@ColeMurray ColeMurray closed this Apr 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant