fix(shipyard-neo): add readiness gate and graceful sandbox cleanup #7881

Merged
RC-CHN merged 3 commits into AstrBotDevs:master from RC-CHN:fix-shipyard-neo-readiness-gate
Apr 29, 2026

@RC-CHN RC-CHN commented Apr 29, 2026

Cold-start races in the Shipyard Neo booter can cause a persistent failure loop:

  • create_sandbox() returns while the Bay session is still STARTING.
  • _sync_skills_to_sandbox() runs immediately and fails because the session isn't ready yet.
  • The boot exception discards the booter without cleaning up the sandbox on Bay — Docker containers, volumes, and networks are leaked.
  • The next request rebuilds a fresh sandbox, and if the underlying cause persists, this repeats on every tool call.
  • Additionally, when a stale booter is evicted from session_booter (because available() returns False), shutdown() is never called, leaking both the BayClient HTTP session and the Bay sandbox resources.

Modifications

astrbot/core/computer/booters/shipyard_neo.py

  • Added _wait_until_ready(): polls sandbox.refresh() every 2s until the status is READY, and raises if the sandbox reaches FAILED or EXPIRED or if the 180s timeout elapses. In each of those failure cases, sandbox.delete() is called before raising so Bay resources are fully released.

  • Extended shutdown(*, delete_sandbox=False): when delete_sandbox=True, it calls sandbox.delete() before client.__aexit__() (order matters: the HTTP session must still be alive for the DELETE request).

astrbot/core/computer/booters/base.py

  • Changed shutdown() signature to shutdown(**kwargs) so subclasses can accept type-specific cleanup arguments without leaking sandbox concepts into the abstract interface.
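
The readiness gate described for shipyard_neo.py can be sketched roughly as below. This is a minimal illustration and not the merged code: FakeSandbox, wait_until_ready, and the constant names are hypothetical, and only the 2s poll interval, the 180s timeout, and the delete-before-raise behaviour come from the description above.

```python
import asyncio
import time

# Values from the PR description; the constant names are assumptions.
POLL_INTERVAL_S = 2.0
READY_TIMEOUT_S = 180.0
TERMINAL_STATES = {"failed", "expired"}


class FakeSandbox:
    """Hypothetical stand-in for a Bay sandbox handle (not the real SDK class)."""

    def __init__(self, statuses):
        self._statuses = list(statuses)
        self.status = "starting"
        self.deleted = False

    async def refresh(self):
        # Each refresh advances to the next scripted status, then stays there.
        if self._statuses:
            self.status = self._statuses.pop(0)

    async def delete(self):
        self.deleted = True


async def wait_until_ready(sandbox, *, poll_interval=POLL_INTERVAL_S,
                           timeout=READY_TIMEOUT_S):
    """Poll until the sandbox is ready; delete it before raising otherwise."""
    deadline = time.monotonic() + timeout
    while True:
        await sandbox.refresh()
        status = str(sandbox.status).lower()
        if status == "ready":
            return
        if status in TERMINAL_STATES:
            await sandbox.delete()  # release remote resources first
            raise RuntimeError(f"sandbox entered terminal state: {status}")
        if time.monotonic() >= deadline:
            await sandbox.delete()
            raise TimeoutError(f"sandbox not ready after {timeout}s")
        await asyncio.sleep(poll_interval)
```

Passing a small poll_interval and timeout makes the three outcomes (ready, terminal state, timeout) easy to exercise in a unit test without waiting for real sandboxes.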

astrbot/core/computer/computer_client.py

  • On stale-booter eviction: calls shutdown(delete_sandbox=True) before popping from session_booter.

  • On boot error: calls shutdown(delete_sandbox=True) so sandboxes from failed boots are deleted rather than abandoned.

  • Both guarded by booter_type == "shipyard_neo" — other booter types (local, boxlite, cua) are unaffected.

  • This is NOT a breaking change.
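
The stale-booter eviction path can be sketched as follows. FakeBooter and evict_if_stale are hypothetical names used only for illustration, and the sketch leaves out the booter_type == "shipyard_neo" guard that the PR applies. It records the cleanup order to show the sandbox delete happening while the client is still open.

```python
import asyncio


class FakeBooter:
    """Hypothetical booter that records the order of its cleanup steps."""

    def __init__(self, alive):
        self._alive = alive
        self.calls = []

    async def available(self):
        return self._alive

    async def shutdown(self, *, delete_sandbox=False, **kwargs):
        if delete_sandbox:
            # The DELETE request needs the HTTP session, so it goes first.
            self.calls.append("delete_sandbox")
        self.calls.append("close_client")


async def evict_if_stale(session_booter, session_id):
    """Shut a stale booter down (deleting its sandbox) before dropping it."""
    booter = session_booter.get(session_id)
    if booter is None or await booter.available():
        return False
    try:
        await booter.shutdown(delete_sandbox=True)
    except Exception as err:
        # Cleanup is best-effort in this sketch; eviction still proceeds.
        print(f"shutdown failed for {session_id}: {err}")
    session_booter.pop(session_id, None)
    return True
```

Swallowing the shutdown error keeps a failed cleanup from blocking the rebuild of a fresh booter, at the cost of a possible leak that then only shows up in the logs.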

Screenshots or Test Results

[Two screenshots]

Checklist

  • 😊 If there are new features added in the PR, I have discussed it with the authors through issues/emails, etc.

  • 👀 My changes have been well-tested, and "Verification Steps" and "Screenshots" have been provided above.

  • 🤓 I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations in requirements.txt and pyproject.toml.

  • 😮 My changes do not introduce malicious code.

Summary by Sourcery

Ensure Shipyard Neo sandboxes are only used after they are ready and are cleaned up reliably on shutdown and boot failures to prevent resource leaks on Bay.

Bug Fixes:

  • Add a readiness wait before initializing Shipyard Neo components so cold-start races no longer use sandboxes before they are ready.
  • Delete Shipyard Neo sandboxes on stale-booter eviction and boot errors to avoid leaking remote containers, volumes, and networks.

Enhancements:

  • Extend the booter shutdown interface to accept keyword arguments, allowing type-specific cleanup options such as deleting remote sandboxes.

@auto-assign auto-assign Bot requested review from LIghtJUNction and anka-afk April 29, 2026 01:52
@dosubot dosubot Bot added the size:L (This PR changes 100-499 lines, ignoring generated files.) and area:core (The bug / feature is about astrbot's core, backend) labels Apr 29, 2026

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 2 issues, and left some high level feedback:

  • Now that shutdown accepts **kwargs, the explicit booter_type == "shipyard_neo" branching in computer_client.get_booter is no longer necessary—calling await booter.shutdown(delete_sandbox=True) unconditionally (with subclasses that ignore the kwarg) would simplify the logic and avoid future drift between the type string and implementation.
  • Consider making the readiness timeout and poll interval in _wait_until_ready configurable (e.g., module-level constants or constructor arguments) so they can be tuned per environment without code changes, and optionally wrap the poll loop in a try/except asyncio.CancelledError to ensure the sandbox is cleaned up if the boot task is cancelled mid-wait.
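
The second suggestion (tunable interval plus cleanup on cancellation) could look roughly like the sketch below. All names here (DEFAULT_POLL_INTERVAL_S, FakeSandbox, wait_with_cancel_cleanup, demo) are hypothetical and only illustrate the pattern, not the project's actual code.

```python
import asyncio

# Assumed module-level default; the name is illustrative only.
DEFAULT_POLL_INTERVAL_S = 2.0


class FakeSandbox:
    """Hypothetical sandbox that never leaves the starting state."""

    def __init__(self):
        self.status = "starting"
        self.deleted = False

    async def refresh(self):
        pass

    async def delete(self):
        self.deleted = True


async def wait_with_cancel_cleanup(sandbox, *,
                                   poll_interval=DEFAULT_POLL_INTERVAL_S):
    """Poll for readiness; if the task is cancelled, delete the sandbox."""
    try:
        while True:
            await sandbox.refresh()
            if str(sandbox.status).lower() == "ready":
                return
            await asyncio.sleep(poll_interval)
    except asyncio.CancelledError:
        # Release the remote sandbox, then re-raise so cancellation propagates.
        await sandbox.delete()
        raise


async def demo():
    sandbox = FakeSandbox()
    task = asyncio.create_task(
        wait_with_cancel_cleanup(sandbox, poll_interval=0.01)
    )
    await asyncio.sleep(0.03)  # let the poll loop run a few iterations
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return sandbox.deleted
```

Re-raising CancelledError after the cleanup matters: swallowing it would make the cancelled boot task look as if it had completed normally.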
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="astrbot/core/computer/booters/shipyard_neo.py" line_range="475-479" />
<code_context>
+
+        while True:
+            await sandbox.refresh()
+            status = getattr(sandbox.status, "value", str(sandbox.status))
+
+            if status == "ready":
+                logger.info(
+                    "[Computer] Sandbox %s is ready (profile=%s)",
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Normalize status once (e.g., lowercasing) to be more resilient to enum/string case changes.

This relies on `status` being exactly "ready"/"failed"/"expired"; if `Sandbox.status` ever changes casing (e.g., enum `Status.READY` with value "READY"), the loop will spin until timeout instead of taking the correct branch. Normalize once to lowercase, e.g. `status = str(getattr(sandbox.status, "value", sandbox.status)).lower()`, and compare against lowercase literals.

```suggestion
        while True:
            await sandbox.refresh()
            status = str(getattr(sandbox.status, "value", sandbox.status)).lower()

            if status == "ready":
```
</issue_to_address>
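
To illustrate why the normalization helps, here is a small self-contained sketch. UpperStatus and normalize_status are hypothetical names, and the enum stands in for a possible future Sandbox.status shape rather than the real type.

```python
from enum import Enum


class UpperStatus(Enum):
    """Hypothetical enum whose values are upper-case strings."""

    READY = "READY"
    FAILED = "FAILED"


def normalize_status(status):
    """Return a lower-case string for an enum member or a plain string."""
    return str(getattr(status, "value", status)).lower()


print(normalize_status(UpperStatus.READY))  # ready
print(normalize_status("Ready"))            # ready
print(normalize_status("failed"))           # failed
```

Plain strings have no value attribute, so getattr falls back to the string itself, and both shapes end up comparable against the same lower-case literals.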

### Comment 2
<location path="astrbot/core/computer/computer_client.py" line_range="454" />
<code_context>
+            # (local, boxlite, cua, etc.) are not backed by a remote sandbox
+            # manager and don't need it.
+            try:
+                if booter_type == "shipyard_neo":
+                    await booter.shutdown(delete_sandbox=True)
+                else:
</code_context>
<issue_to_address>
**issue (complexity):** Consider moving the delete_sandbox policy into a polymorphic shutdown_with_cleanup method on the booter classes to avoid booter_type conditionals at call sites.

You can keep the new `delete_sandbox` behavior but avoid spreading `booter_type` conditionals by pushing the policy into the booter API.

For example, add a polymorphic “shutdown with cleanup” method on the base class and override it in `ShipyardNeoBooter`:

```python
# In BaseComputerBooter
class BaseComputerBooter(ABC):
    ...

    async def shutdown_with_cleanup(self) -> None:
        # Default: just shutdown, no remote sandbox semantics
        await self.shutdown()
```

```python
# In ShipyardNeoBooter
class ShipyardNeoBooter(BaseComputerBooter):
    ...

    async def shutdown_with_cleanup(self) -> None:
        # Shipyard-specific: delete remote sandbox as well
        await self.shutdown(delete_sandbox=True)
```

Then your calling code becomes uniform and loses the `booter_type` checks:

```python
# When booter is stale
if session_id in session_booter:
    booter = session_booter[session_id]
    if not await booter.available():
        try:
            await booter.shutdown_with_cleanup()
        except Exception as shutdown_err:
            logger.warning(
                "[Computer] Error shutting down stale booter for session %s: %s",
                session_id,
                shutdown_err,
            )
        session_booter.pop(session_id, None)
```

```python
# On boot error
except Exception as e:
    logger.error(f"Error booting sandbox for session {session_id}: {e}")
    try:
        await client.shutdown_with_cleanup()
    except Exception as shutdown_error:
        logger.warning(
            "Failed to shutdown sandbox after boot error for session %s: %s",
            session_id,
            shutdown_error,
        )
    raise e
```

This keeps all existing behavior (including `delete_sandbox=True` for Shipyard Neo) but confines the special case to the booter implementation instead of duplicating `booter_type == "shipyard_neo"` branches at call sites.
</issue_to_address>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a readiness gate for Shipyard Neo sandboxes, ensuring they are fully operational before use, and enhances resource management by allowing explicit sandbox deletion during shutdown. The base ComputerBooter class was updated to support flexible cleanup via keyword arguments. Review feedback suggests leveraging polymorphism in the computer_client.py logic to avoid hardcoded type checks when shutting down booters, which prevents potential resource leaks if the configuration changes. Additionally, it is recommended to update the ShipyardNeoBooter.shutdown signature to include **kwargs for consistency with the base class interface.

Comment thread astrbot/core/computer/computer_client.py
  return chosen

- async def shutdown(self) -> None:
+ async def shutdown(self, *, delete_sandbox: bool = False) -> None:

medium

The shutdown method should accept **kwargs to maintain consistency with the updated ComputerBooter base class signature. This ensures that any arbitrary keyword arguments passed by callers are handled gracefully (ignored) rather than causing a TypeError if the booter type is substituted or if the base interface is used polymorphically. Additionally, ensure this change is accompanied by corresponding unit tests.

Suggested change
- async def shutdown(self, *, delete_sandbox: bool = False) -> None:
+ async def shutdown(self, *, delete_sandbox: bool = False, **kwargs) -> None:
References
  1. New functionality or changes to logic should be accompanied by corresponding unit tests.
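
The TypeError risk mentioned in this comment can be reproduced with a short sketch. StrictBooter, TolerantBooter, and shutdown_all are hypothetical classes and helpers, not AstrBot code.

```python
import asyncio


class StrictBooter:
    """Hypothetical booter whose shutdown takes no keyword arguments."""

    async def shutdown(self) -> None:
        pass


class TolerantBooter:
    """Hypothetical booter that ignores cleanup options it does not know."""

    async def shutdown(self, **kwargs) -> None:
        pass


async def shutdown_all(booters):
    """Call shutdown(delete_sandbox=True) uniformly and record each outcome."""
    results = []
    for booter in booters:
        try:
            await booter.shutdown(delete_sandbox=True)
            results.append("ok")
        except TypeError:
            # Raised when the signature does not accept the keyword at all.
            results.append("TypeError")
    return results


print(asyncio.run(shutdown_all([StrictBooter(), TolerantBooter()])))
# ['TypeError', 'ok']
```

The unconditional call the reviewers propose is therefore only safe once every booter subclass accepts and ignores unknown keyword arguments.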

Comment on lines +527 to +530
if booter_type == "shipyard_neo":
    await client.shutdown(delete_sandbox=True)
else:
    await client.shutdown()

medium

This conditional check is redundant now that the ComputerBooter.shutdown base class supports **kwargs. Simplifying this to a single call with delete_sandbox=True improves maintainability and adheres to the goal of keeping the client logic decoupled from specific booter implementations.

                await client.shutdown(delete_sandbox=True)
References
  1. When implementing similar functionality for different cases, refactor the logic into a shared helper function or use polymorphism to avoid code duplication.

@RC-CHN RC-CHN merged commit eb69bf3 into AstrBotDevs:master Apr 29, 2026
35 of 36 checks passed
@RC-CHN RC-CHN deleted the fix-shipyard-neo-readiness-gate branch April 29, 2026 02:26