Problem Statement
When a user runs gateway stop followed by gateway start, sandbox pods are re-provisioned but all filesystem state inside them is lost. Users expect workspace files, installed packages, and other pod-local data to survive a gateway restart cycle.
There are two independent failure modes:
- Sandbox pods are ephemeral by default. Pods have no PersistentVolumeClaims, so even a simple pod reschedule (which happens on every k3s restart) loses the writable layer.
- Container recreation changes k3s node identity. When the gateway image changes between stop and start, the Docker container is recreated with a new container ID. k3s uses the container ID as its node name, so a new node is registered. clean_stale_nodes() then deletes all PVCs with node affinity for the old node, including the server's own StatefulSet PVC (openshell-data), wiping the SQLite database entirely.
Proposed Design
1. Stabilize k3s node identity across container recreations
Pass a deterministic --node-name to k3s in the cluster entrypoint script, derived from the gateway name rather than the container ID. This prevents node identity churn and stops clean_stale_nodes() from nuking PVCs when the container is recreated.
Files: deploy/docker/cluster-entrypoint.sh, crates/openshell-bootstrap/src/docker.rs
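A minimal sketch of how a deterministic node name might be derived, assuming it is computed in Rust and handed to the entrypoint script, which would then pass it to k3s via --node-name. The function name stable_node_name, the "openshell-" prefix, and the "default" fallback are hypothetical. Restricting the result to a DNS label (lowercase alphanumerics and '-', at most 63 characters) is a conservative choice, not a requirement taken from this codebase.

```rust
/// Derive a stable node name from the gateway name.
/// Hypothetical helper: the prefix and fallback are illustrative only.
fn stable_node_name(gateway_name: &str) -> String {
    const PREFIX: &str = "openshell-"; // assumed prefix, not from the source
    const MAX_LEN: usize = 63; // DNS label limit

    // Lowercase ASCII letters and digits are kept; any other run of
    // characters collapses into a single '-'.
    let mut sanitized = String::new();
    for c in gateway_name.chars() {
        let c = c.to_ascii_lowercase();
        if c.is_ascii_lowercase() || c.is_ascii_digit() {
            sanitized.push(c);
        } else if !sanitized.ends_with('-') {
            sanitized.push('-');
        }
    }

    // A label must start and end with an alphanumeric character.
    let sanitized = sanitized.trim_matches('-');
    let body = if sanitized.is_empty() { "default" } else { sanitized };

    let mut name = format!("{PREFIX}{body}");
    name.truncate(MAX_LEN); // safe: the string is pure ASCII at this point
    name.trim_end_matches('-').to_string()
}
```

Because the result depends only on the gateway name, recreating the container with a new ID would yield the same node name, so no new node is registered.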
2. Add a default workspace PVC to sandbox pods
Automatically include a volumeClaimTemplate in the sandbox pod spec so that the sandbox's home/workspace directory is backed by persistent storage. The Sandbox CRD already supports volumeClaimTemplates — this just needs to be populated by default during sandbox creation.
Files: crates/openshell-server/src/sandbox/mod.rs
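One possible defaulting policy, sketched under assumptions: if the caller supplies any volumeClaimTemplates they pass through unchanged, otherwise a single workspace claim is added. The VolumeClaimTemplate struct here is a simplified stand-in, not the type generated from datamodel.proto, and the claim name "workspace", the "1Gi" size, and the ReadWriteOnce access mode are placeholder values.

```rust
/// Simplified stand-in for a volume claim template (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct VolumeClaimTemplate {
    name: String,
    storage: String,     // requested size, e.g. "1Gi"
    access_mode: String, // e.g. "ReadWriteOnce"
}

/// Placeholder name for the default claim; not taken from the source.
const DEFAULT_WORKSPACE_CLAIM: &str = "workspace";

/// Return the templates to put in the sandbox spec.
/// User-supplied templates are passed through untouched; the default
/// workspace claim is only added when none were provided.
fn with_default_workspace(
    user_templates: Vec<VolumeClaimTemplate>,
) -> Vec<VolumeClaimTemplate> {
    if !user_templates.is_empty() {
        return user_templates;
    }
    vec![VolumeClaimTemplate {
        name: DEFAULT_WORKSPACE_CLAIM.to_string(),
        storage: "1Gi".to_string(),
        access_mode: "ReadWriteOnce".to_string(),
    }]
}
```

Whether explicit templates should replace the default or be merged with it is a design decision this sketch does not settle; it simply picks replacement so that existing passthrough behavior is unchanged.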
Agent Investigation
Explored the full gateway lifecycle and sandbox provisioning code:
- gateway stop only calls docker stop (docker.rs:878), preserving container + volume + network.
- gateway start calls ensure_network(), which always destroys/recreates the Docker bridge, then ensure_container(), which reuses or recreates the container depending on image match (docker.rs:473-567).
- On container recreate, clean_stale_nodes() (runtime.rs:379-519) deletes NotReady nodes, terminating pods, and PVCs with stale node affinity, including the server's own database PVC.
- Sandbox pods are created with only a hostPath (supervisor binary, read-only) and optional TLS secret volume (sandbox/mod.rs:645-661). No PVCs by default.
- The Sandbox CRD supports volumeClaimTemplates (datamodel.proto:47) but OpenShell doesn't populate this field.
- The k3s node name defaults to the container hostname (= container ID), which changes on container recreation.
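The interaction between node naming and PVC cleanup can be illustrated with a toy model. This is not the code in clean_stale_nodes(); the Pvc struct, the stale_pvcs function, and the node names used with it are hypothetical. It only reproduces the rule described above: a PVC is considered stale when its node affinity names a node that is no longer registered.

```rust
use std::collections::HashSet;

/// Toy representation of a PVC pinned to a node (hypothetical shape).
struct Pvc {
    name: String,
    node_affinity: String,
}

/// Names of PVCs whose node affinity points at a node that is not in
/// the set of currently registered nodes.
fn stale_pvcs<'a>(pvcs: &'a [Pvc], live_nodes: &HashSet<&str>) -> Vec<&'a str> {
    pvcs.iter()
        .filter(|p| !live_nodes.contains(p.node_affinity.as_str()))
        .map(|p| p.name.as_str())
        .collect()
}
```

With a node named after the container ID, recreating the container replaces the only live node, so a PVC such as openshell-data pinned to the old name is selected for deletion. With a name derived from the gateway, the live set is unchanged across recreation and nothing is selected.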
Definition of Done
- clean_stale_nodes() no longer deletes PVCs unnecessarily after image upgrades
- volumeClaimTemplates passthrough still works