20 commits
0ac1fbd
fix(l7): reject duplicate Content-Length headers to prevent request s…
latenighthackathon Mar 29, 2026
94fbb64
fix(proxy): add L7 inspection to forward proxy path (#666)
latenighthackathon Mar 30, 2026
a69ef06
fix(ci): skip docs preview deploy for fork PRs (#679)
johntmyers Mar 30, 2026
c1dd81e
docs(rfc): add RFC process with draft/review/accepted lifecycle (#678)
drew Mar 30, 2026
0832f11
fix(e2e): add uv-managed python binary glob to forward proxy L7 test …
johntmyers Mar 30, 2026
38655a6
fix(l7): reject requests with both CL and TE headers in inference par…
latenighthackathon Mar 30, 2026
758c62d
fix(sandbox): handle per-path Landlock errors instead of abandoning e…
johntmyers Mar 30, 2026
8c4b172
Missed input parameter (#645)
vcorrea-ppc Mar 30, 2026
e8950e6
feat(sandbox): add L7 query parameter matchers (#617)
johntmyers Mar 30, 2026
0815f82
perf(sandbox): streaming SHA256 and spawn_blocking for identity resol…
koiker Mar 30, 2026
36329a1
feat(inference): allow setting custom inference timeout (#672)
pentschev Mar 30, 2026
ed74a19
fix(sandbox): track PTY state per SSH channel to fix terminal resize …
johntmyers Mar 30, 2026
047de66
feat(bootstrap,cli): switch GPU injection to CDI where supported (#495)
elezar Mar 31, 2026
122bc74
feat(sandbox): switch device plugin to CDI injection mode (#503)
elezar Mar 31, 2026
0eebbc8
fix(docker): restore apt cleanup chaining in cluster image (#702)
pimlock Mar 31, 2026
2538bea
fix(cluster): pass resolv-conf as kubelet arg and pin k3s image diges…
drew Mar 31, 2026
c567390
fix(cli): add Copilot variant to CliProviderType enum
Mar 31, 2026
4b8361c
feat(sandbox): L7 credential injection — query param rewriting and Ba…
htekdev Mar 26, 2026
546490d
ci: add fork release workflow for CLI binary and gateway image
htekdev Mar 28, 2026
0b2302b
chore: update Cargo.lock after merge
Mar 31, 2026
44 changes: 43 additions & 1 deletion .agents/skills/debug-openshell-cluster/SKILL.md
@@ -257,7 +257,43 @@ Look for:
- `OOMKilled` — memory limits too low
- `FailedMount` — volume issues

### Step 8: Check DNS Resolution
### Step 8: Check GPU Device Plugin and CDI (GPU gateways only)

Skip this step for non-GPU gateways.

The NVIDIA device plugin DaemonSet must be running and healthy before GPU sandboxes can be created. It uses CDI injection (`deviceListStrategy: cdi-cri`) to inject GPU devices into sandbox pods — no `runtimeClassName` is set on sandbox pods.

```bash
# DaemonSet status — numberReady must be >= 1
openshell doctor exec -- kubectl get daemonset -n nvidia-device-plugin

# Device plugin pod logs — look for "CDI" lines confirming CDI mode is active
openshell doctor exec -- kubectl logs -n nvidia-device-plugin -l app.kubernetes.io/name=nvidia-device-plugin --tail=50

# List CDI devices registered by the device plugin (requires nvidia-ctk in the cluster image).
# Device plugin CDI entries use the vendor string "k8s.device-plugin.nvidia.com" so entries
# will be prefixed "k8s.device-plugin.nvidia.com/gpu=". If the list is empty, CDI spec
# generation has not completed yet.
openshell doctor exec -- nvidia-ctk cdi list

# Verify CDI spec files were generated on the node
openshell doctor exec -- ls /var/run/cdi/

# Helm install job logs for the device plugin chart
openshell doctor exec -- kubectl -n kube-system logs -l job-name=helm-install-nvidia-device-plugin --tail=100

# Confirm a GPU sandbox pod has no runtimeClassName (CDI injection, not runtime class)
openshell doctor exec -- kubectl get pod -n openshell -o jsonpath='{range .items[*]}{.metadata.name}{" runtimeClassName="}{.spec.runtimeClassName}{"\n"}{end}'
```

Common issues:

- **DaemonSet 0/N ready**: The device plugin chart may still be deploying (k3s Helm controller can take 1–2 min) or the pod is crashing. Check pod logs.
- **`nvidia-ctk cdi list` returns no `k8s.device-plugin.nvidia.com/gpu=` entries**: CDI spec generation has not completed. The device plugin may still be starting or the `cdi-cri` strategy isn't active. Verify `deviceListStrategy: cdi-cri` is in the rendered Helm values.
- **No CDI spec files at `/var/run/cdi/`**: Same as above — device plugin hasn't written CDI specs yet.
- **`HEALTHCHECK_GPU_DEVICE_PLUGIN_NOT_READY` in health check logs**: Device plugin has no ready pods. Check DaemonSet events and pod logs.
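
One way to confirm the `cdi-cri` strategy actually made it into the rendered values is a grep over the device-plugin HelmChart manifest. This is a sketch — the helper name is ours, and the manifest path is an assumption (k3s typically renders HelmChart manifests under `/var/lib/rancher/k3s/server/manifests/` inside the cluster container); adjust for your gateway.

```bash
# Hypothetical helper — the manifest path and file name are assumptions.
check_cdi_strategy() {
  # $1: path to the rendered device-plugin HelmChart manifest
  if grep -q 'deviceListStrategy: *cdi-cri' "$1"; then
    echo "cdi-cri active"
  else
    echo "cdi-cri NOT set"
  fi
}

# On the gateway, something like:
#   openshell doctor exec -- sh -c 'grep deviceListStrategy /var/lib/rancher/k3s/server/manifests/*.yaml'
```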

### Step 9: Check DNS Resolution

DNS misconfiguration is a common root cause, especially on remote/Linux hosts:

@@ -317,6 +353,7 @@ If DNS is broken, all image pulls from the distribution registry will fail, as w
| gRPC `UNIMPLEMENTED` for newer RPCs in push mode | Helm values still point at older pulled images instead of the pushed refs | Verify rendered `openshell-helmchart.yaml` uses the expected push refs (`server`, `sandbox`, `pki-job`) and not `:latest` |
| Sandbox pods crash with `/opt/openshell/bin/openshell-sandbox: no such file or directory` | Supervisor binary missing from cluster image | The cluster image was built/published without the `supervisor-builder` target in `deploy/docker/Dockerfile.images`. Rebuild with `mise run docker:build:cluster` and recreate gateway. Bootstrap auto-detects via `HEALTHCHECK_MISSING_SUPERVISOR` marker |
| `HEALTHCHECK_MISSING_SUPERVISOR` in health check logs | `/opt/openshell/bin/openshell-sandbox` not found in gateway container | Rebuild cluster image: `mise run docker:build:cluster`, then `openshell gateway destroy <name> && openshell gateway start` |
| `nvidia-ctk cdi list` returns no `k8s.device-plugin.nvidia.com/gpu=` entries | CDI specs not yet generated by device plugin | Device plugin may still be starting; wait and retry, or check pod logs (Step 8) |

## Full Diagnostic Dump

@@ -370,4 +407,9 @@ openshell doctor exec -- ls -la /opt/openshell/bin/openshell-sandbox

echo "=== DNS Configuration ==="
openshell doctor exec -- cat /etc/rancher/k3s/resolv.conf

# GPU gateways only
echo "=== GPU Device Plugin ==="
openshell doctor exec -- kubectl get daemonset -n nvidia-device-plugin
openshell doctor exec -- nvidia-ctk cdi list
```
136 changes: 136 additions & 0 deletions .github/workflows/release-fork.yml
@@ -0,0 +1,136 @@
name: Release Fork

on:
push:
branches: [feat/credential-injection-query-param-basic-auth]
workflow_dispatch:

concurrency:
group: release-fork-${{ github.ref }}
cancel-in-progress: true

permissions:
contents: write
packages: write

env:
CARGO_TERM_COLOR: always

jobs:
build-cli:
name: Build CLI (linux-amd64)
runs-on: ubuntu-latest
timeout-minutes: 30
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0

- name: Install Rust stable
uses: dtolnay/rust-toolchain@stable

- name: Install protoc
uses: arduino/setup-protoc@v3
with:
version: "29.x"
repo-token: ${{ secrets.GITHUB_TOKEN }}

- name: Cache cargo registry and build
uses: actions/cache@v4
with:
path: |
~/.cargo/registry
~/.cargo/git
target
key: cargo-cli-${{ runner.os }}-${{ hashFiles('**/Cargo.lock') }}
restore-keys: cargo-cli-${{ runner.os }}-

- name: Build openshell CLI (release)
run: cargo build --release -p openshell-cli

- name: Package binary
run: |
mkdir -p dist
cp target/release/openshell dist/
cd dist
tar czf openshell-linux-amd64.tar.gz openshell
sha256sum openshell-linux-amd64.tar.gz > openshell-linux-amd64.tar.gz.sha256

- name: Upload artifact
uses: actions/upload-artifact@v4
with:
name: openshell-linux-amd64
path: |
dist/openshell-linux-amd64.tar.gz
dist/openshell-linux-amd64.tar.gz.sha256

build-gateway:
name: Build gateway Docker image
runs-on: ubuntu-latest
timeout-minutes: 45
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0

- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3

- name: Log in to GHCR
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}

- name: Build and push gateway image
uses: docker/build-push-action@v6
with:
context: .
file: deploy/docker/Dockerfile.images
target: gateway
platforms: linux/amd64
push: true
tags: |
ghcr.io/htekdev/openshell-gateway:latest
ghcr.io/htekdev/openshell-gateway:${{ github.sha }}
cache-from: type=gha
cache-to: type=gha,mode=max

release:
name: Create GitHub Release
needs: [build-cli]
runs-on: ubuntu-latest
timeout-minutes: 5
steps:
- name: Download CLI artifact
uses: actions/download-artifact@v4
with:
name: openshell-linux-amd64
path: dist/

- name: Create or update release
uses: softprops/action-gh-release@v2
with:
tag_name: fork-latest
name: "Fork Release (credential injection)"
body: |
Pre-built OpenShell fork with L7 credential injection including
query-param rewriting and Basic auth encoding.

Branch: `feat/credential-injection-query-param-basic-auth`
Commit: ${{ github.sha }}

**Changes:** Extends the L7 proxy to inject API credentials at the
network layer for arbitrary REST endpoints, with support for query
parameter injection and HTTP Basic authentication encoding.

**Gateway image:** `ghcr.io/htekdev/openshell-gateway:latest`
draft: false
prerelease: true
make_latest: false
files: |
dist/openshell-linux-amd64.tar.gz
dist/openshell-linux-amd64.tar.gz.sha256
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
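
Consumers of this release can verify the published checksum before unpacking the CLI. A minimal sketch (the helper name is ours; asset URLs follow GitHub's standard release pattern):

```bash
# Verify a downloaded release tarball against its published .sha256 file.
verify_release() {
  # $1: path to openshell-linux-amd64.tar.gz
  # $2: path to openshell-linux-amd64.tar.gz.sha256 (as written by `sha256sum`)
  ( cd "$(dirname "$1")" && sha256sum -c "$(basename "$2")" )
}

# Usage:
#   verify_release dist/openshell-linux-amd64.tar.gz dist/openshell-linux-amd64.tar.gz.sha256
# prints "openshell-linux-amd64.tar.gz: OK" on success.
```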
5 changes: 5 additions & 0 deletions CONTRIBUTING.md
@@ -186,9 +186,14 @@ These are the primary `mise` tasks for day-to-day development:
| `tasks/` | `mise` task definitions and build scripts |
| `deploy/` | Dockerfiles, Helm chart, Kubernetes manifests |
| `architecture/` | Architecture docs and plans |
| `rfc/` | Request for Comments proposals |
| `docs/` | User-facing documentation (Sphinx/MyST) |
| `.agents/` | Agent skills and persona definitions |

## RFCs

For cross-cutting architectural decisions, API contract changes, or process proposals that need broad consensus, use the RFC process. RFCs live in `rfc/` — copy the template, fill it in, and open a PR for discussion. See [rfc/README.md](rfc/README.md) for the full lifecycle and guidelines on when to write an RFC versus a spike issue or architecture doc.

## Documentation

If your change affects user-facing behavior (new flags, changed defaults, new features, bug fixes that contradict existing docs), update the relevant pages under `docs/` in the same PR.
1 change: 1 addition & 0 deletions Cargo.lock


2 changes: 1 addition & 1 deletion README.md
@@ -128,7 +128,7 @@ OpenShell can pass host GPUs into sandboxes for local inference, fine-tuning, or
openshell sandbox create --gpu --from [gpu-enabled-sandbox] -- claude
```

The CLI auto-bootstraps a GPU-enabled gateway on first use. GPU intent is also inferred automatically for community images with `gpu` in the name.
The CLI auto-bootstraps a GPU-enabled gateway on first use, auto-selecting CDI when available and otherwise falling back to Docker's NVIDIA GPU request path (`--gpus all`). GPU intent is also inferred automatically for community images with `gpu` in the name.

**Requirements:** NVIDIA drivers and the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) must be installed on the host. The sandbox image itself must include the appropriate GPU drivers and libraries for your workload — the default `base` image does not. See the [BYOC example](https://github.com/NVIDIA/OpenShell/tree/main/examples/bring-your-own-container) for building a custom sandbox image with GPU support.

24 changes: 16 additions & 8 deletions architecture/gateway-single-node.md
@@ -260,7 +260,7 @@ On Docker custom networks, `/etc/resolv.conf` contains `127.0.0.11` (Docker's in
2. Getting the container's `eth0` IP as a routable address.
3. Adding DNAT rules in PREROUTING to forward DNS from pod namespaces through to Docker's DNS.
4. Writing a custom resolv.conf pointing to the container IP.
5. Passing `--resolv-conf=/etc/rancher/k3s/resolv.conf` to k3s.
5. Passing `--kubelet-arg=resolv-conf=/etc/rancher/k3s/resolv.conf` to k3s.

Falls back to `8.8.8.8` / `8.8.4.4` if iptables detection fails.
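
Steps 3–5 above can be sketched as small helpers. This is a simplified sketch of the rule and file shapes, not the actual bootstrap code (the function names are ours, and the example IP is arbitrary):

```bash
# Sketch only: the real bootstrap detects interfaces and handles errors.
dnat_dns_rule() {
  # $1: container eth0 IP, $2: protocol (udp or tcp)
  echo "iptables -t nat -A PREROUTING -d $1 -p $2 --dport 53 -j DNAT --to-destination 127.0.0.11:53"
}

render_resolv_conf() {
  # $1: container eth0 IP, routable from pod network namespaces
  printf 'nameserver %s\n' "$1"
}

k3s_dns_flag() {
  echo "--kubelet-arg=resolv-conf=/etc/rancher/k3s/resolv.conf"
}

dnat_dns_rule 172.18.0.2 udp
render_resolv_conf 172.18.0.2   # contents written to /etc/rancher/k3s/resolv.conf
k3s_dns_flag                    # appended to the k3s server arguments
```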

@@ -296,25 +296,33 @@ When environment variables are set, the entrypoint modifies the HelmChart manifest

GPU support is part of the single-node gateway bootstrap path rather than a separate architecture.

- `openshell gateway start --gpu` threads a boolean deploy option through `crates/openshell-cli`, `crates/openshell-bootstrap`, and `crates/openshell-bootstrap/src/docker.rs`.
- When enabled, the cluster container is created with Docker `DeviceRequests`, which is the API equivalent of `docker run --gpus all`.
- `openshell gateway start --gpu` threads GPU device options through `crates/openshell-cli`, `crates/openshell-bootstrap`, and `crates/openshell-bootstrap/src/docker.rs`.
- When enabled, the cluster container is created with Docker `DeviceRequests`. The injection mechanism is selected based on whether CDI is enabled on the daemon (`SystemInfo.CDISpecDirs` via `GET /info`):
- **CDI enabled** (daemon reports non-empty `CDISpecDirs`): CDI device injection — `driver="cdi"` with `nvidia.com/gpu=all`. Specs are expected to be pre-generated on the host (e.g. automatically by `nvidia-cdi-refresh.service` or manually via `nvidia-ctk cdi generate`).
- **CDI not enabled**: `--gpus all` device request — `driver="nvidia"`, `count=-1`, which relies on the NVIDIA Container Runtime hook.
- `deploy/docker/Dockerfile.images` installs NVIDIA Container Toolkit packages in a dedicated Ubuntu stage and copies the runtime binaries, config, and `libnvidia-container` shared libraries into the final Ubuntu-based cluster image.
- `deploy/docker/cluster-entrypoint.sh` checks `GPU_ENABLED=true` and copies GPU-only manifests from `/opt/openshell/gpu-manifests/` into k3s's manifests directory.
- `deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml` installs the NVIDIA device plugin chart, currently pinned to `0.18.2`. NFD and GFD are disabled; the device plugin's default `nodeAffinity` (which requires `feature.node.kubernetes.io/pci-10de.present=true` or `nvidia.com/gpu.present=true` from NFD/GFD) is overridden to empty so the DaemonSet schedules on the single-node cluster without requiring those labels.
- `deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml` installs the NVIDIA device plugin chart, currently pinned to `0.18.2`. NFD and GFD are disabled; the device plugin's default `nodeAffinity` (which requires `feature.node.kubernetes.io/pci-10de.present=true` or `nvidia.com/gpu.present=true` from NFD/GFD) is overridden to empty so the DaemonSet schedules on the single-node cluster without requiring those labels. The chart is configured with `deviceListStrategy: cdi-cri` so the device plugin injects devices via direct CDI device requests in the CRI.
- k3s auto-detects `nvidia-container-runtime` on `PATH`, registers the `nvidia` containerd runtime, and creates the `nvidia` `RuntimeClass` automatically.
- The OpenShell Helm chart grants the gateway service account cluster-scoped read access to `node.k8s.io/runtimeclasses` and core `nodes` so GPU sandbox admission can verify both the `nvidia` `RuntimeClass` and allocatable GPU capacity before creating a sandbox.
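
The CDI-or-fallback decision described above can be sketched as a tiny predicate over the daemon's `CDISpecDirs`. The function name is ours; in practice the value would come from something like `docker info --format '{{json .CDISpecDirs}}'` on recent Docker daemons:

```bash
# Decide the GPU injection mechanism from SystemInfo.CDISpecDirs (as JSON).
select_gpu_injection_mode() {
  # $1: JSON value of CDISpecDirs, e.g. '["/etc/cdi","/var/run/cdi"]'
  case "$1" in
    ""|null|"[]") echo "gpus-all" ;;  # CDI not enabled: DeviceRequest driver=nvidia, count=-1
    *)            echo "cdi"      ;;  # CDI enabled: DeviceRequest driver=cdi, nvidia.com/gpu=all
  esac
}

select_gpu_injection_mode '["/etc/cdi","/var/run/cdi"]'   # -> cdi
select_gpu_injection_mode '[]'                            # -> gpus-all
```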

The runtime chain is:

```text
Host GPU drivers & NVIDIA Container Toolkit
└─ Docker: --gpus all (DeviceRequests in bollard API)
└─ Docker: DeviceRequests (CDI when enabled, --gpus all otherwise)
└─ k3s/containerd: nvidia-container-runtime on PATH -> auto-detected
└─ k8s: nvidia-device-plugin DaemonSet advertises nvidia.com/gpu
└─ Pods: request nvidia.com/gpu in resource limits
└─ Pods: request nvidia.com/gpu in resource limits (CDI injection — no runtimeClassName needed)
```

### `--gpu` flag

The `--gpu` flag on `gateway start` enables GPU passthrough. OpenShell auto-selects CDI when enabled on the daemon and falls back to Docker's NVIDIA GPU request path (`--gpus all`) otherwise.

Device injection uses CDI (`deviceListStrategy: cdi-cri`): the device plugin injects devices via direct CDI device requests in the CRI. Sandbox pods only need `nvidia.com/gpu: 1` in their resource limits, and GPU pods do not set `runtimeClassName`.

The expected smoke test is a plain pod requesting `nvidia.com/gpu: 1` without `runtimeClassName` and running `nvidia-smi`.
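
The smoke test can be written as a minimal pod manifest. A sketch follows — the CUDA image tag is an assumption; any image containing `nvidia-smi` and matching your driver works:

```bash
# Build the smoke-test manifest in a variable so it can be piped to kubectl.
# Note: no runtimeClassName — CDI injection makes it unnecessary.
manifest=$(cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: smoke
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
)
printf '%s\n' "$manifest"
# Apply and check:
#   printf '%s\n' "$manifest" | kubectl apply -f -
#   kubectl logs gpu-smoke-test
```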

## Remote Image Transfer

@@ -381,7 +389,7 @@ When `openshell sandbox create` cannot connect to a gateway (connection refused,
1. `should_attempt_bootstrap()` in `crates/openshell-cli/src/bootstrap.rs` checks the error type. It returns `true` for connectivity errors and missing default TLS materials, but `false` for TLS handshake/auth errors.
2. If running in a terminal, the user is prompted to confirm.
3. `run_bootstrap()` deploys a gateway named `"openshell"`, sets it as active, and returns fresh `TlsOptions` pointing to the newly-written mTLS certs.
4. When `sandbox create` requests GPU explicitly (`--gpu`) or infers it from an image whose final name component contains `gpu` (such as `nvidia-gpu`), the bootstrap path enables gateway GPU support before retrying sandbox creation.
4. When `sandbox create` requests GPU explicitly (`--gpu`) or infers it from an image whose final name component contains `gpu` (such as `nvidia-gpu`), the bootstrap path enables gateway GPU support before retrying sandbox creation, using the same CDI-or-fallback selection as `gateway start --gpu`.
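
The decision in step 1 can be sketched as a predicate over error kinds. This is a shell sketch of the Rust logic; the error-kind strings are our own labels, not actual types from the codebase:

```bash
should_attempt_bootstrap() {
  # $1: error kind label (hypothetical taxonomy)
  case "$1" in
    connection_refused|host_unreachable|missing_default_tls) echo true ;;   # bootstrap can help
    tls_handshake_failed|auth_rejected)                      echo false ;;  # bootstrap won't fix these
    *)                                                       echo false ;;
  esac
}

should_attempt_bootstrap connection_refused     # -> true
should_attempt_bootstrap tls_handshake_failed   # -> false
```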

## Container Environment Variables
