Problem Statement
GPU access currently relies on the legacy nvidia-container-runtime +
nvidia-container-cli stack at two layers: once when Docker injects GPUs
into the k3s cluster container, and again when the nvidia-device-plugin +
nvidia-container-runtime inject them into individual sandbox pods.
Proposed Design
Both layers should be migrated to CDI (Container Device Interface) instead. The general idea:
- Generate a CDI spec on the host before starting the cluster:
nvidia-ctk cdi generate
- Use Docker's native CDI support (available since Docker 25) to pass GPUs
into the k3s container: --device nvidia.com/gpu=all
- Mount /etc/cdi into the k3s container, set enable_cdi_devices = true
in the containerd config, and configure the nvidia-device-plugin to use CDI
device IDs so containerd handles injection natively
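The steps above could be sketched roughly as follows. This is a dry-run outline under stated assumptions, not how OpenShell actually starts its cluster: the container name `k3s-gpu`, the image `rancher/k3s:latest`, the spec file name `nvidia.yaml`, and the helper names are placeholders chosen for illustration. The containerd fragment uses the CRI plugin section name from containerd 1.7; other containerd versions may name the section differently or enable CDI by default. Depending on the Docker version, CDI may also need to be enabled in the daemon configuration first. Switching the nvidia-device-plugin to CDI device IDs (its device list strategy setting) is not shown.

```shell
#!/bin/sh
# Sketch only: by default this prints the commands instead of running them.
# Set DRY_RUN=0 to execute them on a host with nvidia-ctk and Docker installed.
set -eu

DRY_RUN="${DRY_RUN:-1}"
CDI_DIR="${CDI_DIR:-/etc/cdi}"
K3S_IMAGE="${K3S_IMAGE:-rancher/k3s:latest}"  # assumption: placeholder image

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

cdi_migration_plan() {
    # 1. Generate the CDI spec on the host before starting the cluster.
    run nvidia-ctk cdi generate --output="$CDI_DIR/nvidia.yaml"

    # 2. Pass GPUs into the k3s container via Docker's CDI support and
    #    mount the spec directory so containerd inside can read it.
    #    (Container name and flags other than --device are hypothetical.)
    run docker run -d --name k3s-gpu --privileged \
        --device nvidia.com/gpu=all \
        -v "$CDI_DIR:$CDI_DIR:ro" \
        "$K3S_IMAGE" server

    # 3. containerd config fragment to enable CDI (containerd 1.7 layout,
    #    an assumption). It is only printed here, never applied.
    cat <<'EOF'
[plugins."io.containerd.grpc.v1.cri"]
  enable_cdi_devices = true
  cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
EOF
}

cdi_migration_plan
```

Running the script as-is only prints the plan, which makes it easy to review what would be injected before touching a real host.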
CDI is the canonical way NVIDIA supports GPU access in containerized
environments going forward. Some platforms require CDI and are incompatible
with the legacy runtime stack, so this would also broaden the set of platforms
OpenShell can run on. It also makes what gets injected explicit and
auditable via the CDI spec rather than delegating to a CLI with broad host
access.
/cc @elezar @jgehrcke
Alternatives Considered
None
Agent Investigation
No response
Checklist