[Epic] Provision Core Dependencies from Multiple Sources (Package, Git, Latest) #567

@ArangoGutierrez

Description


Epic: Provision Core Dependencies from Multiple Sources

Summary

Extend Holodeck's provisioning capabilities to install all core dependencies from multiple sources:

  • (a) Distribution packages (current behavior - default)
  • (b) A specific git reference (commit, branch, or tag)
  • (c) A moving "latest" alias tracking a branch (e.g., main)

Scope

This epic covers flexible installation for:

  1. NVIDIA Driver - Support for different branches, runfile installers, or package versions
  2. Container Runtime - containerd, Docker, CRI-O from specific versions or source
  3. Kubernetes - kubeadm/kubelet/kubectl from specific commits or versions
  4. NVIDIA Container Toolkit - Covered in separate epic [Epic] NVIDIA Container Toolkit Installation from Multiple Sources #566

Motivation

Testing GPU infrastructure requires validating different component combinations:

  • Driver validation: Test specific driver branches or versions for bug fixes
  • Runtime compatibility: Verify containerd/CRI-O HEAD against stable drivers
  • Kubernetes pre-release: Test alpha/beta Kubernetes features
  • Regression testing: Bisect issues across dependency versions
  • Reproducibility: Pin exact versions for consistent test environments

Proposed Schema

apiVersion: holodeck.nvidia.com/v1alpha1
kind: Environment
spec:
  # NVIDIA Driver with source selection
  nvidiaDriver:
    install: true
    source: package | runfile | git  # default: package
    package:
      branch: "560"  # driver branch
      version: "560.35.03"  # exact version (optional)
    runfile:
      url: https://download.nvidia.com/...driver.run
      checksum: sha256:...
    git:
      repo: https://github.com/NVIDIA/open-gpu-kernel-modules.git
      ref: refs/tags/560.35.03
  
  # Container Runtime with source selection
  containerRuntime:
    install: true
    name: containerd | docker | crio
    source: package | git | latest  # default: package
    package:
      version: "1.7.23"
    git:
      repo: https://github.com/containerd/containerd.git
      ref: refs/tags/v1.7.23
    latest:
      track: main
  
  # Kubernetes with source selection
  kubernetes:
    install: true
    installer: kubeadm | kind | microk8s
    source: package | release | git  # default: release (dl.k8s.io)
    release:
      version: v1.31.1
    git:
      repo: https://github.com/kubernetes/kubernetes.git
      ref: refs/heads/master  # test latest k8s
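
The schema above implies a validation rule: the block matching the declared `source` must be set, and no other source block may be. A minimal sketch of that rule, assuming trimmed-down versions of the `SourceSpec` types sketched in Phase 1 (the `Validate` helper itself is an assumption about the eventual design, not part of the proposal):

```go
package main

import "fmt"

type SourceType string

const (
	SourcePackage SourceType = "package"
	SourceGit     SourceType = "git"
	SourceLatest  SourceType = "latest"
)

type PackageSourceSpec struct{ Version string }
type GitSourceSpec struct{ Repo, Ref string }
type LatestSourceSpec struct{ Track string }

type SourceSpec struct {
	Type    SourceType
	Package *PackageSourceSpec
	Git     *GitSourceSpec
	Latest  *LatestSourceSpec
}

// Validate enforces that exactly the block named by Type is populated,
// treating an empty Type as the schema default ("package").
func (s SourceSpec) Validate() error {
	typ := s.Type
	if typ == "" {
		typ = SourcePackage // schema default
	}
	set := map[SourceType]bool{
		SourcePackage: s.Package != nil,
		SourceGit:     s.Git != nil,
		SourceLatest:  s.Latest != nil,
	}
	if !set[typ] {
		return fmt.Errorf("source is %q but the %q block is missing", typ, typ)
	}
	for t, populated := range set {
		if populated && t != typ {
			return fmt.Errorf("source is %q but the %q block is also set", typ, t)
		}
	}
	return nil
}

func main() {
	ok := SourceSpec{Type: SourceGit, Git: &GitSourceSpec{Ref: "refs/tags/v1.7.23"}}
	bad := SourceSpec{Type: SourceGit, Package: &PackageSourceSpec{Version: "1.7.23"}}
	fmt.Println(ok.Validate() == nil, bad.Validate() == nil) // true false
}
```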

Subtasks

Phase 1: Common Infrastructure

  • Define generic source specification pattern

    type SourceSpec struct {
        Type    SourceType `json:"source,omitempty"`
        Package *PackageSourceSpec `json:"package,omitempty"`
        Git     *GitSourceSpec     `json:"git,omitempty"`
        Latest  *LatestSourceSpec  `json:"latest,omitempty"`
    }
    
    type GitSourceSpec struct {
        Repo       string            `json:"repo,omitempty"`
        Ref        string            `json:"ref"`
        Build      *BuildSpec        `json:"build,omitempty"`
        PreBuilt   *PreBuiltSpec     `json:"preBuilt,omitempty"`
    }
  • Implement generic ref resolver

    • Reusable across all components
    • Support for GitHub, GitLab, and generic git repos
    • Cache resolved refs for efficiency
  • Implement generic build infrastructure

    • Common build environment setup
    • Go, C/C++ toolchain detection
    • Build artifact management
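
One way to implement the generic ref resolver without a full clone is to parse `git ls-remote` output, which prints one `<sha>\t<ref>` pair per line for any git host. A sketch of the parsing half (the function name and the surrounding design are assumptions; running `git ls-remote` and caching its output are left out):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// ParseLsRemote maps fully qualified refs to commit SHAs from
// `git ls-remote <repo>` output, where each line is "<sha>\t<ref>".
func ParseLsRemote(out string) map[string]string {
	refs := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		sha, ref, ok := strings.Cut(sc.Text(), "\t")
		if !ok {
			continue // skip malformed lines
		}
		refs[ref] = sha
	}
	return refs
}

func main() {
	out := "abc123\trefs/heads/main\ndef456\trefs/tags/v1.7.23\n"
	refs := ParseLsRemote(out)
	fmt.Println(refs["refs/tags/v1.7.23"]) // def456
}
```

Because it only depends on the wire format of `git ls-remote`, the same parser works for GitHub, GitLab, and plain git remotes.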

Component: NVIDIA Driver

Phase 2: Driver Schema

  • Extend NVIDIADriver spec
    type NVIDIADriver struct {
        Install bool `json:"install"`
        Source  DriverSource `json:"source,omitempty"` // package, runfile, git
        
        // Package source (default)
        Package *DriverPackageSpec `json:"package,omitempty"`
        
        // Runfile source (manual installer)
        Runfile *DriverRunfileSpec `json:"runfile,omitempty"`
        
        // Git source (open-gpu-kernel-modules)
        Git *DriverGitSpec `json:"git,omitempty"`
    }
    
    type DriverPackageSpec struct {
        Branch  string `json:"branch,omitempty"`  // 560, 550, etc.
        Version string `json:"version,omitempty"` // exact version
    }
    
    type DriverRunfileSpec struct {
        URL      string `json:"url"`
        Checksum string `json:"checksum,omitempty"`
    }
    
    type DriverGitSpec struct {
        Repo string `json:"repo,omitempty"`
        Ref  string `json:"ref"`
    }
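
The schema writes checksums as `sha256:...`, so the runfile path needs to split the algorithm prefix from the digest before verification. A hedged sketch of that parsing step (`ParseChecksum` and the supported-algorithm list are illustrative assumptions):

```go
package main

import (
	"fmt"
	"strings"
)

// ParseChecksum splits an "algo:hex" checksum (as in the proposed
// DriverRunfileSpec) into algorithm and digest; a bare digest with
// no prefix is treated as sha256.
func ParseChecksum(s string) (algo, digest string, err error) {
	algo, digest, found := strings.Cut(s, ":")
	if !found {
		return "sha256", s, nil
	}
	switch algo {
	case "sha256", "sha512", "md5":
		return algo, digest, nil
	}
	return "", "", fmt.Errorf("unsupported checksum algorithm %q", algo)
}

func main() {
	algo, digest, _ := ParseChecksum("sha256:deadbeef")
	fmt.Println(algo, digest) // sha256 deadbeef
}
```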

Phase 3: Driver Installation Paths

  • Package installation (enhanced)

    • Support branch selection (560, 550, 545, etc.)
    • Support exact version pinning
    • Better error handling for unavailable versions
    # Install specific branch
    apt-get install cuda-drivers-560
    
    # Install exact version
    apt-get install cuda-drivers=560.35.03-1
  • Runfile installation

    • Download .run file from URL
    • Verify checksum
    • Silent installation with appropriate flags
    wget -O driver.run "${RUNFILE_URL}"
    # sha256sum -c expects "<hash>  <file>"; strip any "sha256:" prefix first
    echo "${CHECKSUM#sha256:}  driver.run" | sha256sum -c -
    chmod +x driver.run
    ./driver.run --silent --dkms
  • Open kernel modules build

    • Clone open-gpu-kernel-modules at ref
    • Build kernel modules
    • Install with DKMS
    # --branch takes a branch/tag name, so strip the refs/tags/ or refs/heads/ prefix
    git clone --depth 1 --branch "${REF#refs/*/}" "${REPO}" open-gpu-kernel-modules
    cd open-gpu-kernel-modules
    make modules -j$(nproc)
    make modules_install
    depmod -a
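
The clone commands in this epic take a short branch or tag name, while the schema stores fully qualified refs (`refs/tags/...`, `refs/heads/...`). A small helper could normalize between the two; the name `CloneRefArg` is illustrative, not existing Holodeck code:

```go
package main

import (
	"fmt"
	"strings"
)

// CloneRefArg turns a fully qualified git ref into the short name
// accepted by `git clone --branch` (which takes branch or tag names,
// not refs/... paths). Bare names and commit SHAs pass through
// unchanged, though SHAs need a fetch+checkout rather than --branch.
func CloneRefArg(ref string) string {
	for _, prefix := range []string{"refs/heads/", "refs/tags/"} {
		if strings.HasPrefix(ref, prefix) {
			return strings.TrimPrefix(ref, prefix)
		}
	}
	return ref
}

func main() {
	fmt.Println(CloneRefArg("refs/tags/v1.7.23")) // v1.7.23
	fmt.Println(CloneRefArg("refs/heads/main"))   // main
}
```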

Component: Container Runtime

Phase 4: Runtime Schema

  • Extend ContainerRuntime spec
    type ContainerRuntime struct {
        Install bool `json:"install"`
        Name    ContainerRuntimeName `json:"name"`
        Source  RuntimeSource `json:"source,omitempty"`
        
        Package *RuntimePackageSpec `json:"package,omitempty"`
        Git     *RuntimeGitSpec     `json:"git,omitempty"`
        Latest  *RuntimeLatestSpec  `json:"latest,omitempty"`
    }

Phase 5: Containerd Installation Paths

  • Package installation (enhanced)

    • Support version pinning
    • Support Docker's containerd.io packages
    apt-get install containerd.io=${VERSION}
  • Git/release installation

    • Download pre-built binaries from GitHub releases
    • Or build from source at specific ref
    # Pre-built
    wget https://github.com/containerd/containerd/releases/download/v${VERSION}/containerd-${VERSION}-linux-amd64.tar.gz
    tar -C /usr/local -xzf containerd-${VERSION}-linux-amd64.tar.gz
    
    # From source (requires a Go toolchain; --branch needs a short branch/tag name)
    git clone --branch "${REF#refs/*/}" https://github.com/containerd/containerd.git
    cd containerd
    make && make install
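
The pre-built download URL above follows a predictable pattern, so it can be generated programmatically to keep version and architecture consistent. A sketch, assuming the GitHub release layout shown above stays stable (the function name is illustrative):

```go
package main

import "fmt"

// ContainerdReleaseURL builds the GitHub release tarball URL used in
// the pre-built path above, for a version without the "v" prefix and
// a GOARCH-style architecture (amd64, arm64, ...).
func ContainerdReleaseURL(version, arch string) string {
	return fmt.Sprintf(
		"https://github.com/containerd/containerd/releases/download/v%s/containerd-%s-linux-%s.tar.gz",
		version, version, arch)
}

func main() {
	fmt.Println(ContainerdReleaseURL("1.7.23", "amd64"))
}
```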

Phase 6: Docker Installation Paths

  • Package installation (current)

    • Use Docker's official repository
    • Version pinning support
  • Moby from source

    • Build moby/moby at specific ref
    • Include cri-dockerd for Kubernetes compatibility

Phase 7: CRI-O Installation Paths

  • Package installation (enhanced)

    • Version pinning
    • Repository selection
  • Source build

    • Clone cri-o/cri-o at ref
    • Build with Go

Component: Kubernetes

Phase 8: Kubernetes Schema

  • Extend Kubernetes spec
    type Kubernetes struct {
        Install   bool   `json:"install"`
        Installer string `json:"installer"` // kubeadm, kind, microk8s
        Source    K8sSource `json:"source,omitempty"` // release, git
        
        // Release source (default) - from dl.k8s.io
        Release *K8sReleaseSpec `json:"release,omitempty"`
        
        // Git source - build from kubernetes/kubernetes
        Git *K8sGitSpec `json:"git,omitempty"`
    }
    
    type K8sReleaseSpec struct {
        Version string `json:"version"` // v1.31.1
    }
    
    type K8sGitSpec struct {
        Repo string `json:"repo,omitempty"`
        Ref  string `json:"ref"`  // refs/heads/master, refs/tags/v1.32.0-alpha.1
    }
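
"Better version validation" in the release path (Phase 9) could start with a simple shape check on `K8sReleaseSpec.Version` before hitting dl.k8s.io. A hedged sketch; the regexp covers stable and pre-release tags like the examples in this epic, but is an assumption rather than an exhaustive rule:

```go
package main

import (
	"fmt"
	"regexp"
)

// releaseVersion matches dl.k8s.io-style versions such as v1.31.1 or
// v1.32.0-alpha.1: a "v" prefix plus semver with an optional pre-release.
var releaseVersion = regexp.MustCompile(`^v\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?$`)

// ValidReleaseVersion reports whether a K8sReleaseSpec.Version string
// looks like a fetchable Kubernetes release tag.
func ValidReleaseVersion(v string) bool {
	return releaseVersion.MatchString(v)
}

func main() {
	fmt.Println(ValidReleaseVersion("v1.31.1"), ValidReleaseVersion("1.31")) // true false
}
```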

Phase 9: Kubernetes Installation Paths

  • Release installation (enhanced)

    • Current behavior from dl.k8s.io
    • Better version validation
  • Git source installation

    • Clone kubernetes/kubernetes at ref
    • Build kubeadm, kubelet, kubectl
    • Install binaries
    # --branch takes a branch/tag name, so strip the refs/heads/ or refs/tags/ prefix
    git clone --depth 1 --branch "${REF#refs/*/}" https://github.com/kubernetes/kubernetes.git
    cd kubernetes
    make WHAT="cmd/kubeadm cmd/kubelet cmd/kubectl"
    install -m 755 _output/bin/{kubeadm,kubelet,kubectl} /usr/local/bin/
  • Kind from git

    • Build kind at specific ref
    • Useful for testing kind changes

Phase 10: Provenance & Status

  • Track component sources in status

    status:
      components:
        driver:
          source: package
          version: "560.35.03"
          branch: "560"
        runtime:
          source: git
          name: containerd
          ref: refs/tags/v1.7.23
          commit: abc123
        toolkit:
          source: latest
          track: main
          commit: def456
        kubernetes:
          source: release
          version: v1.31.1
  • Display in CLI

    holodeck describe <instance-id>
    # ...
    # Components:
    #   NVIDIA Driver: 560.35.03 (package, branch 560)
    #   Container Runtime: containerd v1.7.23 (git, abc123)
    #   Container Toolkit: main@def456 (latest)
    #   Kubernetes: v1.31.1 (release)
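
The `holodeck describe` lines above could be rendered from the status fields with a small formatter; the struct and function here are an illustrative sketch, not the actual CLI code, and the git variant prints the ref rather than a derived version:

```go
package main

import "fmt"

// ComponentStatus is a trimmed view of the provenance fields shown
// in the status example above.
type ComponentStatus struct {
	Name    string // e.g. "containerd"
	Source  string // package | git | latest | release
	Version string // for package/release sources
	Ref     string // for git sources
	Commit  string // resolved SHA, when known
	Track   string // for latest sources
}

// Describe renders one line in the style of `holodeck describe`.
func Describe(c ComponentStatus) string {
	switch c.Source {
	case "git":
		return fmt.Sprintf("%s %s (git, %s)", c.Name, c.Ref, c.Commit)
	case "latest":
		return fmt.Sprintf("%s %s@%s (latest)", c.Name, c.Track, c.Commit)
	default: // package, release
		return fmt.Sprintf("%s %s (%s)", c.Name, c.Version, c.Source)
	}
}

func main() {
	fmt.Println(Describe(ComponentStatus{Name: "Kubernetes", Source: "release", Version: "v1.31.1"}))
}
```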

Phase 11: Testing

  • Unit tests

    • Schema validation
    • Source detection
    • Version parsing
  • Integration tests per component

    • Driver: package, runfile (if available)
    • Containerd: package, git tag
    • Kubernetes: release, git tag
  • E2E matrix

    • Common combinations
    • Edge cases (latest main branches)

Phase 12: Documentation

  • Schema reference for each component
  • Source selection guide
    • When to use each source type
    • Trade-offs (reproducibility vs freshness)
  • Troubleshooting
    • Build failures
    • Version incompatibilities

Example Configurations

All Latest (Testing Bleeding Edge)

spec:
  nvidiaDriver:
    install: true
    source: git
    git:
      ref: refs/heads/main  # open-gpu-kernel-modules main
  containerRuntime:
    install: true
    name: containerd
    source: latest
    latest:
      track: main
  nvidiaContainerToolkit:
    install: true
    source: latest
    latest:
      track: main
  kubernetes:
    install: true
    source: git
    git:
      ref: refs/heads/master  # k8s master

All Pinned (Reproducible Environment)

spec:
  nvidiaDriver:
    install: true
    source: package
    package:
      version: "560.35.03"
  containerRuntime:
    install: true
    name: containerd
    source: package
    package:
      version: "1.7.20"
  nvidiaContainerToolkit:
    install: true
    source: package
    package:
      version: "1.17.3-1"
  kubernetes:
    install: true
    source: release
    release:
      version: v1.31.1

Acceptance Criteria

  • Each component supports at least package + one alternative source
  • Version pinning works for all package sources
  • Git/latest sources build successfully
  • Status shows source information for each component
  • Dependencies between components are validated
  • Documentation covers all source types

Related Issues

  • [Epic] NVIDIA Container Toolkit Installation from Multiple Sources #566

Labels

feature dependency-management flexibility
