[Epic] Support RPM-Based Distributions (Rocky Linux, Fedora, RHEL) #569

@ArangoGutierrez

Summary

Extend Holodeck's support to include RPM-based Linux distributions such as Rocky Linux, Fedora, and RHEL. This involves adapting all provisioning templates to use DNF/YUM package managers and ensuring compatibility with RPM-based system conventions.

Motivation

Current Holodeck templates are Debian/Ubuntu focused, using apt-get and Debian-specific paths. Many enterprise environments use RHEL-based distributions:

  • Rocky Linux: Free RHEL-compatible enterprise Linux
  • Fedora: Cutting-edge features, good for testing newer software
  • RHEL/CentOS Stream: Enterprise deployments
  • Amazon Linux 2023: AWS-optimized, uses DNF

Supporting RPM distributions enables:

  • Enterprise environment testing
  • AWS-native Amazon Linux support
  • Broader user base coverage
  • Testing across different package ecosystems

Scope

Target Distributions

| Distribution | Priority | Package Manager | Notes |
| --- | --- | --- | --- |
| Rocky Linux 9 | High | DNF | RHEL 9 compatible |
| Rocky Linux 8 | Medium | DNF | RHEL 8 compatible |
| Fedora 40+ | Medium | DNF | Latest features |
| Amazon Linux 2023 | High | DNF | AWS default |
| RHEL 9 | Low | DNF | License required |

Components to Adapt

  1. Common functions and utilities
  2. NVIDIA Driver installation
  3. Container runtime installation (containerd, Docker, CRI-O)
  4. NVIDIA Container Toolkit installation
  5. Kubernetes installation (kubeadm)

Subtasks

Phase 1: Infrastructure

  • Implement OS family detection

    type OSFamily string
    const (
        OSFamilyDebian OSFamily = "debian"
        OSFamilyRHEL   OSFamily = "rhel"
    )
    
    func DetectOSFamily(env v1alpha1.Environment) (OSFamily, error) {
        // Detection based on:
        // 1. OS field if specified (from AMI mapping)
        // 2. AMI metadata
        // 3. Runtime detection via SSH (/etc/os-release)
        return OSFamilyDebian, nil // placeholder: default to Debian until detection is implemented
    }
  • Implement template variant selection

    type TemplateVariant interface {
        GetDriverTemplate() string
        GetContainerdTemplate() string
        GetDockerTemplate() string
        GetCRIOTemplate() string
        GetToolkitTemplate() string
        GetKubeadmTemplate() string
    }
    
    func SelectTemplateVariant(family OSFamily) TemplateVariant {
        switch family {
        case OSFamilyRHEL:
            return &RPMTemplates{}
        default:
            return &DebianTemplates{}
        }
    }

Phase 2: Common Functions (RPM)

  • Create RPM common functions template
    # CommonFunctionsRPM
    
    export HOLODECK_ENVIRONMENT=true
    
    # RPM package installation with retry
    install_packages_with_retry() {
        local max_retries=5 retry_delay=5
        local packages=("$@")
        
        for ((i=1; i<=max_retries; i++)); do
            echo "[$i/$max_retries] refreshing package cache"
            if sudo dnf makecache; then
                echo "[$i/$max_retries] installing: ${packages[*]}"
                if sudo dnf install -y "${packages[@]}"; then
                    return 0
                fi
            fi
            echo "Attempt $i failed; sleeping ${retry_delay}s" >&2
            sleep "$retry_delay"
        done
        echo "All ${max_retries} attempts failed" >&2
        return 1
    }
    
    # Generic retry function (same as Debian)
    with_retry() {
        # ... same implementation
    }

Phase 3: NVIDIA Driver (RPM)

  • Implement RPM driver template

    # NvDriverTemplateRPM
    
    # Install kernel headers and development tools
    install_packages_with_retry kernel-devel kernel-headers
    install_packages_with_retry gcc make dkms
    
    # Add NVIDIA CUDA repository
    sudo dnf config-manager --add-repo \
        https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
    
    # Install NVIDIA driver
    install_packages_with_retry cuda-drivers{{if .Branch}}-{{.Branch}}{{end}}
    
    # Load kernel module
    sudo modprobe nvidia
    
    # Start persistenced
    sudo nvidia-persistenced --persistence-mode
    
    # Verify
    nvidia-smi
  • Handle distribution-specific paths

    • RHEL 9: rhel9
    • Rocky 9: rhel9 (compatible)
    • Fedora: fedora39 or similar
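
  The path mapping above can be sketched as a small shell helper. `cuda_repo_distro` is a hypothetical name, not existing Holodeck code, and the Amazon Linux and AlmaLinux mappings are assumptions based on NVIDIA's published repo layout:

  ```shell
  # Map an /etc/os-release ID and VERSION_ID to the CUDA repo path segment,
  # e.g. https://developer.download.nvidia.com/compute/cuda/repos/<segment>/x86_64/
  cuda_repo_distro() {
      local id="$1" version="${2%%.*}"   # keep only the major version
      case "$id" in
          rhel|rocky|almalinux) echo "rhel${version}" ;;   # RHEL-compatible clones share repos
          fedora)               echo "fedora${version}" ;;
          amzn)                 echo "amzn${version}" ;;
          *)                    return 1 ;;                # unknown distro: let the caller fail loudly
      esac
  }
  ```

  For example, `cuda_repo_distro rocky 9.4` yields `rhel9`, matching the Rocky-to-RHEL compatibility note above.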

Phase 4: Container Runtime - containerd (RPM)

  • Implement RPM containerd template
    # ContainerdTemplateRPM
    
    # Add Docker repository for containerd
    sudo dnf config-manager --add-repo \
        https://download.docker.com/linux/centos/docker-ce.repo
    
    # Install containerd
    install_packages_with_retry containerd.io
    
    # Configure containerd
    sudo mkdir -p /etc/containerd
    containerd config default | sudo tee /etc/containerd/config.toml
    
    # Enable SystemdCgroup
    sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
    
    # Enable and start containerd
    sudo systemctl enable --now containerd

Phase 5: Container Runtime - Docker (RPM)

  • Implement RPM Docker template
    # DockerTemplateRPM
    
    # Add Docker repository
    sudo dnf config-manager --add-repo \
        https://download.docker.com/linux/centos/docker-ce.repo
    
    # Install Docker
    install_packages_with_retry docker-ce docker-ce-cli containerd.io
    
    # Configure Docker
    sudo mkdir -p /etc/docker
    cat <<EOF | sudo tee /etc/docker/daemon.json
    {
        "exec-opts": ["native.cgroupdriver=systemd"],
        "storage-driver": "overlay2"
    }
    EOF
    
    # Enable and start Docker
    sudo systemctl enable --now docker
    
    # Add user to docker group
    sudo usermod -aG docker $USER

Phase 6: Container Runtime - CRI-O (RPM)

  • Implement RPM CRI-O template
    # CRIOTemplateRPM
    
    OS="CentOS_9_Stream"  # Adjust based on actual distro
    VERSION="{{.Version}}"
    
    # Add CRI-O repositories
    sudo curl -L -o /etc/yum.repos.d/devel:kubic:libcontainers:stable.repo \
        https://download.opensuse.org/repositories/devel:kubic:libcontainers:stable/${OS}/devel:kubic:libcontainers:stable.repo
    
    sudo curl -L -o /etc/yum.repos.d/devel:kubic:libcontainers:stable:cri-o:${VERSION}.repo \
        https://download.opensuse.org/repositories/devel:kubic:libcontainers:stable:cri-o:${VERSION}/${OS}/devel:kubic:libcontainers:stable:cri-o:${VERSION}.repo
    
    # Install CRI-O
    install_packages_with_retry cri-o cri-tools
    
    # Enable and start CRI-O
    sudo systemctl enable --now crio

Phase 7: NVIDIA Container Toolkit (RPM)

  • Implement RPM toolkit template
    # ContainerToolkitTemplateRPM
    
    # Add NVIDIA container toolkit repository
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/libnvidia-container/${distribution}/libnvidia-container.repo | \
        sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
    
    # Install toolkit
    install_packages_with_retry nvidia-container-toolkit
    
    # Configure runtime
    sudo nvidia-ctk runtime configure --runtime={{.ContainerRuntime}} --set-as-default
    
    # Restart runtime
    sudo systemctl restart {{.ContainerRuntime}}

Phase 8: Kubernetes (RPM)

  • Implement RPM kubeadm template

    # KubeadmTemplateRPM
    
    # Disable swap
    sudo swapoff -a
    sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
    
    # Disable SELinux (or configure properly)
    sudo setenforce 0
    sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config
    
    # Load required kernel modules
    sudo modprobe overlay
    sudo modprobe br_netfilter
    
    cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
    overlay
    br_netfilter
    EOF
    
    # Configure sysctl
    cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
    net.bridge.bridge-nf-call-iptables  = 1
    net.bridge.bridge-nf-call-ip6tables = 1
    net.ipv4.ip_forward                 = 1
    EOF
    sudo sysctl --system
    
    # Add Kubernetes repository
    cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
    [kubernetes]
    name=Kubernetes
    baseurl=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/
    enabled=1
    gpgcheck=1
    gpgkey=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/repodata/repomd.xml.key
    EOF
    
    # Install kubeadm, kubelet, kubectl
    install_packages_with_retry kubelet kubeadm kubectl
    
    # Enable kubelet
    sudo systemctl enable --now kubelet
    
    # Initialize cluster
    # ... (same as Debian from here)
  • Handle SELinux considerations

    • Document SELinux implications
    • Provide both permissive and enforcing options
    • Test with SELinux in enforcing mode
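
  The permissive-vs-enforcing decision above could be centralized in a helper like the following sketch (`selinux_mode` and `selinux_action` are hypothetical names); it treats hosts without SELinux tooling, such as Debian images, as disabled:

  ```shell
  # Report the effective SELinux mode; a missing getenforce binary
  # (e.g. a Debian host without SELinux userland) counts as "Disabled".
  selinux_mode() {
      if command -v getenforce >/dev/null 2>&1; then
          getenforce
      else
          echo "Disabled"
      fi
  }

  # Decide whether provisioning should switch SELinux to permissive.
  # Prints "relax" when the mode is Enforcing, "keep" otherwise.
  selinux_action() {
      case "$(selinux_mode)" in
          Enforcing) echo "relax" ;;
          *)         echo "keep"  ;;
      esac
  }
  ```

  Gating the `setenforce 0` call on `selinux_action` keeps the enforcing-mode path testable instead of unconditionally relaxing every host.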

Phase 9: Firewall Configuration

  • Handle firewalld (common on RPM distros)
    # Open required ports for Kubernetes
    sudo firewall-cmd --permanent --add-port=6443/tcp
    sudo firewall-cmd --permanent --add-port=2379-2380/tcp
    sudo firewall-cmd --permanent --add-port=10250/tcp
    sudo firewall-cmd --permanent --add-port=10259/tcp
    sudo firewall-cmd --permanent --add-port=10257/tcp
    sudo firewall-cmd --reload
    
    # Or disable firewalld for testing environments
    # sudo systemctl disable --now firewalld

Phase 10: Testing

  • Create RPM-based test environments

    • Rocky Linux 9 AMI
    • Amazon Linux 2023 AMI
    • Fedora 40 AMI
  • Integration tests per distribution

    | Component | Rocky 9 | AL2023 | Fedora |
    | --- | --- | --- | --- |
    | NVIDIA Driver |  |  |  |
    | containerd |  |  |  |
    | Docker |  |  |  |
    | CRI-O |  |  |  |
    | Toolkit |  |  |  |
    | kubeadm |  |  |  |
  • E2E tests

    • Full stack provisioning on each distro
    • GPU workload validation
    • Kubernetes cluster functionality

Phase 11: Documentation

  • Update prerequisites

    • Document supported RPM distributions
    • Note any distribution-specific requirements
  • Create distribution-specific guides

    • Rocky Linux setup guide
    • Amazon Linux 2023 guide
    • Known limitations per distro
  • Update examples

    • Add RPM-based example configurations
    • Show OS field usage for RPM distros

Example Configurations

Rocky Linux 9

apiVersion: holodeck.nvidia.com/v1alpha1
kind: Environment
metadata:
  name: rocky-gpu
spec:
  provider: aws
  auth:
    keyName: my-key
    publicKey: ~/.ssh/id_rsa.pub
    privateKey: ~/.ssh/id_rsa
  instance:
    type: g4dn.xlarge
    region: us-west-2
    os: rocky-9
  nvidiaDriver:
    install: true
  containerRuntime:
    install: true
    name: containerd
  nvidiaContainerToolkit:
    install: true
    enableCDI: true
  kubernetes:
    install: true
    installer: kubeadm
    version: v1.31.1

Amazon Linux 2023

apiVersion: holodeck.nvidia.com/v1alpha1
kind: Environment
metadata:
  name: al2023-gpu
spec:
  provider: aws
  auth:
    keyName: my-key
    publicKey: ~/.ssh/id_rsa.pub
    privateKey: ~/.ssh/id_rsa
  instance:
    type: g4dn.xlarge
    region: us-west-2
    os: amazon-linux-2023
  nvidiaDriver:
    install: true
  containerRuntime:
    install: true
    name: containerd

Technical Considerations

Package Manager Differences

| Feature | APT (Debian) | DNF (RHEL) |
| --- | --- | --- |
| Update cache | `apt-get update` | `dnf makecache` |
| Install | `apt-get install -y` | `dnf install -y` |
| Add repo | `add-apt-repository` | `dnf config-manager --add-repo` |
| GPG keys | `apt-key add` (deprecated) | Built into repo file |
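
A thin dispatch layer over the table above keeps the templates family-agnostic. A minimal sketch, assuming hypothetical helpers `pkg_manager_for` and `refresh_cmd_for` keyed on the `/etc/os-release` `ID` field:

```shell
# Pick the package manager for a distro, keyed on /etc/os-release ID.
pkg_manager_for() {
    case "$1" in
        debian|ubuntu)                           echo "apt-get" ;;
        rhel|rocky|almalinux|centos|fedora|amzn) echo "dnf" ;;
        *) return 1 ;;
    esac
}

# Build the matching "refresh cache" command for a given manager.
refresh_cmd_for() {
    case "$1" in
        apt-get) echo "apt-get update" ;;
        dnf)     echo "dnf makecache" ;;
        *)       return 1 ;;
    esac
}
```

For example, `refresh_cmd_for "$(pkg_manager_for rocky)"` resolves to `dnf makecache`, the DNF column of the table.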

SELinux

  • Default enabled on RHEL/Rocky/Fedora
  • May need to be set to permissive for some components
  • Long-term: Proper SELinux policies for GPU workloads

Firewall

  • firewalld vs ufw
  • Port opening differs between systems
  • Consider documenting required ports

Acceptance Criteria

  • Rocky Linux 9 full provisioning works
  • Amazon Linux 2023 full provisioning works
  • All components install correctly on RPM distros
  • os: rocky-9 auto-resolves AMI and username
  • Documentation covers RPM-specific setup
  • E2E tests pass on RPM distributions

Dependencies

Labels

feature compatibility linux-support
