
Supporting AMD GPUs #460

@ocaisa

We added some g4ad AWS instances to our development cluster so we could experiment more with AMD GPUs. Just wanted to document some of the steps that had to be taken here.

Installing the drivers

We started from a Rocky 9.6 image. The installation steps are reasonably straightforward but building the kernel modules is quite slow:

sudo yum update

# See https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html#rocm-installation
sudo dnf install https://repo.radeon.com/amdgpu-install/6.4.1/rhel/9.6/amdgpu-install-6.4.60401-1.el9.noarch.rpm

# Install the drivers
sudo amdgpu-install --usecase=dkms  # this takes ages
# A reboot is needed after installing the kernel module if the kernel was updated by the earlier yum update
# sudo reboot

# Test that we can see the GPU
sudo yum install amd-smi-lib libdrm libdrm-amdgpu
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/lib:/opt/rocm/lib64
/opt/rocm/bin/amd-smi list
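
A couple of optional sanity checks after the reboot (standard DKMS/kernel-module tooling, nothing ROCm-specific):

# Confirm the amdgpu DKMS module built and installed for the running kernel
dkms status
# Confirm the kernel module is actually loaded
lsmod | grep amdgpu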

Until there is support in Magic Castle for this part, you would need to bake these steps into a custom image (via the prepare4image.sh workflow).

Making Slurm aware of the GPU on the node

I did this in a very hack-ish way (acting purely on the node name), but basically you need to add the node's capabilities to gres.conf on both the management node and the compute node itself. The template gres.conf.epp was:

###########################################################
# Slurm's Generic Resource (GRES) configuration file
###########################################################
AutoDetect=off
<% $nodes.each |$name, $attr| { -%>
<% if $name =~ /rocm/ { -%>
NodeName=<%= $name %> Name=gpu Type=rocm Count=1 File=/dev/kfd
<% } elsif $attr['specs']['gpus'] > 0 { -%>
<% if $attr['specs']['mig'] and ! $attr['specs']['mig'].empty { -%>
<% $attr['specs']['mig'].map |$key, $value| { -%>
NodeName=<%= $name %> Name=gpu Type=<%= $key %> Count=<%= $value * $attr['specs']['gpus'] %> File=<%= join(range(0, $value * $attr['specs']['gpus'] - 1).map |$i| { "/dev/nvidia-mig-${key}-${i}" }, ',') %>
<% } -%>
<% } else { -%>
NodeName=<%= $name %> Name=gpu Count=<%= $attr['specs']['gpus'] %> File=<%= join(range(0, $attr['specs']['gpus'] - 1).map |$i| { "/dev/nvidia${i}" }, ',') %>
<% } -%>
<% } -%>
<% } -%>
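
For a node whose hostname matches /rocm/, this renders down to a single-GPU entry, e.g. for my node:

###########################################################
# Slurm's Generic Resource (GRES) configuration file
###########################################################
AutoDetect=off
NodeName=x86-64-rocm-zen2-node1 Name=gpu Type=rocm Count=1 File=/dev/kfd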

This automatically fixes gres.conf on the management node, but to get a gres.conf onto the compute nodes themselves I needed to tweak slurm.pp to include:

if $facts['networking']['hostname'] =~ /rocm/ {
  file { '/etc/slurm/gres.conf':
    ensure  => 'present',
    owner   => 'slurm',
    group   => 'slurm',
    content => epp('profile/slurm/gres.conf', {
      'nodes' => {
        $facts['networking']['hostname'] => {},
      },
    }),
    seltype => 'etc_t',
  }
}
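
This is enough to allow Slurm to register the node. A GPU can then be requested in the usual way, for example (a minimal interactive job; the exact partition and account flags will depend on the cluster):

srun --gres=gpu:1 --pty bash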

Inside such an interactive job you can see:

[ocaisa@x86-64-rocm-zen2-node1 ~]$ scontrol show node x86-64-rocm-zen2-node1
NodeName=x86-64-rocm-zen2-node1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=8 CPUEfctv=8 CPUTot=8 CPULoad=0.07
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:1
   NodeAddr=10.0.0.12 NodeHostName=x86-64-rocm-zen2-node1 Version=24.05.8
   OS=Linux 5.14.0-570.25.1.el9_6.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Jul 7 18:09:10 UTC 2025
   RealMemory=32768 AllocMem=28672 FreeMem=28818 Sockets=8 Boards=1
   MemSpecLimit=512
   State=ALLOCATED+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=5 Owner=N/A MCS_label=N/A
   Partitions=cpubase_bycore_b1,x86-64-rocm-zen2-node
   BootTime=2025-07-18T14:49:28 SlurmdStartTime=2025-07-18T14:52:17
   LastBusyTime=2025-07-18T14:52:17 ResumeAfterTime=None
   CfgTRES=cpu=8,mem=32G,billing=8,gres/gpu=1
   AllocTRES=cpu=8,mem=28G
   CurrentWatts=0 AveWatts=0


[ocaisa@x86-64-rocm-zen2-node1 ~]$ /opt/rocm/bin/amd-smi list
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
GPU: 0
    BDF: 0000:00:1e.0
    UUID: 73ff7362-0000-1000-802c-73466b6c6923
    KFD_ID: 21974
    NODE_ID: 1
    PARTITION_ID: 0

As you can see, it complains that the user is not a member of the groups allowed to use the GPU. One way to circumvent this is described at https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/prerequisites.html#grant-gpu-access-to-all-users-on-the-system
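
A quicker per-user fix is plain group membership (standard Linux administration, nothing ROCm-specific; it needs a fresh login to take effect, and LDAP-managed accounts would need the equivalent change in the directory):

sudo usermod -aG render,video ocaisa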

The general structure for the devices is

/dev/kfd
/dev/dri/card0
/dev/dri/renderD128
...
/dev/dri/cardN
/dev/dri/renderD<128+N>
/dev/dri/by-path/pci-<bus>:00.0-card (symlink to the matching cardN)
/dev/dri/by-path/pci-<bus>:00.0-render (symlink to the matching renderD<128+N>)

where, to my understanding, if you are computing on GPU N you only need access to /dev/dri/renderD<128+N> (plus /dev/kfd itself). Above I used /dev/kfd, which gives access to all GPUs, and since I only have one it makes no difference. If you had multiple GPUs you would probably want a gres.conf like:

Name=gpu Type=amd File=/dev/dri/renderD128
Name=gpu Type=amd File=/dev/dri/renderD129

so that Slurm can schedule the devices individually.
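
With one File= entry per device like that, a job can request a specific GPU count by type, e.g. (using the Type=amd naming above; for Slurm to actually fence off the unallocated device you would also want ConstrainDevices=yes in cgroup.conf):

srun --gres=gpu:amd:1 --pty bash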
