We added some g4ad AWS instances to our development cluster so we could experiment more with AMD GPUs. Just wanted to document some of the steps that had to be taken here.
Installing the drivers
We started from a Rocky 9.6 image. The installation steps are reasonably straightforward, but building the kernel modules is quite slow:
sudo yum update
# See https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html#rocm-installation
sudo dnf install https://repo.radeon.com/amdgpu-install/6.4.1/rhel/9.6/amdgpu-install-6.4.60401-1.el9.noarch.rpm
# Install the drivers
sudo amdgpu-install --usecase=dkms # this takes ages
# Needs a reboot after installing the kernel module if the kernel is updated during the yum update
# sudo reboot
# Test that we can see the GPU
sudo yum install amd-smi-lib libdrm libdrm-amdgpu
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/lib:/opt/rocm/lib64
/opt/rocm/bin/amd-smi list
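A quick sanity check that the DKMS build succeeded and the driver is actually loaded (nothing here is specific to this setup beyond the module name):
# Check the DKMS build status of the amdgpu module
dkms status
# Confirm the module is loaded and see the kernel initialising the GPU
lsmod | grep amdgpu
sudo dmesg | grep -i amdgpu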
Until there is support in Magic Castle for this part, you would need to create a custom image using this approach (via the prepare4image.sh workflow).
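For reference, the manual equivalent of that workflow is roughly: boot an instance from the base Rocky 9.6 image, run the installation steps above, and snapshot the result. A hedged sketch using the AWS CLI (the instance ID and names are placeholders; prepare4image.sh presumably does extra clean-up before the snapshot):
# Snapshot the prepared instance into a reusable AMI (ID and names are placeholders)
aws ec2 create-image \
  --instance-id i-0123456789abcdef0 \
  --name rocky-9.6-rocm-6.4.1 \
  --description "Rocky 9.6 with amdgpu DKMS driver and ROCm 6.4.1"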
Making Slurm aware of the GPU on the node
I did this in a very hack-ish way (acting based only on the node name), but basically you need to add the node's capabilities to the gres.conf on the management node and on the compute node itself. The template gres.conf.epp was
###########################################################
# Slurm's Generic Resource (GRES) configuration file
###########################################################
AutoDetect=off
<% $nodes.each |$name, $attr| { -%>
<% if $name =~ /rocm/ { -%>
NodeName=<%= $name %> Name=gpu Type=rocm Count=1 File=/dev/kfd
<% } elsif $attr['specs']['gpus'] > 0 { -%>
<% if $attr['specs']['mig'] and ! $attr['specs']['mig'].empty { -%>
<% $attr['specs']['mig'].map |$key, $value| { -%>
NodeName=<%= $name %> Name=gpu Type=<%= $key %> Count=<%= $value * $attr['specs']['gpus'] %> File=<%= join(range(0, $value * $attr['specs']['gpus'] - 1).map |$i| { "/dev/nvidia-mig-${key}-${i}" }, ',') %>
<% } -%>
<% } else { -%>
NodeName=<%= $name %> Name=gpu Count=<%= $attr['specs']['gpus'] %> File=<%= join(range(0, $attr['specs']['gpus'] - 1).map |$i| { "/dev/nvidia${i}" }, ',') %>
<% } -%>
<% } -%>
<% } -%>
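For the compute node used below, that template renders to something along the lines of:
###########################################################
# Slurm's Generic Resource (GRES) configuration file
###########################################################
AutoDetect=off
NodeName=x86-64-rocm-zen2-node1 Name=gpu Type=rocm Count=1 File=/dev/kfd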
This automatically fixes gres.conf on the management node, but to get a gres.conf for the compute nodes I needed to tweak slurm.pp to include
if $facts['networking']['hostname'] =~ /rocm/ {
file { '/etc/slurm/gres.conf':
ensure => 'present',
owner => 'slurm',
group => 'slurm',
content => epp('profile/slurm/gres.conf', {
'nodes' => {
$facts['networking']['hostname'] => {},
},
}),
seltype => 'etc_t',
}
}
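Once the file exists on both sides, the Slurm daemons need to pick it up; a sketch of the manual steps (the Puppet-managed services may already do this on their own):
# On the management node: re-read slurm.conf and gres.conf
sudo scontrol reconfigure
# On the compute node: restart slurmd so it registers the GRES
sudo systemctl restart slurmd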
This is enough to allow Slurm to register the node, and then in an interactive job you can see
[ocaisa@x86-64-rocm-zen2-node1 ~]$ scontrol show node x86-64-rocm-zen2-node1
NodeName=x86-64-rocm-zen2-node1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=8 CPUEfctv=8 CPUTot=8 CPULoad=0.07
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:1
NodeAddr=10.0.0.12 NodeHostName=x86-64-rocm-zen2-node1 Version=24.05.8
OS=Linux 5.14.0-570.25.1.el9_6.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Jul 7 18:09:10 UTC 2025
RealMemory=32768 AllocMem=28672 FreeMem=28818 Sockets=8 Boards=1
MemSpecLimit=512
State=ALLOCATED+CLOUD ThreadsPerCore=1 TmpDisk=0 Weight=5 Owner=N/A MCS_label=N/A
Partitions=cpubase_bycore_b1,x86-64-rocm-zen2-node
BootTime=2025-07-18T14:49:28 SlurmdStartTime=2025-07-18T14:52:17
LastBusyTime=2025-07-18T14:52:17 ResumeAfterTime=None
CfgTRES=cpu=8,mem=32G,billing=8,gres/gpu=1
AllocTRES=cpu=8,mem=28G
CurrentWatts=0 AveWatts=0
[ocaisa@x86-64-rocm-zen2-node1 ~]$ /opt/rocm/bin/amd-smi list
WARNING: User is missing the following required groups: render, video. Please add user to these groups.
GPU: 0
BDF: 0000:00:1e.0
UUID: 73ff7362-0000-1000-802c-73466b6c6923
KFD_ID: 21974
NODE_ID: 1
PARTITION_ID: 0
As you can see, it complains that the user is not a member of the groups that are allowed to use the GPU. One way to circumvent this is described at https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/prerequisites.html#grant-gpu-access-to-all-users-on-the-system
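A more targeted alternative is simply adding the account to the two groups named in the warning; whether that is practical depends on how users are provisioned on the cluster (e.g. local users vs. FreeIPA):
# Add the current user to the groups that own /dev/kfd and /dev/dri/renderD*
sudo usermod -aG render,video "$USER"
# The change only takes effect in new login sessions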
The general structure for the devices is
/dev/kfd
/dev/dri/card0
/dev/dri/renderD128
...
/dev/dri/cardN
/dev/dri/renderD<128+N>
/dev/dri/by-path/pci-<bus>:00.0-card0
/dev/dri/by-path/pci-<bus>:00.0-render0
...
/dev/dri/by-path/pci-<bus>:00.0-cardN
/dev/dri/by-path/pci-<bus>:00.0-renderN
where, to my understanding, if you are computing on GPU N you only need access to /dev/dri/renderD<128+N>. Above I used /dev/kfd, which gives access to all GPUs; since I only have one, it makes no difference. If you had multiple GPUs you would probably want a gres.conf like
Name=gpu Type=amd File=/dev/dri/renderD128
Name=gpu Type=amd File=/dev/dri/renderD129
so you can schedule individual GPUs.
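With either layout, jobs can then request the GPU explicitly; for example (the Type value must match what gres.conf declares, rocm above or amd in the multi-GPU sketch):
# Request one AMD GPU in an interactive job
salloc --gres=gpu:rocm:1
# Or run a quick check against the per-render-node layout
srun --gres=gpu:amd:1 /opt/rocm/bin/amd-smi list
Note that whether a job is actually confined to its allocated device additionally depends on Slurm's cgroup device constraints (ConstrainDevices in cgroup.conf).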