---
title: "Ray on GKE using DraNet"
date: 2025-07-14T10:10:40Z
---

To get started, follow the instructions to create a [GKE cluster with DRA
support and using DraNet](./gke-rdma.md). It is important to follow those
instructions closely, since there are multiple dependencies between the
Kubernetes API version, the RDMA NCCL installer, and the DraNet component.

The worker nodes in this configuration are a4-highgpu-8g instances, each equipped with eight NVIDIA B200 GPUs and eight RDMA-capable RoCE NICs.
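As a quick sanity check, assuming the cluster was created as described in that
guide, you can list the DRA `ResourceSlice` objects to confirm that DraNet is
publishing the RDMA NICs on the a4 nodes:

```sh
# Each a4 node should expose a slice from the dra.net driver listing its NICs
kubectl get resourceslices
```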


### Deploy RayCluster

Install Ray CRDs and the KubeRay operator:

```sh
kubectl create -k "github.com/ray-project/kuberay/ray-operator/config/default?ref=v1.4.1"
```
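Before continuing, you can verify the operator is up. A minimal check, assuming
the default kustomize install creates a `kuberay-operator` Deployment in the
current namespace:

```sh
# Wait until the KubeRay operator Deployment finishes rolling out
kubectl rollout status deployment/kuberay-operator --timeout=120s
```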

We create one `ResourceClaimTemplate` for the RDMA devices on the node, along
with a `DeviceClass` for the RDMA devices.

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: dranet
spec:
  selectors:
  - cel:
      expression: device.driver == "dra.net"
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: all-nic
spec:
  spec:
    devices:
      requests:
      - name: nic
        deviceClassName: dranet
        count: 8
        selectors:
        - cel:
            expression: device.attributes["dra.net"].rdma == true
```
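Apply the manifest; the file name below is just an example:

```sh
kubectl apply -f dranet-rdma.yaml
# Confirm the DeviceClass exists
kubectl get deviceclass dranet
```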

Until the official Ray images support NVIDIA B200 GPUs (CUDA capability
sm_100), you need to build a custom image:

```dockerfile
FROM rayproject/ray:2.47.1-py39-cu128

USER root

RUN python -m pip install --upgrade pip
# Replace the bundled CUDA 12 cupy wheel with a conda-forge build
RUN pip uninstall cupy-cuda12x -y && conda install -c conda-forge cupy

RUN pip install --no-cache-dir --force-reinstall numpy==1.26.4
RUN pip install --no-cache-dir --force-reinstall scipy==1.11.4

# Nightly PyTorch CUDA 12.8 wheels include sm_100 (B200) kernels
RUN pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128

# libnl is required by the RDMA userspace libraries
RUN apt-get update && apt-get -y install libnl-3-200 libnl-route-3-200

USER 1000
```
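Build and push the image to a registry your cluster can pull from; `REGISTRY`
below is a placeholder for your own repository:

```sh
docker build -t REGISTRY/ray:2.47.1-py39-cu128-b200 .
docker push REGISTRY/ray:2.47.1-py39-cu128-b200
```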

Install a RayCluster that uses the RDMA NICs on the worker nodes. You need to
specify several NCCL environment variables for optimal performance on the
Google Cloud RDMA network:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: a4-ray-cluster
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      spec:
        containers:
        - name: ray-head
          image: aojea/ray:2.44.1-py39-cu128
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
  workerGroupSpecs:
  - replicas: 2
    minReplicas: 0
    maxReplicas: 4
    groupName: gpu-group
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: aojea/ray:2.44.1-py39-cu128
          resources:
            limits:
              cpu: "200"
              memory: "1600Gi"
              nvidia.com/gpu: "8"
            requests:
              cpu: "120"
              memory: "1600Gi"
              nvidia.com/gpu: "8"
          env:
          - name: LD_LIBRARY_PATH
            value: /usr/local/nvidia/lib64
          - name: TORCH_DISTRIBUTED_DEBUG
            value: "INFO"
          - name: NCCL_DEBUG
            value: INFO # Or "WARN", "DEBUG", "TRACE" for more verbosity
          - name: NCCL_DEBUG_SUBSYS
            value: INIT,NET,ENV,COLL,GRAPH
          - name: NCCL_NET
            value: gIB
          - name: NCCL_CROSS_NIC
            value: "0"
          - name: NCCL_NET_GDR_LEVEL
            value: "PIX"
          - name: NCCL_P2P_NET_CHUNKSIZE
            value: "131072"
          - name: NCCL_NVLS_CHUNKSIZE
            value: "524288"
          - name: NCCL_IB_ADAPTIVE_ROUTING
            value: "1"
          - name: NCCL_IB_QPS_PER_CONNECTION
            value: "4"
          - name: NCCL_IB_TC
            value: "52"
          - name: NCCL_IB_FIFO_TC
            value: "84"
          - name: NCCL_TUNER_CONFIG_PATH
            value: "/usr/local/gib/configs/tuner_config_a4.txtpb"
          volumeMounts:
          - name: library-dir-host
            mountPath: /usr/local/nvidia
          - name: gib
            mountPath: /usr/local/gib
          - name: shared-memory
            mountPath: /dev/shm
        resourceClaims:
        - name: nics
          resourceClaimTemplateName: all-nic
        tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
        volumes:
        - name: library-dir-host
          hostPath:
            path: /home/kubernetes/bin/nvidia
        - name: gib
          hostPath:
            path: /home/kubernetes/bin/gib
        - name: shared-memory
          emptyDir:
            medium: "Memory"
            sizeLimit: 250Gi
```
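Apply the manifest; again, the file name is just an example:

```sh
kubectl apply -f raycluster.yaml
# The worker Pods can take a few minutes to pull the image and start
kubectl get raycluster a4-ray-cluster
```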

If, in the future, we want to create smaller workers that use a subset of the
GPUs on a node, we should also use the [NVIDIA GPU DRA
Driver](./nvidia-dranet.md) to ensure the allocated GPUs and NICs on the node
are aligned for optimal performance.
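As a rough sketch of that direction (a hypothetical template, not used
elsewhere in this guide; proper GPU/NIC alignment still requires the NVIDIA GPU
DRA Driver), a claim template requesting only two RDMA NICs would look like:

```sh
kubectl apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: two-nic  # hypothetical name
spec:
  spec:
    devices:
      requests:
      - name: nic
        deviceClassName: dranet
        count: 2
        selectors:
        - cel:
            expression: device.attributes["dra.net"].rdma == true
EOF
```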

Validate that the deployment is working by checking the Pod status:

```sh
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
a4-ray-cluster-gpu-group-worker-gzzt6 1/1 Running 0 8m11s 10.48.4.6 gke-dranet-aojea-dranet-a4-54bd557d-1blr <none> <none>
a4-ray-cluster-gpu-group-worker-hnsvx 1/1 Running 0 8m11s 10.48.3.6 gke-dranet-aojea-dranet-a4-54bd557d-5w4l <none> <none>
a4-ray-cluster-head 1/1 Running 0 8m11s 10.48.2.6 gke-dranet-aojea-default-pool-7abaddc3-n287 <none> <none>
```
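You can also check that each worker Pod got a `ResourceClaim` generated from
the `all-nic` template and that it was allocated:

```sh
# The claims generated for the worker Pods should show as allocated
kubectl get resourceclaims
```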

Check that the `a4-ray-cluster-head-svc` Service has been created successfully:

```sh
kubectl get services a4-ray-cluster-head-svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
a4-ray-cluster-head-svc ClusterIP None <none> 10001/TCP,8265/TCP,6379/TCP,8080/TCP 13m
```

Identify your RayCluster’s head pod:

```sh
$ export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
$ echo $HEAD_POD
a4-ray-cluster-head
```

Print the cluster resources:

```sh
$ kubectl exec -it $HEAD_POD -- python -c "import pprint; import ray; ray.init(); pprint.pprint(ray.cluster_resources(), sort_dicts=True)"

2025-07-14 10:44:41,326 INFO worker.py:1520 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2025-07-14 10:44:41,327 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 10.48.2.6:6379...
2025-07-14 10:44:41,343 INFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at 10.48.2.6:8265
{'CPU': 402.0,
'GPU': 16.0,
'accelerator_type:B200': 2.0,
'memory': 3438653071770.0,
'node:10.48.2.6': 1.0,
'node:10.48.3.6': 1.0,
'node:10.48.4.6': 1.0,
'node:__internal_head__': 1.0,
'object_store_memory': 401148243558.0}
```

Forward the port and check Ray dashboard:

```sh
kubectl port-forward svc/a4-ray-cluster-head-svc 8265:8265
Forwarding from 127.0.0.1:8265 -> 8265
Forwarding from [::1]:8265 -> 8265
Handling connection for 8265
```
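With the port-forward running, you can open http://localhost:8265 in a browser;
from another terminal you can also probe the dashboard API (the `/api/version`
endpoint is the one the job submission client uses):

```sh
# Returns the Ray version of the job submission server if the dashboard is reachable
curl -s http://localhost:8265/api/version
```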

#### GPU-to-GPU using Ray Collective Communication Library

Create a Python file named `nccl_allreduce_multigpu.py` with the following code:

```python
import ray
import torch

import ray.util.collective as collective


@ray.remote(num_gpus=8)
class Worker:
    def __init__(self):
        # One tensor per GPU: ones on even-numbered devices, twos on odd ones
        self.send_tensors = []
        self.send_tensors.append(torch.ones((4,), dtype=torch.float32, device='cuda:0'))
        self.send_tensors.append(torch.ones((4,), dtype=torch.float32, device='cuda:1') * 2)
        self.send_tensors.append(torch.ones((4,), dtype=torch.float32, device='cuda:2'))
        self.send_tensors.append(torch.ones((4,), dtype=torch.float32, device='cuda:3') * 2)
        self.send_tensors.append(torch.ones((4,), dtype=torch.float32, device='cuda:4'))
        self.send_tensors.append(torch.ones((4,), dtype=torch.float32, device='cuda:5') * 2)
        self.send_tensors.append(torch.ones((4,), dtype=torch.float32, device='cuda:6'))
        self.send_tensors.append(torch.ones((4,), dtype=torch.float32, device='cuda:7') * 2)

        self.recv = torch.zeros((4,), dtype=torch.float32, device='cuda:0')

    def setup(self, world_size, rank):
        collective.init_collective_group(world_size, rank, "nccl", "177")
        return True

    def compute(self):
        # Allreduce across all GPUs of all workers in group "177"
        collective.allreduce_multigpu(self.send_tensors, "177")

        cpu_tensors = [t.cpu() for t in self.send_tensors]

        return (
            cpu_tensors,
            self.send_tensors[0].device,
            self.send_tensors[1].device,
            self.send_tensors[2].device,
            self.send_tensors[3].device,
            self.send_tensors[4].device,
            self.send_tensors[5].device,
            self.send_tensors[6].device,
            self.send_tensors[7].device,
        )

    def destroy(self):
        collective.destroy_collective_group("177")


if __name__ == "__main__":
    ray.init(address="auto")

    num_workers = 2
    workers = []
    init_rets = []

    for i in range(num_workers):
        w = Worker.remote()
        workers.append(w)
        init_rets.append(w.setup.remote(num_workers, i))

    ray.get(init_rets)
    print("Collective groups initialized.")

    results = ray.get([w.compute.remote() for w in workers])

    print("\n--- Allreduce Results ---")
    for i, (tensors_list, *devices) in enumerate(results):
        print(f"Worker {i} results:")
        for j, tensor in enumerate(tensors_list):
            print(f"  Tensor {j} (originally on {devices[j]}): {tensor}")

    ray.get([w.destroy.remote() for w in workers])
    print("\nCollective groups destroyed.")

    ray.shutdown()
```

Create the Ray job, submitting it through the previously port-forwarded
dashboard port (in this case 8265):

```sh
$ ray job submit --address="http://localhost:8265" --runtime-env-json='{"working_dir": ".", "pip": ["torch"]}' -- python nccl_allreduce_multigpu.py
Job submission server address: http://localhost:8265
2025-07-14 17:32:08,731 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_ec361f13f7b82502.zip.
2025-07-14 17:32:08,733 INFO packaging.py:588 -- Creating a file package for local module '.'.

-------------------------------------------------------
Job 'raysubmit_QQTKZQDTDA3ifPMW' submitted successfully
-------------------------------------------------------

Next steps
Query the logs of the job:
ray job logs raysubmit_QQTKZQDTDA3ifPMW
Query the status of the job:
ray job status raysubmit_QQTKZQDTDA3ifPMW
Request the job to be stopped:
ray job stop raysubmit_QQTKZQDTDA3ifPMW

Tailing logs until the job exits (disable with --no-wait):

<snipped>

--- Allreduce Results ---
Worker 0 results:
(Worker pid=3590, ip=10.48.4.17) id=0x15b3, options=0x0, comp_mask=0x0}
(Worker pid=3590, ip=10.48.4.17) a4-ray-cluster-gpu-group-worker-pbkpw:3590:3778 [6] NCCL INFO NET/gIB: IbDev 6 Port 1 qpn 2440 se
Tensor 0 (originally on cuda:0): tensor([24., 24., 24., 24.])
Tensor 1 (originally on cuda:1): tensor([24., 24., 24., 24.])
Tensor 2 (originally on cuda:2): tensor([24., 24., 24., 24.])
Tensor 3 (originally on cuda:3): tensor([24., 24., 24., 24.])
Tensor 4 (originally on cuda:4): tensor([24., 24., 24., 24.])
Tensor 5 (originally on cuda:5): tensor([24., 24., 24., 24.])
Tensor 6 (originally on cuda:6): tensor([24., 24., 24., 24.])
Tensor 7 (originally on cuda:7): tensor([24., 24., 24., 24.])
Worker 1 results:
Tensor 0 (originally on cuda:0): tensor([24., 24., 24., 24.])
Tensor 1 (originally on cuda:1): tensor([24., 24., 24., 24.])
Tensor 2 (originally on cuda:2): tensor([24., 24., 24., 24.])
Tensor 3 (originally on cuda:3): tensor([24., 24., 24., 24.])
Tensor 4 (originally on cuda:4): tensor([24., 24., 24., 24.])
Tensor 5 (originally on cuda:5): tensor([24., 24., 24., 24.])
Tensor 6 (originally on cuda:6): tensor([24., 24., 24., 24.])
Tensor 7 (originally on cuda:7): tensor([24., 24., 24., 24.])

<snipped>
```

Since we set the informational NCCL environment variables `NCCL_DEBUG` and
`NCCL_DEBUG_SUBSYS`, we can verify in the logs that GPUDirect RDMA is being used:

```sh
# [... snipped ...]
# The gIB (InfiniBand) plugin is initialized
(Worker pid=3590, ip=10.48.4.17) a4-ray-cluster-gpu-group-worker-pbkpw:3590:3753 [2] NCCL INFO NET/gIB : Initializing gIB v1.0.6
(Worker pid=3590, ip=10.48.4.17) a4-ray-cluster-gpu-group-worker-pbkpw:3590:3753 [2] NCCL INFO Initialized NET plugin gIB

# Environment variable for GPU Direct RDMA level is detected
(Worker pid=3590, ip=10.48.4.17) a4-ray-cluster-gpu-group-worker-pbkpw:3590:3754 [3] NCCL INFO NCCL_NET_GDR_LEVEL set by environment to PIX

# NCCL confirms that GPU Direct RDMA is enabled for each HCA (NIC) and GPU pairing
(Worker pid=3590, ip=10.48.4.17) a4-ray-cluster-gpu-group-worker-pbkpw:3590:3758 [7] NCCL INFO NET/gIB : GPU Direct RDMA Enabled for HCA 0 'mlx5_0'
(Worker pid=3590, ip=10.48.4.17) a4-ray-cluster-gpu-group-worker-pbkpw:3590:3754 [3] NCCL INFO GPU Direct RDMA Enabled for GPU 7 / HCA 0 (distance 4 <= 4), read 0 mode Default

# Finally, communication channels are established using GDRDMA
(Worker pid=3590, ip=10.48.4.17) a4-ray-cluster-gpu-group-worker-pbkpw:3590:3799 [2] NCCL INFO Channel 02/0 : 10[2] -> 2[2] [receive] via NET/gIB/2/GDRDMA
(Worker pid=3590, ip=10.48.4.17) a4-ray-cluster-gpu-group-worker-pbkpw:3590:3799 [2] NCCL INFO Channel 02/0 : 2[2] -> 10[2] [send] via NET/gIB/2/GDRDMA
# [... snipped ...]
```
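To pull these lines out of a running or finished job, you can filter the job
logs directly, using the job ID from the submission above:

```sh
ray job logs raysubmit_QQTKZQDTDA3ifPMW --address http://localhost:8265 | grep -E "GPU Direct RDMA|GDRDMA"
```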