This repository contains Kubernetes job definitions and Python scripts for running GPU-enabled tests and experiments on the NRP Nautilus cluster.
- jobs/: YAML job templates to submit to Kubernetes
- scripts/: Python scripts to be run inside the job containers
- Clone this repo:
    git clone https://github.com/csml-beach/nrp.git
    cd nrp
- Submit a job:
    kubectl apply -f jobs/gpu-test-git.yaml -n csml-beach
- Monitor the job:
    kubectl get pods -n csml-beach
    kubectl logs <pod-name> -n csml-beach
- Check results: Output is saved to the PVC mounted at /mnt/data, e.g., /mnt/data/output_gpu_test.py.txt
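The output paths shown in this README suggest a naming pattern of `output_<script file name>.txt` under `/mnt/data`. The sketch below encodes that inferred pattern so you can predict where to look after a job finishes. The helper name `expected_output_path` is hypothetical, and the pattern itself is an assumption drawn from the example paths, not read from the job templates.

```python
from pathlib import PurePosixPath

# Mount point of the PVC inside the job container, as stated in this README.
DATA_DIR = PurePosixPath("/mnt/data")


def expected_output_path(script_name: str) -> str:
    """Return where the output of a script is expected to land.

    Assumes the pattern output_<script file name>.txt, inferred from the
    example paths in this README (hypothetical helper, not part of the repo).
    """
    # Keep only the file name in case a path such as scripts/foo.py is passed.
    file_name = PurePosixPath(script_name).name
    return str(DATA_DIR / f"output_{file_name}.txt")


print(expected_output_path("gpu_test.py"))
print(expected_output_path("scripts/pytorch-test.py"))
```

If your output does not appear at the predicted path, check the job YAML for the actual redirect target.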
We have set up a ready-to-use PyTorch and Data Science environment utilizing the NRP scientific images.
- Job Template: jobs/pytorch-gpu-run.yaml
- Image: pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime (Docker Hub)
- Test Script: scripts/pytorch-test.py
- Ensure your script is in the scripts/ directory and pushed to GitHub.
- Update the SCRIPT_NAME env var in jobs/pytorch-gpu-run.yaml if needed.
- Submit the job:
    kubectl apply -f jobs/pytorch-gpu-run.yaml -n csml-beach
- Check the results in /mnt/data/output_pytorch-test.py.txt via the debug-shell.
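If you are writing your own script for this environment, a quick GPU sanity check is a reasonable first thing to run. The following is a minimal sketch of such a check, not the contents of scripts/pytorch-test.py. The function name `gpu_report` is hypothetical, and the sketch assumes the job captures standard output into the results file.

```python
import json


def gpu_report() -> dict:
    """Collect basic facts about the PyTorch and CUDA setup (hypothetical helper)."""
    try:
        # Present in the pytorch/pytorch image; may be missing on a laptop.
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False, "devices": []}

    cuda_ok = torch.cuda.is_available()
    # Only query device names when CUDA is actually usable.
    devices = (
        [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
        if cuda_ok
        else []
    )
    return {
        "torch_installed": True,
        "torch_version": str(torch.__version__),
        "cuda_available": cuda_ok,
        "devices": devices,
    }


if __name__ == "__main__":
    # Assumption: the job redirects stdout to the output file on the PVC.
    print(json.dumps(gpu_report(), indent=2))
```

A report with `cuda_available` set to false inside the job usually means the pod did not receive a GPU, so the resource section of the job YAML is the first place to look.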
Nautilus uses OIDC for authentication. If you get an invalid_grant error:
- Refresh your config from the Nautilus Portal.
- If you need to switch identities (e.g., from ORCID to CSULB), clear your sessions:
- Ensure you have the kubelogin plugin installed:
    brew install int128/kubelogin/kubelogin
Nautilus enforces strict CPU and memory limit-to-request ratios (usually 1:1, or up to 1.2:1).
- Tip: Set requests equal to limits to ensure your pod is scheduled without being blocked by admission controllers.
- Example:
    resources:
      limits:
        cpu: "1"
        memory: "2Gi"
      requests:
        cpu: "1"
        memory: "2Gi"
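To catch a ratio problem before submitting, you can check a resources block locally. The sketch below is a hedged illustration: the function names are hypothetical, the 1.2 ceiling is taken from the note above and may differ per namespace, and the quantity parser only handles the common suffixes listed in the code.

```python
from decimal import Decimal

# Suffix multipliers for the Kubernetes quantities this sketch understands.
_SUFFIXES = {
    "Ki": Decimal(2) ** 10,
    "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30,
    "Ti": Decimal(2) ** 40,
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9,
    "T": Decimal(10) ** 12,
}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity such as "500m" or "2Gi" to a plain number."""
    for suffix, factor in _SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


def ratio_violations(resources: dict, max_ratio: str = "1.2") -> list:
    """Return resource names whose limit exceeds max_ratio times the request."""
    limits = resources.get("limits", {})
    requests = resources.get("requests", {})
    bad = []
    for name, limit in limits.items():
        if name not in requests:
            continue  # nothing to compare against in this sketch
        if parse_quantity(limit) > parse_quantity(requests[name]) * Decimal(max_ratio):
            bad.append(name)
    return bad


matching = {
    "limits": {"cpu": "1", "memory": "2Gi"},
    "requests": {"cpu": "1", "memory": "2Gi"},
}
lopsided = {
    "limits": {"cpu": "2", "memory": "2Gi"},
    "requests": {"cpu": "500m", "memory": "2Gi"},
}
print(ratio_violations(matching))  # []
print(ratio_violations(lopsided))  # ['cpu']
```

The second case fails because a limit of 2 CPUs is four times the 500m request, well above the 1.2 ceiling assumed here.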
If a new job is stuck in ContainerCreating with a Multi-Attach error, it means a previous pod is still holding the volume on another node.
- Check for terminating pods:
    kubectl get pods -n csml-beach
- Force delete the stuck pod:
    kubectl delete pod <pod-name> -n csml-beach --force --grace-period=0
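When the namespace holds many pods, spotting the terminating one by eye can be tedious. A pod that kubectl lists as Terminating has `metadata.deletionTimestamp` set, so the JSON form of the pod list can be filtered on that field. The sketch below shows the idea; the function name and the sample pod names are made up for illustration.

```python
import json


def terminating_pods(pod_list: dict) -> list:
    """Return names of pods that have a deletion timestamp set."""
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item["metadata"].get("deletionTimestamp")
    ]


# Trimmed, made-up sample in the shape of `kubectl get pods -o json` output.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "gpu-test-abc12"}},
    {"metadata": {"name": "gpu-test-old99",
                  "deletionTimestamp": "2025-01-01T00:00:00Z"}}
  ]
}
""")
print(terminating_pods(sample))  # ['gpu-test-old99']
```

In practice you would feed it the parsed output of `kubectl get pods -n csml-beach -o json` and force delete only the pods it returns.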
The default PRP container image (gitlab-registry.nrp-nautilus.io/prp/jupyter-stack/prp) is ~16GB. It is normal for the pod to stay in ContainerCreating for several minutes while the image is pulled to a new node.