
Distributed Data Parallel Training with Multi-GPU on HiPerGator-AI

  • Yunchao Yang
  • UF Research Computing

The series starts with a non-distributed script that runs on a single GPU and incrementally updates it, ending with multinode training on a Slurm cluster. The code is forked and adapted from the PyTorch DDP tutorial series at https://pytorch.org/tutorials/beginner/ddp_series_intro.html.

How to use this repo

Step 0: Navigate to ood.rc.ufl.edu and request a Jupyter notebook session with 1 node and 2 GPUs:

  • Number of CPUs = 4
  • Maximum memory (GB) = 8
  • SLURM Account = ai-workshop
  • QoS = ai-workshop
  • Cluster partition = gpu
  • Generic Resource Request = gpu:geforce:2
  • Additional SLURM Options = --reservation=rc-workshop (only used during the workshop period)

Step 1: Get started with the starter code

  • single_gpu.py: Non-distributed training script on a single GPU
  • [How to run]:
```
module load pytorch/1.10
python single_gpu.py 50 10
```
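
For reference, here is a minimal sketch of what a non-distributed trainer of this shape can look like. The Trainer structure and the two positional arguments (total epochs, checkpoint interval) follow the upstream PyTorch tutorial; the actual single_gpu.py in this repo may differ, and the toy model and dataset below are placeholders:

```python
# Minimal single-GPU training loop (illustrative sketch; assumes a CUDA GPU at index 0).
import sys
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class Trainer:
    def __init__(self, model, train_data, optimizer, gpu_id, save_every):
        self.gpu_id = gpu_id
        self.model = model.to(gpu_id)          # move the model to the single GPU
        self.train_data = train_data
        self.optimizer = optimizer
        self.save_every = save_every           # checkpoint interval in epochs

    def _run_batch(self, source, targets):
        self.optimizer.zero_grad()
        loss = nn.functional.mse_loss(self.model(source), targets)
        loss.backward()
        self.optimizer.step()

    def _run_epoch(self, epoch):
        for source, targets in self.train_data:
            source, targets = source.to(self.gpu_id), targets.to(self.gpu_id)
            self._run_batch(source, targets)

    def train(self, total_epochs):
        for epoch in range(total_epochs):
            self._run_epoch(epoch)
            if epoch % self.save_every == 0:
                torch.save(self.model.state_dict(), "checkpoint.pt")

if __name__ == "__main__":
    total_epochs, save_every = int(sys.argv[1]), int(sys.argv[2])   # e.g. 50 10
    dataset = TensorDataset(torch.randn(2048, 20), torch.randn(2048, 1))  # toy data
    train_data = DataLoader(dataset, batch_size=32, shuffle=True)
    model = nn.Linear(20, 1)                                              # toy model
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    Trainer(model, train_data, optimizer, gpu_id=0, save_every=save_every).train(total_epochs)
```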

Step 2

Exercise 2.1: Adapt your serial code to a single-node, multi-process run using the mp.spawn utility and user-specified settings; a minimal sketch follows the test command below.

You can test your code by running ./exercise1_run_multigpu.sh
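
The core changes are setting up a process group in each worker, sharding the data with DistributedSampler, and wrapping the model in DistributedDataParallel. A minimal sketch of that pattern, assuming a localhost rendezvous and the same toy model/dataset as above (not the repo's actual exercise solution):

```python
# Single-node, multi-process DDP training launched with mp.spawn (illustrative sketch).
import os
import torch
import torch.multiprocessing as mp
import torch.nn as nn
from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def ddp_setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"    # single node, so rendezvous is local
    os.environ["MASTER_PORT"] = "12355"        # any free port
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def main(rank, world_size, total_epochs):
    ddp_setup(rank, world_size)
    dataset = TensorDataset(torch.randn(2048, 20), torch.randn(2048, 1))
    # DistributedSampler gives each process a disjoint shard of the dataset
    loader = DataLoader(dataset, batch_size=32, shuffle=False,
                        sampler=DistributedSampler(dataset))
    model = DDP(nn.Linear(20, 1).to(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    for epoch in range(total_epochs):
        loader.sampler.set_epoch(epoch)        # reshuffle the shards each epoch
        for source, targets in loader:
            source, targets = source.to(rank), targets.to(rank)
            optimizer.zero_grad()
            nn.functional.mse_loss(model(source), targets).backward()
            optimizer.step()
    destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()     # 2 GPUs in the workshop session
    # mp.spawn starts world_size processes and passes the process rank as the first argument
    mp.spawn(main, args=(world_size, 5), nprocs=world_size)
```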

Exercise 2.2: Adapt your serial code to a single-node, multi-process run using the torchrun utility; see the sketch after the test command below.

You can test your code by running ./exercise2_run_multigpu_torchrun.sh
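
With torchrun, the launcher spawns the worker processes and populates the rendezvous environment variables, so mp.spawn and the manual MASTER_ADDR/MASTER_PORT setup go away. A minimal sketch under the same toy-model assumption (not the repo's exercise solution):

```python
# torchrun variant: torchrun sets LOCAL_RANK/RANK/WORLD_SIZE, so the script
# only reads the environment (illustrative sketch).
import os
import torch
import torch.nn as nn
from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main(total_epochs):
    init_process_group(backend="nccl")           # rank and world size come from torchrun's env vars
    local_rank = int(os.environ["LOCAL_RANK"])   # GPU index on this node, set by torchrun
    torch.cuda.set_device(local_rank)

    dataset = TensorDataset(torch.randn(2048, 20), torch.randn(2048, 1))
    loader = DataLoader(dataset, batch_size=32, shuffle=False,
                        sampler=DistributedSampler(dataset))
    model = DDP(nn.Linear(20, 1).to(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for epoch in range(total_epochs):
        loader.sampler.set_epoch(epoch)
        for source, targets in loader:
            source, targets = source.to(local_rank), targets.to(local_rank)
            optimizer.zero_grad()
            nn.functional.mse_loss(model(source), targets).backward()
            optimizer.step()
    destroy_process_group()

if __name__ == "__main__":
    main(total_epochs=5)
```

A script like this is typically launched with something along the lines of `torchrun --standalone --nproc_per_node=2 your_script.py`; the provided exercise2_run_multigpu_torchrun.sh likely wraps an equivalent command.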

Step 3: Run multinode parallel jobs using SLURM on HiPerGator (work offline)

You can submit the SLURM job script by running sbatch launch_ddp_2N4G.sh
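
The Python side of a multinode job is the same as the torchrun version above; what changes is the launch (a batch script of this kind typically requests two nodes from SLURM and starts torchrun on each node with rendezvous options; the exact contents of launch_ddp_2N4G.sh are not reproduced here). A small diagnostic you can drop into any of the scripts to confirm the layout (an illustrative sketch, not part of the repo):

```python
# Print the distributed topology so you can verify the multinode launch worked.
import os
import torch
from torch.distributed import init_process_group, get_rank, get_world_size, destroy_process_group

def report_topology():
    init_process_group(backend="nccl")           # rendezvous info comes from the launcher
    local_rank = int(os.environ["LOCAL_RANK"])   # GPU index within this node
    torch.cuda.set_device(local_rank)
    print(f"global rank {get_rank()} / world size {get_world_size()}, "
          f"local rank {local_rank} on host {os.uname().nodename}")
    destroy_process_group()

if __name__ == "__main__":
    report_topology()
```

With two nodes and two GPUs per node you should see a world size of 4 and global ranks 0 through 3, spread across the two hostnames.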

Solutions

You can find the solutions to the exercises in this folder.

Learn more

For a detailed code walkthrough, please follow the Distributed Data Parallel in PyTorch tutorial series.

License

MIT
