This repository implements several parallelism techniques for distributed training of GPT-2. It supports Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), and Tensor Parallel (TP) training, as well as 2D combinations of these.
- Multiple NVIDIA GPUs (minimum 2 for single parallelism, minimum 4 for 2D parallelism)
This code was developed and tested on:
- Development: A6000 48GB x 2 or A6000 48GB x 4 instances via Prime Intellect with PyTorch 2.5 CUDA 12.4 image (provided by Hyperstack cloud)
- Final evaluation: 8x A100 (40 GB SXM4) instance through Lambda Labs
# Clone the repository
git clone https://github.com/rushilbhat/parallelism-experiments.git
# Navigate to the project directory
cd parallelism-experiments
# Install dependencies
pip install -r requirements.txt

Before training, you need to prepare the dataset:
# For Shakespeare dataset (small, prepares quickly)
python data/shakespeare/prepare.py
# For OpenWebText dataset (large, takes 20-30 minutes)
python data/openwebtext/prepare.py

The Shakespeare dataset is a toy dataset whose train and validation splits are generated instantly. OpenWebText is a heavy-duty dataset that takes approximately 20-30 minutes to create.
| Argument | Description |
|---|---|
| --tensor_parallel_size | Degree of tensor parallelism (default: 1) |
| --enable_loss_parallelism | Enable loss parallelism with tensor parallel training |
| --data_parallel_size | Degree of data parallelism (default: 1) |
| --data_parallel_type | Choose data parallelisation strategy: ddp or fsdp |
| --implementation | Choose distributed implementation: custom or pytorch |
| --deferred_init | Delay materialisation of model parameters until sharding is applied |
| --gradient_clipping | Enable gradient clipping during training |
| --eval_interval | Interval between evaluations on validation set (default: 25) |
| --dataset | Choose dataset: shakespeare or openwebtext (default: shakespeare) |
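
To make the flag semantics concrete, the arguments above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the repository's actual `train.py`: the function name `build_parser` is hypothetical, and the defaults for `--data_parallel_type` and `--implementation` are assumptions because the table does not state them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (illustration only)."""
    p = argparse.ArgumentParser(description="GPT-2 parallelism experiments (sketch)")
    p.add_argument("--tensor_parallel_size", type=int, default=1)
    p.add_argument("--enable_loss_parallelism", action="store_true")
    p.add_argument("--data_parallel_size", type=int, default=1)
    # Assumed defaults: the argument table does not say which value is the default.
    p.add_argument("--data_parallel_type", choices=["ddp", "fsdp"], default="ddp")
    p.add_argument("--implementation", choices=["custom", "pytorch"], default="custom")
    p.add_argument("--deferred_init", action="store_true")
    p.add_argument("--gradient_clipping", action="store_true")
    p.add_argument("--eval_interval", type=int, default=25)
    p.add_argument(
        "--dataset", choices=["shakespeare", "openwebtext"], default="shakespeare"
    )
    return p


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["--tensor_parallel_size=2", "--enable_loss_parallelism"]
)
print(args.tensor_parallel_size, args.enable_loss_parallelism, args.dataset)
```

Boolean switches such as `--deferred_init` are modelled here as `store_true` flags, which matches how they are passed without a value in the example configurations.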
For distributed training, use torchrun to launch the training script across multiple GPUs:
torchrun --standalone --nproc_per_node=N train.py [arguments]

Example configurations:
| Configuration | Arguments |
|---|---|
| DDP (2 GPUs) | --data_parallel_size=2 --data_parallel_type=ddp --implementation=custom |
| FSDP (2 GPUs) | --data_parallel_size=2 --data_parallel_type=fsdp --implementation=custom --deferred_init |
| TP (2 GPUs) | --tensor_parallel_size=2 --enable_loss_parallelism |
| FSDP + TP (8 GPUs) | --tensor_parallel_size=2 --data_parallel_size=4 --data_parallel_type=fsdp --deferred_init --dataset=openwebtext |
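
For the 2D configurations, it can help to see how global ranks might be split into a data-parallel index and a tensor-parallel index. The sketch below assumes that tensor-parallel ranks are adjacent (TP is the innermost dimension), which is a common convention but is not confirmed by this README; `rank_to_coords` is a hypothetical helper, not a function from the repository.

```python
from typing import Tuple


def rank_to_coords(
    rank: int, tensor_parallel_size: int, data_parallel_size: int, num_gpus: int
) -> Tuple[int, int]:
    """Map a global rank to (dp_rank, tp_rank), assuming TP ranks are adjacent."""
    if tensor_parallel_size * data_parallel_size != num_gpus:
        raise ValueError(
            f"tensor_parallel_size * data_parallel_size must equal num_gpus, got "
            f"{tensor_parallel_size} * {data_parallel_size} != {num_gpus}"
        )
    if not 0 <= rank < num_gpus:
        raise ValueError(f"rank {rank} is outside [0, {num_gpus})")
    # divmod returns (rank // tp_size, rank % tp_size) = (dp_rank, tp_rank).
    return divmod(rank, tensor_parallel_size)


# Layout for the FSDP + TP (8 GPUs) row: tensor_parallel_size=2, data_parallel_size=4.
for rank in range(8):
    dp_rank, tp_rank = rank_to_coords(rank, 2, 4, 8)
    print(f"rank {rank}: dp_rank={dp_rank}, tp_rank={tp_rank}")
```

Under this assumed layout, ranks 0 and 1 form one tensor-parallel group, ranks 2 and 3 the next, and so on, while ranks with the same `tp_rank` form a data-parallel group.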
Unit tests are included for each parallelism implementation and are designed to run on 2 GPUs. To run a specific test:
torchrun --standalone --nproc_per_node=2 -m unittest tests/[filename]

- test_ddp.py: Tests custom DDP vs PyTorch DDP
- test_fsdp.py: Tests custom FSDP vs PyTorch FSDP
- test_tp.py: Tests sharding consistency and numerical correctness of tensor parallelism
Training metrics are logged to TensorBoard. To view the logs when running on a remote server:
- Start TensorBoard on the remote server:
  tensorboard --logdir=runs --port=6006
- Create an SSH tunnel from your local machine to the remote server:
  ssh -L 6006:localhost:6006 user@remote-server
- Open TensorBoard in your local browser:
  http://localhost:6006
- For optimal performance with FSDP, use the `--deferred_init` flag to delay parameter materialisation
- When using 2D parallelism (TP + DP), ensure that `tensor_parallel_size * data_parallel_size = num_gpus`
- Enabling loss parallelism (`--enable_loss_parallelism`) with tensor parallelism typically improves performance
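
As background on why loss parallelism can help: when the output logits are sharded across the vocabulary dimension, cross-entropy can be computed from per-shard statistics, so the full logits never need to be gathered on one device. The pure-Python sketch below illustrates that general idea on a single process. It is not the repository's implementation; the function names are hypothetical, and the `max` and `sum` over shards stand in for all-reduce operations.

```python
import math
from typing import List


def cross_entropy_full(logits: List[float], target: int) -> float:
    """Reference: cross-entropy over the full vocabulary on one device."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def cross_entropy_sharded(shards: List[List[float]], target: int) -> float:
    """Same loss computed from vocabulary shards, combining only scalars.

    Each inner list stands in for the logits held by one tensor-parallel rank.
    """
    # Stand-in for all-reduce(MAX): global maximum for numerical stability.
    global_max = max(max(s) for s in shards)
    # Stand-in for all-reduce(SUM): sum of exponentials across all shards.
    global_sum = sum(sum(math.exp(x - global_max) for x in s) for s in shards)
    # Only the shard that owns the target index contributes the target logit.
    offset = 0
    target_logit = 0.0
    for s in shards:
        if offset <= target < offset + len(s):
            target_logit = s[target - offset]
        offset += len(s)
    return global_max + math.log(global_sum) - target_logit


logits = [2.0, -1.0, 0.5, 3.0, 0.0, 1.5]
shards = [logits[:3], logits[3:]]  # vocabulary split across two ranks
print(cross_entropy_full(logits, 4), cross_entropy_sharded(shards, 4))
```

The two functions agree up to floating-point rounding, while the sharded version exchanges only three scalars per token instead of a full row of logits.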