@@ -10,7 +10,7 @@ homo I-OTH
" O
in O
enger O
Auseinandersetzung O
Ause inandersetzung O
mit O
diesem O
Bild O
4 changes: 4 additions & 0 deletions autotuning/.gitignore
@@ -0,0 +1,4 @@
autotuning_results*
autotuning_exps*
output*
mnli
3 changes: 3 additions & 0 deletions autotuning/README.md
@@ -0,0 +1,3 @@
# Autotuning Examples

This showcases the [autotuning](https://github.com/microsoft/DeepSpeed/tree/master/deepspeed/autotuning) feature in DeepSpeed (DS).
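
At a high level, autotuning is enabled by adding an `autotuning` section (with `"enabled": true`) to the DeepSpeed config and launching the training script through the `deepspeed` launcher with the `--autotuning` flag, as in the example scripts under `hf/`. A minimal sketch of the launch pattern (the script name and config path below are placeholders) is:

```bash
# Minimal sketch: the training script and config names are placeholders.
# ds_config.json is assumed to contain an "autotuning" section with "enabled": true.
deepspeed --autotuning run your_training_script.py --deepspeed ds_config.json
```
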
62 changes: 62 additions & 0 deletions autotuning/hf/README.md
@@ -0,0 +1,62 @@
# Autotuning Hugging Face Examples

This showcases the [autotuning](https://github.com/microsoft/DeepSpeed/tree/master/deepspeed/autotuning) feature in DeepSpeed (DS) with Hugging Face (HF) models.

## List of Models

- [DistilBERT](distilbert)
- [BERT-base](bert-base)
- [BERT-large](bert-large)
- [GPT2](gpt2)
- [GPT2-medium](gpt2-medium)
- [GPT2-large](gpt2-large)
- [GPT2-xl](gpt2-xl)
- [DeBERTa](deberta)

Each model folder has a `test_tune.sh` script (an example invocation follows the list):

- `./test_tune.sh tune` autotunes the model training and then runs it with the selected tuned DeepSpeed configuration.
- `./test_tune.sh 0` runs the model using HF without DeepSpeed.
- `./test_tune.sh z0` runs the model using HF + DS with ZeRO optimization disabled.
- `./test_tune.sh z1` runs the model using HF + DS with ZeRO optimization stage 1.
- `./test_tune.sh z2` runs the model using HF + DS with ZeRO optimization stage 2.
- `./test_tune.sh z3` runs the model using HF + DS with ZeRO optimization stage 3.
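
For example, a typical session in one of the model folders (using `bert-base` here; output locations depend on the variables set at the top of the script) might look like:

```bash
cd autotuning/hf/bert-base

# Autotune, then train with the selected tuned DeepSpeed configuration.
./test_tune.sh tune

# Compare against a plain HF run and an HF + DS run with ZeRO stage 1.
./test_tune.sh 0
./test_tune.sh z1
```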


## Testing Environment

The training runs on 1 node with 16 NVIDIA V100 GPUs. The autotuning uses the same hardware resources as the training.
The HF packages below are used.

HF examples require installing the `transformers` package from source:
```bash
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install .
```
The `datasets` package can be installed with `pip install datasets`.

Below are the versions used in this test.

- transformers (4.12.0)
- datasets (1.11.0)

## Throughput Comparison

The table below shows the throughput (samples per second) comparison. The corresponding train micro-batch size per GPU (mbs or tmbspg) and ZeRO stage used to achieve the throughput value are also shown in parentheses. Assume that the hand-tuning strategy is to start from `mbs = 1` and double `mbs` each time until running out of GPU memory.
- `baseline` is the vanilla HF without DeepSpeed (DS) and mbs is hand-tuned.
- `HF + DS hand-tuned` is HF with DS, and mbs is hand-tuned while the other DS configuration parameters use default values.
- `HF + DS autotuning` is HF with DS, and the DS configuration is selected by autotuning.

Notation: Hugging Face (HF), DeepSpeed (DS), ZeRO stage (z), gradient accumulation steps (gas), train micro-batch size per GPU (mbs or tmbspg).

| Model name | num_params | baseline (vanilla HF) | HF + DS hand-tuned | HF + DS autotuning (fast-mode) | throughput improvement over baseline | autotuning time (mins) | number of experiments |
| :----------: | :--------: | :---------------------------: | :----------------------------------: | :----------------------------: | :----------------------------------: | :--------------------: | :-------------------: |
| DistilBERT | 66M | 5161.902 (gas = 1, mbs = 256) | 5305.067 (z = 0, gas = 1, mbs = 256) | 5305.067 (z0_gas1_tmbspg256) | 1.03x | 11 | 11 |
| BERT-base | 0.11B | 2502.236 (gas = 1, mbs = 128) | 2523.684 (z = 0, gas = 1, mbs = 128) | 2736.561 (z0_gas1_tmbspg235) | 1.09x | 35 | 34 |
| BERT-large | 0.34B | 742.692 (gas = 1, mbs = 64) | 766.929 (z = 1, gas = 1, mbs = 64) | 808.168 (z1_gas1_tmbspg93) | 1.09x | 36 | 22 |
| GPT2 | 0.12B | 284.142 (gas = 1, mbs = 8) | 397.827 (z = 1, gas = 1, mbs = 8) | 431.586 (z1_gas1_tmbspg14) | 1.52x | 25 | 17 |
| GPT2-medium | 0.35B | 71.61 (gas = 1, mbs = 2) | 142.211 (z = 1, gas = 1, mbs = 4) | 163.3 (z1_gas1_tmbspg6) | 2.28x | 15 | 25 |
| GPT2-large | 0.77B | 27.874 (gas = 1, mbs = 1) | 56.797 (z = 1, gas = 1, mbs = 2) | 69.061 (z = 1, mbs = 3) | 2.48x | 27 | 13 |
| GPT2-xl | 1.5B | Not runnable | 27.462 (gas = 1, mbs = 1) | 27.497 (z1_gas1_tmbspg1) | inf | 21 | 9 |
| DeBERTa | 1.5B | Not runnable | 140.587 (z = 1, gas = 1, mbs = 8) | 162.395 (z1_gas1_tmbspg11) | inf | 40 | 12 |
58 changes: 58 additions & 0 deletions autotuning/hf/bert-base/README.md
@@ -0,0 +1,58 @@
# [bert-base-cased](https://huggingface.co/bert-base-cased)

This model has the following configuration:

- 12-layer
- 768 hidden dimension
- 12 attention heads
- 110M parameters.

## Environment

The training uses fp32 and runs on 1 node with 16 NVIDIA V100 GPUs. The autotuning uses the same hardware resources as the training. `max_train_batch_size` is set to `4096`.
The HF packages below are used.

HF examples require installing the `transformers` package from source:
```bash
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install .
```
The `datasets` package can be installed with `pip install datasets`.

Below are the versions used in this test.

- transformers (4.12.0)
- datasets (1.11.0)

## Throughput Comparison

The table below shows the throughput (samples per second) comparison. The corresponding train micro-batch size per GPU (mbs or tmbspg) and ZeRO stage used to achieve the throughput value are also shown in parentheses. Assume that the hand-tuning strategy is to start from `mbs = 1` and double `mbs` each time until running out of GPU memory.
- `baseline` is the vanilla HF without DeepSpeed (DS) and mbs is hand-tuned.
- `HF + DS hand-tuned` is HF with DS, and mbs is hand-tuned while the other DS configuration parameters use default values.
- `HF + DS autotuning` is HF with DS, and the DS configuration is selected by autotuning.

Notation: Hugging Face (HF), DeepSpeed (DS), ZeRO stage (z), gradient accumulation steps (gas), train micro-batch size per GPU (mbs or tmbspg).

| Model name | baseline (vanilla HF) | HF + DS hand-tuned | HF + DS autotuning |
| ---------- | ----------------------------- | ------------------------------------ | ---------------------------- |
| BERT-base | 2502.236 (gas = 1, mbs = 128) | 2523.684 (z = 0, gas = 1, mbs = 128) | 2736.561 (z0_gas1_tmbspg235) |

## Detailed `HF + DS autotuning` Result Summary

Note that the performance metric used in autotuning is calculated from the timings captured within the DeepSpeed forward, backward, and step functions. The sum of these timings is less than the actual training step latency, so the throughput values used by autotuning are higher than the end-to-end training throughput.
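
As a rough sketch of the relationship (the exact bookkeeping inside the autotuner may differ), the autotuning metric is approximately

$$
\text{throughput}_{\text{autotuning}} \approx \frac{N_{\text{GPU}} \cdot \text{mbs} \cdot \text{gas}}{t_{\text{fwd}} + t_{\text{bwd}} + t_{\text{step}}}
$$

whereas the end-to-end training throughput divides the same number of samples by the full iteration time, which additionally includes data loading and other per-step overhead, so it can only be lower.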

- Fast-mode Autotuning time: 35 mins
- Number of experiments: 34
- Throughput Improvement over baseline: 1.09x


| tuning_space | num_experiments | best_metric_val | best_exp_name |
| :----------- | --------------: | --------------: | :---------------- |
| z0 | 9 | 2930.18 | z0_gas1_tmbspg235 |
| z1 | 7 | 2930.17 | z1_gas1_tmbspg235 |
| z2 | 8 | 2744.16 | z2_gas1_tmbspg235 |
| z3 | 10 | 2479.47 | z3_gas1_tmbspg238 |
| global | 34 | 2930.18 | z0_gas1_tmbspg235 |

Tuning completed in 0:34:41.842250. Total number of experiments: 34.
12 changes: 12 additions & 0 deletions autotuning/hf/bert-base/ds_config_tune.json
@@ -0,0 +1,12 @@
{
"train_micro_batch_size_per_gpu": "auto",
"autotuning": {
"enabled": true,
"overwrite": false,
"max_train_batch_size": 4096,
"arg_mappings": {
"train_micro_batch_size_per_gpu": "--per_device_train_batch_size",
"gradient_accumulation_steps ": "--gradient_accumulation_steps"
}
}
}
114 changes: 114 additions & 0 deletions autotuning/hf/bert-base/test_tune.sh
@@ -0,0 +1,114 @@
TASK_NAME=mnli
MODEL_NAME=bert-base-cased
HF_PATH=~/projects
PER_DEVICE_TRAIN_BATCH_SIZE=64
MAX_TRAIN_BATCH_SIZE=4096
NEPOCHS=1
NGPUS=16
NNODES=1
MAX_STEPS=200
OUTPUT_DIR=./${TASK_NAME}/output_b${PER_DEVICE_TRAIN_BATCH_SIZE}_g${NGPUS}_$MAX_STEPS

TEST=$1
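# $1 selects the run mode: 0 (plain HF), z0-z3 (HF + DS with the given ZeRO stage),
# tune (DS autotuning), fs (HF with --sharded_ddp zero_dp_2).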

if [ ${TEST} == "0" ]
then
python -m torch.distributed.launch --nproc_per_node=$NGPUS $HF_PATH/transformers/examples/pytorch/text-classification/run_glue.py \
--model_name_or_path $MODEL_NAME \
--task_name $TASK_NAME \
--do_train \
--max_seq_length 128 \
--per_device_train_batch_size $PER_DEVICE_TRAIN_BATCH_SIZE \
--learning_rate 2e-5 \
--num_train_epochs $NEPOCHS \
--output_dir ${OUTPUT_DIR}_b${PER_DEVICE_TRAIN_BATCH_SIZE}_0 \
--overwrite_output_dir \
--save_steps 0 \
--max_steps $MAX_STEPS \
--save_strategy "no"
elif [ ${TEST} == "z0" ]
then
deepspeed --num_nodes=$NNODES $HF_PATH/transformers/examples/pytorch/text-classification/run_glue.py --deepspeed ../dsconfigs/ds_config_z0.json \
--model_name_or_path $MODEL_NAME \
--task_name $TASK_NAME \
--do_train \
--max_seq_length 128 \
--per_device_train_batch_size $PER_DEVICE_TRAIN_BATCH_SIZE \
--learning_rate 2e-5 \
--num_train_epochs $NEPOCHS \
--output_dir ${OUTPUT_DIR}_b${PER_DEVICE_TRAIN_BATCH_SIZE}_z0 \
--save_steps 0 \
--overwrite_output_dir \
--max_steps $MAX_STEPS
elif [ ${TEST} == "z1" ]
then
deepspeed --num_nodes=$NNODES $HF_PATH/transformers/examples/pytorch/text-classification/run_glue.py --deepspeed ../dsconfigs/ds_config_z1.json \
--model_name_or_path $MODEL_NAME \
--task_name $TASK_NAME \
--do_train \
--max_seq_length 128 \
--per_device_train_batch_size $PER_DEVICE_TRAIN_BATCH_SIZE \
--learning_rate 2e-5 \
--num_train_epochs $NEPOCHS \
--output_dir ${OUTPUT_DIR}_b${PER_DEVICE_TRAIN_BATCH_SIZE}_z1 \
--save_steps 0 \
--overwrite_output_dir \
--max_steps $MAX_STEPS
elif [ ${TEST} == "z2" ]
then
deepspeed --num_nodes=$NNODES $HF_PATH/transformers/examples/pytorch/text-classification/run_glue.py --deepspeed ../dsconfigs/ds_config_z2.json \
--model_name_or_path $MODEL_NAME \
--task_name $TASK_NAME \
--do_train \
--max_seq_length 128 \
--per_device_train_batch_size $PER_DEVICE_TRAIN_BATCH_SIZE \
--learning_rate 2e-5 \
--num_train_epochs $NEPOCHS \
--output_dir ${OUTPUT_DIR}_b${PER_DEVICE_TRAIN_BATCH_SIZE}_z2 \
--save_steps 0 \
--overwrite_output_dir \
--max_steps $MAX_STEPS
elif [ ${TEST} == "z3" ]
then
deepspeed --num_nodes=$NNODES $HF_PATH/transformers/examples/pytorch/text-classification/run_glue.py --deepspeed ../dsconfigs/ds_config_z3.json \
--model_name_or_path $MODEL_NAME \
--task_name $TASK_NAME \
--do_train \
--max_seq_length 128 \
--per_device_train_batch_size $PER_DEVICE_TRAIN_BATCH_SIZE \
--learning_rate 2e-5 \
--num_train_epochs $NEPOCHS \
--output_dir ${OUTPUT_DIR}_b${PER_DEVICE_TRAIN_BATCH_SIZE}_z3 \
--save_steps 0 \
--overwrite_output_dir \
--max_steps $MAX_STEPS
elif [ ${TEST} == "tune" ]
then
deepspeed --autotuning run --num_nodes=$NNODES $HF_PATH/transformers/examples/pytorch/text-classification/run_glue.py --deepspeed ./ds_config_tune.json \
--model_name_or_path $MODEL_NAME \
--task_name $TASK_NAME \
--do_train \
--max_seq_length 128 \
--per_device_train_batch_size $PER_DEVICE_TRAIN_BATCH_SIZE \
--learning_rate 2e-5 \
--num_train_epochs $NEPOCHS \
--output_dir ${OUTPUT_DIR}_tune \
--save_steps 0 \
--overwrite_output_dir \
--max_steps $MAX_STEPS
elif [ ${TEST} == "fs" ]
then
python -m torch.distributed.launch --nproc_per_node=$NGPUS $HF_PATH/transformers/examples/pytorch/text-classification/run_glue.py \
--model_name_or_path $MODEL_NAME \
--task_name $TASK_NAME \
--do_train \
--max_seq_length 128 \
--per_device_train_batch_size $PER_DEVICE_TRAIN_BATCH_SIZE \
--learning_rate 2e-5 \
--num_train_epochs $NEPOCHS \
--output_dir ${OUTPUT_DIR}_b${PER_DEVICE_TRAIN_BATCH_SIZE}_fs \
--overwrite_output_dir \
--save_steps 0 \
--max_steps $MAX_STEPS \
--sharded_ddp zero_dp_2
fi
55 changes: 55 additions & 0 deletions autotuning/hf/bert-large/README.md
@@ -0,0 +1,55 @@
# [bert-large-uncased](https://huggingface.co/bert-large-uncased)

This model has the following configuration:

- 24-layer
- 1024 hidden dimension
- 16 attention heads
- 336M parameters

The training uses fp32 and runs on 1 node with 16 NVIDIA V100 GPUs. The autotuning uses the same hardware resources as the training. `max_train_batch_size` is not defined.
The HF packages below are used.

HF examples require installing the `transformers` package from source:
```bash
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install .
```
The `datasets` package can be installed with `pip install datasets`.

Below are the versions used in this test.

- transformers (4.12.0)
- datasets (1.11.0)

## Throughput Comparison

The table below shows the throughput (samples per second) comparison. The corresponding train micro-batch size per GPU (mbs or tmbspg) and ZeRO stage used to achieve the throughput value are also shown in parentheses. Assume that the hand-tuning strategy is to start from `mbs = 1` and double `mbs` each time until running out of GPU memory.
- `baseline` is the vanilla HF without DeepSpeed (DS) and mbs is hand-tuned.
- `HF + DS hand-tuned` is HF with DS, and mbs is hand-tuned while the other DS configuration parameters use default values.
- `HF + DS autotuning` is HF with DS, and the DS configuration is selected by autotuning.

Notation: Hugging Face (HF), DeepSpeed (DS), ZeRO stage (z), gradient accumulation steps (gas), train micro-batch size per GPU (mbs or tmbspg).

| Model name | baseline (vanilla HF) | HF + DS hand-tuned | HF + DS autotuning |
| ---------- | --------------------------- | --------------------------------- | -------------------------- |
| BERT-large | 742.692 (gas = 1, mbs = 64) | 766.929 (z = 1, gas = 1, mbs = 64) | 808.168 (z1_gas1_tmbspg93) |

## Detailed `HF + DS autotuning` Result Summary

Note that the performance metric used in autotuning is calculated from the timings captured within the DeepSpeed forward, backward, and step functions. The sum of these timings is less than the actual training step latency, so the throughput values used by autotuning are higher than the end-to-end training throughput.

- Fast-mode Autotuning time: 36 mins
- Number of experiments: 22
- Throughput Improvement over baseline: 1.09x

| tuning_space | num_experiments | best_metric_val | best_exp_name |
| :----------- | --------------: | --------------: | :--------------- |
| z0 | 6 | 835.244 | z0_gas1_tmbspg93 |
| z1 | 6 | 842.243 | z1_gas1_tmbspg93 |
| z2 | 9 | 764.524 | z2_gas1_tmbspg94 |
| z3 | 1 | 0 | z3_gas1_tmbspg94 |
| global | 22 | 842.243 | z1_gas1_tmbspg93 |

Tuning completed in 0:36:16.261417. Total number of experiments: 23.
11 changes: 11 additions & 0 deletions autotuning/hf/bert-large/ds_config_tune.json
@@ -0,0 +1,11 @@
{
"train_micro_batch_size_per_gpu": "auto",
"autotuning": {
"enabled": true,
"overwrite": false,
"arg_mappings": {
"train_micro_batch_size_per_gpu": "--per_device_train_batch_size",
"gradient_accumulation_steps ": "--gradient_accumulation_steps"
}
}
}