Environment info
- transformers version: 4.5.0.dev0
- deepspeed version: 0.3.13
- Platform: Linux-4.15.0-66-generic-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.8
- PyTorch version (GPU?): 1.8.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help
@stas00
Information
I'm interested in training the large T5 models with deepspeed and Hugging Face transformers. More specifically, I want to fine-tune a T5-11B model on a single RTX-8000 48 GB GPU (similar to https://huggingface.co/blog/zero-deepspeed-fairscale, #9996).
However, when I enable deepspeed, GPU memory usage goes up instead of down. For example, running the example seq2seq/run_summarization.py script with T5-Small takes ~6GB without deepspeed and ~8GB with it.
Model I am using: T5
The problem arises when using: The official examples/seq2seq/run_summarization.py script.
Without deepspeed:
python examples/seq2seq/run_summarization.py --model_name_or_path t5-small --do_train --do_eval --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --output_dir /tmp/tst-summarization --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_genera
With deepspeed:
deepspeed examples/seq2seq/run_summarization.py --model_name_or_path t5-small --do_train --do_eval --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --output_dir /tmp/tst-summarization --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --deepspeed examples/tests/deepspeed/ds_config.json
The task I am working on is:
Sequence to sequence generation.
To reproduce
Steps to reproduce the behavior:
- Clone transformers repo
- Install requirements (including deepspeed: pip install deepspeed)
- Run summarization example without deepspeed:
python examples/seq2seq/run_summarization.py --model_name_or_path t5-small --do_train --do_eval --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --output_dir /tmp/tst-summarization --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_genera
- Run summarization example with deepspeed:
deepspeed examples/seq2seq/run_summarization.py --model_name_or_path t5-small --do_train --do_eval --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --output_dir /tmp/tst-summarization --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --deepspeed examples/tests/deepspeed/ds_config.json
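To compare the two runs in the steps above, GPU memory can be sampled from a second terminal while each command is training. The sketch below is one possible way to do that, assuming `nvidia-smi` is on the PATH; the helper names `parse_memory_used` and `query_gpu_memory_mib` are hypothetical and not part of transformers or deepspeed. Note that `nvidia-smi` reports everything the process holds on the device, including the CUDA context and memory cached by the PyTorch allocator, so the figure is an upper bound on what the tensors themselves occupy.

```python
import subprocess


def parse_memory_used(csv_text):
    """Parse used memory (MiB) per GPU from nvidia-smi CSV output.

    Expects the output of:
      nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
    which prints one integer per GPU, one per line.
    """
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def query_gpu_memory_mib():
    """Hypothetical helper: call nvidia-smi and return used MiB for each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # text mode; works on Python 3.6
        check=True,
    )
    return parse_memory_used(result.stdout)


if __name__ == "__main__":
    # Run this while the training command is active and note the value
    # once for the plain run and once for the deepspeed run.
    print(query_gpu_memory_mib())
```

Sampling a few times during training, rather than once, gives a better picture, since usage can grow after the first optimizer step.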
Expected behavior
I would expect deepspeed to reduce the amount of GPU memory used, not increase it.
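For context on that expectation, here is a rough back-of-the-envelope sketch of how much GPU memory the model states alone might need. It assumes Adam-style training at about 16 bytes per parameter (weights, gradients, and two optimizer moments, in either fp32 or the usual mixed-precision layout) and uses approximate parameter counts of 60M for t5-small and 11B for t5-11b. Both the byte count and the function name `model_state_gib` are assumptions for illustration; activations, temporary buffers, and the CUDA context are not included.

```python
def model_state_gib(num_params, bytes_per_param=16):
    """Estimate memory for weights + gradients + Adam states, in GiB.

    bytes_per_param=16 is an assumed figure for Adam-style training;
    it ignores activations, communication buffers, and allocator overhead.
    """
    return num_params * bytes_per_param / 2**30


# Approximate parameter counts (assumed round numbers, not measured).
APPROX_PARAMS = {
    "t5-small": 60_000_000,
    "t5-11b": 11_000_000_000,
}

if __name__ == "__main__":
    for name, n in APPROX_PARAMS.items():
        print(f"{name}: ~{model_state_gib(n):.1f} GiB of model states")
```

Under these assumptions, t5-small's model states come to under 1 GiB, a small part of the ~6GB observed, while t5-11b's would be far larger than a 48 GB card. If that holds, the memory in the T5-Small runs may be dominated by something other than model states, but that is a guess and not a measurement.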