Training with DeepSpeed takes more GPU memory than without DeepSpeed #10929

@oriyor

Description

Environment info

  • transformers version: 4.5.0.dev0
  • deepspeed version: 0.3.13
  • Platform: Linux-4.15.0-66-generic-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.8
  • PyTorch version (GPU?): 1.8.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help

@stas00

Information

I'm interested in training the large T5 models with DeepSpeed and Hugging Face Transformers. More specifically, I'm interested in fine-tuning a T5-11B model on a single RTX-8000 48 GB GPU (similarly to https://huggingface.co/blog/zero-deepspeed-fairscale, #9996).

However, when I use DeepSpeed, GPU memory usage increases. For example, running the example seq2seq/run_summarization.py script with T5-Small takes ~6 GB without DeepSpeed and ~8 GB with DeepSpeed.

Model I am using: T5

The problem arises when using: The official examples/seq2seq/run_summarization.py script.

Without DeepSpeed:
python examples/seq2seq/run_summarization.py --model_name_or_path t5-small --do_train --do_eval --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --output_dir /tmp/tst-summarization --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate

With DeepSpeed:
deepspeed examples/seq2seq/run_summarization.py --model_name_or_path t5-small --do_train --do_eval --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --output_dir /tmp/tst-summarization --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --deepspeed examples/tests/deepspeed/ds_config.json

The task I am working on is:
Sequence-to-sequence generation.

To reproduce

Steps to reproduce the behavior:

  1. Clone transformers repo
  2. Install requirements (including DeepSpeed: pip install deepspeed)
  3. Run the summarization example without DeepSpeed:
    python examples/seq2seq/run_summarization.py --model_name_or_path t5-small --do_train --do_eval --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --output_dir /tmp/tst-summarization --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate
  4. Run the summarization example with DeepSpeed (a sketch of a comparable config appears after this list):
    deepspeed examples/seq2seq/run_summarization.py --model_name_or_path t5-small --do_train --do_eval --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --output_dir /tmp/tst-summarization --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --deepspeed examples/tests/deepspeed/ds_config.json
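
For reference, I believe the examples/tests/deepspeed/ds_config.json used above enables ZeRO stage 2 with fp16. Treat the following as a minimal sketch of that kind of config (all values are illustrative), not a verbatim copy of the file in any particular checkout:

    {
      "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 16
      },
      "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "cpu_offload": false
      }
    }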

Expected behavior

I would expect that using DeepSpeed would reduce the amount of GPU memory used.
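
One thing that may be relevant here (my assumption, not verified in this setup): on a single GPU there are no other ranks to partition optimizer state across, so ZeRO-2's savings come mostly from CPU offload, while DeepSpeed still allocates its own fp16 and communication buffers on top of the model. In deepspeed 0.3.x the switch is cpu_offload inside zero_optimization (later releases moved this under offload_optimizer); a sketch with illustrative bucket sizes:

    {
      "zero_optimization": {
        "stage": 2,
        "cpu_offload": true,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "allgather_bucket_size": 2e8,
        "reduce_bucket_size": 2e8
      }
    }

Shrinking the two bucket sizes should also trade some throughput for lower peak memory.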
