Training with DeepSpeed takes more GPU memory than without DeepSpeed

## Environment info


- `transformers` version: 4.5.0.dev0
- deepspeed version: 0.3.13
- Platform: Linux-4.15.0-66-generic-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.8
- PyTorch version (GPU?): 1.8.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no

### Who can help
 @stas00



## Information

I'm interested in training the large T5 models with deepspeed and Hugging Face transformers. More specifically, I'd like to fine-tune a T5-11B model on a single RTX-8000 48 GB GPU (similar to https://huggingface.co/blog/zero-deepspeed-fairscale and https://github.com/huggingface/transformers/issues/9996).
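For context on why the 11B model needs something like deepspeed on a 48 GB card, here is a rough back-of-the-envelope estimate. It is only a sketch under stated assumptions: roughly 11e9 parameters (taken from the model name), fp32 storage at 4 bytes per value, and a plain Adam optimizer that keeps two extra fp32 states per parameter. Activations and framework overhead are ignored, and `model_state_gb` is a hypothetical helper.

```python
def model_state_gb(num_params, bytes_per_param):
    """Return the memory needed for model states in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: ~11e9 parameters, inferred from the "T5-11B" name.
NUM_PARAMS = 11e9

# fp32 weights only: 4 bytes per parameter.
weights_only_gb = model_state_gb(NUM_PARAMS, 4)

# Assumption: fp32 weights (4) + fp32 gradients (4) + two Adam states (8).
full_training_gb = model_state_gb(NUM_PARAMS, 16)

print(f"weights only:            {weights_only_gb:.0f} GB")
print(f"weights + grads + Adam:  {full_training_gb:.0f} GB")
```

Under these assumptions the weights alone nearly fill the card, so most of the remaining state would have to be partitioned or offloaded for training to fit.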

However, when I use deepspeed, GPU memory usage goes up rather than down. For example, running the example seq2seq/run_summarization.py script with T5-Small takes ~6GB without deepspeed and ~8GB with it.

Model I am using: T5

The problem arises when using: the official examples/seq2seq/run_summarization.py script.

Without deepspeed:
python examples/seq2seq/run_summarization.py --model_name_or_path t5-small --do_train  --do_eval --dataset_name cnn_dailymail --dataset_config "3.0.0"  --source_prefix "summarize: " --output_dir /tmp/tst-summarization --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir  --predict_with_genera

With deepspeed:
deepspeed examples/seq2seq/run_summarization.py --model_name_or_path t5-small --do_train  --do_eval --dataset_name cnn_dailymail --dataset_config "3.0.0"  --source_prefix "summarize: " --output_dir /tmp/tst-summarization --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir  --predict_with_generate --deepspeed examples/tests/deepspeed/ds_config.json

The task I am working on is:
Sequence to sequence generation.

## To reproduce

Steps to reproduce the behavior:

1. Clone transformers repo 
2. Install requirements (including deepspeed: pip install deepspeed)
3. Run summarization example without deepspeed:
python examples/seq2seq/run_summarization.py --model_name_or_path t5-small --do_train  --do_eval --dataset_name cnn_dailymail --dataset_config "3.0.0"  --source_prefix "summarize: " --output_dir /tmp/tst-summarization --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir  --predict_with_genera
4. Run summarization example with deepspeed: 
deepspeed examples/seq2seq/run_summarization.py --model_name_or_path t5-small --do_train  --do_eval --dataset_name cnn_dailymail --dataset_config "3.0.0"  --source_prefix "summarize: " --output_dir /tmp/tst-summarization --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir  --predict_with_generate --deepspeed examples/tests/deepspeed/ds_config.json
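
To compare the two runs on equal footing, GPU memory could be sampled with `nvidia-smi` while each command is training. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH and reports `memory.used` in MiB when `nounits` is requested. The helper names are hypothetical and are not part of transformers or deepspeed.

```python
import subprocess

# Query used memory per GPU as bare numbers, one line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_memory_used(output):
    """Parse nvidia-smi CSV output into a list of MiB values, one per GPU."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def read_memory_used():
    """Run nvidia-smi once and return used memory in MiB for each GPU."""
    # check_output with universal_newlines also works on Python 3.6.
    output = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_memory_used(output)


def peak_memory_used(samples):
    """Return the per-GPU maximum over a list of per-GPU samples."""
    return [max(values) for values in zip(*samples)]
```

Calling `read_memory_used()` every few seconds during each run and passing the collected samples to `peak_memory_used` gives one peak figure per GPU for the run without deepspeed and one for the run with it.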

## Expected behavior

I would expect deepspeed to reduce the amount of GPU memory used.
