[trainer] seq2seq doesn't handle mt5 correctly #9865

@mxa4646

Description

Environment info

  • transformers version: 4.2.2
  • Platform: Linux-5.4.0-58-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.7
  • PyTorch version (GPU?): 1.7.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help

@stas00, @patrickvonplaten, @patil-suraj

Information

Model I am using (MT5-xl, MT5-large):

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on are:

  • an official GLUE/SQUaD task: (official example scripts task)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. The script I used is examples/seq2seq/finetune_trainer.py, which I originally used to reproduce the training of T5-3b on a single 3090. My process is the same as in #8771, and it can reproduce the training of T5-3b (whether on a single card or on 2/4 cards).
  2. Here is the problem: when I try to train MT5-xl, --freeze_embeds seems to trigger a bug (see the sketch after this list for the failing code path and a possible workaround). I used 4*3090, and my script is:
export BS=1; PYTHONPATH=../../src; USE_TF=0;
/usr/bin/time -v deepspeed --num_gpus=4 ./finetune_trainer.py --model_name_or_path /<my_model_dir>/models/mt5/xl/v0 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --predict_with_generate --eval_steps 25000 --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16

Here is my report:

[2021-01-27 14:59:52,982] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-01-27 14:59:57,024] [INFO] [runner.py:358:main] cmd = /<my_dir>/miniconda3/envs/nlp/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 ./finetune_trainer.py --model_name_or_path /<my_model_dir>/models/mt5/xl/v0 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --predict_with_generate --eval_steps 25000 --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16
[2021-01-27 14:59:57,793] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2021-01-27 14:59:57,793] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=4, node_rank=0
[2021-01-27 14:59:57,793] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2021-01-27 14:59:57,793] [INFO] [launch.py:100:main] dist_world_size=4
[2021-01-27 14:59:57,793] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2021-01-27 15:00:01,106] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
[2021-01-27 15:00:01,340] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
[2021-01-27 15:00:01,672] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
[2021-01-27 15:00:01,870] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
01/27/2021 15:00:05 - WARNING - __main__ -   Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
01/27/2021 15:00:05 - WARNING - __main__ -   Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: True
01/27/2021 15:00:05 - WARNING - __main__ -   Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: True
01/27/2021 15:00:05 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='output_dir', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=True, evaluation_strategy=<EvaluationStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=1, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_steps=5, logging_dir='runs/Jan27_15-00-01_user-SYS-4029GP-TRT', logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level='O1', fp16_backend='auto', local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=25000, dataloader_num_workers=0, past_index=-1, run_name='output_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed='ds_config.json', label_smoothing_factor=0.1, adafactor=False, sortish_sampler=True, predict_with_generate=True)
01/27/2021 15:00:05 - WARNING - __main__ -   Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: True
[INFO|configuration_utils.py:443] 2021-01-27 15:00:05,352 >> loading configuration file /<my_model_dir>/models/mt5/xl/v0/config.json
[INFO|configuration_utils.py:481] 2021-01-27 15:00:05,353 >> Model config MT5Config {
  "_name_or_path": "/home/patrick/t5/mt5-xl",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 5120,
  "d_kv": 64,
  "d_model": 2048,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "mt5",
  "num_decoder_layers": 24,
  "num_heads": 32,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.2.1",
  "use_cache": true,
  "vocab_size": 250112
}

[INFO|configuration_utils.py:443] 2021-01-27 15:00:05,353 >> loading configuration file /<my_model_dir>/models/mt5/xl/v0/config.json
[INFO|configuration_utils.py:481] 2021-01-27 15:00:05,354 >> Model config MT5Config {
  "_name_or_path": "/home/patrick/t5/mt5-xl",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 5120,
  "d_kv": 64,
  "d_model": 2048,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "mt5",
  "num_decoder_layers": 24,
  "num_heads": 32,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.2.1",
  "use_cache": true,
  "vocab_size": 250112
}

[INFO|tokenization_utils_base.py:1685] 2021-01-27 15:00:05,354 >> Model name '/<my_model_dir>/models/mt5/xl/v0' not found in model shortcut name list (t5-small, t5-base, t5-large, t5-3b, t5-11b). Assuming '/<my_model_dir>/models/mt5/xl/v0' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1718] 2021-01-27 15:00:05,354 >> Didn't find file /<my_model_dir>/models/mt5/xl/v0/tokenizer.json. We won't load it.
[INFO|tokenization_utils_base.py:1718] 2021-01-27 15:00:05,355 >> Didn't find file /<my_model_dir>/models/mt5/xl/v0/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1718] 2021-01-27 15:00:05,355 >> Didn't find file /<my_model_dir>/models/mt5/xl/v0/special_tokens_map.json. We won't load it.
[INFO|tokenization_utils_base.py:1718] 2021-01-27 15:00:05,355 >> Didn't find file /<my_model_dir>/models/mt5/xl/v0/tokenizer_config.json. We won't load it.
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file /<my_model_dir>/models/mt5/xl/v0/spiece.model
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file None
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file None
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file None
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file None
[INFO|modeling_utils.py:1025] 2021-01-27 15:00:06,472 >> loading weights file /<my_model_dir>/models/mt5/xl/v0/pytorch_model.bin
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 230, in main
    freeze_embeds(model)
  File "/<my_dir>/transformers/examples/seq2seq/utils.py", line 567, in freeze_embeds
[INFO|modeling_utils.py:1143] 2021-01-27 15:05:03,683 >> All model checkpoint weights were used when initializing MT5ForConditionalGeneration.

[INFO|modeling_utils.py:1152] 2021-01-27 15:05:03,683 >> All the weights of MT5ForConditionalGeneration were initialized from the model checkpoint at /<my_model_dir>/models/mt5/xl/v0.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MT5ForConditionalGeneration for predictions without further training.
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 230, in main
    freeze_embeds(model)
  File "/<my_dir>/transformers/examples/seq2seq/utils.py", line 567, in freeze_embeds
    freeze_params(model.model.shared)
  File "/<my_dir>/miniconda3/envs/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    freeze_params(model.model.shared)
  File "/<my_dir>/miniconda3/envs/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'MT5ForConditionalGeneration' object has no attribute 'model'
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'MT5ForConditionalGeneration' object has no attribute 'model'
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 230, in main
    freeze_embeds(model)
  File "/<my_dir>/transformers/examples/seq2seq/utils.py", line 567, in freeze_embeds
    freeze_params(model.model.shared)
  File "/<my_dir>/miniconda3/envs/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'MT5ForConditionalGeneration' object has no attribute 'model'
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 230, in main
    freeze_embeds(model)
  File "/<my_dir>/transformers/examples/seq2seq/utils.py", line 567, in freeze_embeds
    freeze_params(model.model.shared)
  File "/<my_dir>/miniconda3/envs/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'MT5ForConditionalGeneration' object has no attribute 'model'
	Command being timed: "deepspeed --num_gpus=4 ./finetune_trainer.py --model_name_or_path /<my_model_dir>/models/mt5/xl/v0 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --predict_with_generate --eval_steps 25000 --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16"
	User time (seconds): 348.34
	System time (seconds): 177.55
	Percent of CPU this job got: 166%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 5:15.88
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 33558800
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1
	Minor (reclaiming a frame) page faults: 67111048
	Voluntary context switches: 132337
	Involuntary context switches: 6635761
	Swaps: 0
	File system inputs: 29248712
	File system outputs: 32
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
  3. So I removed --freeze_embeds and tried to train MT5-xl again, but I got CUDA out of memory. My device is 4*24GB 3090s, with BS=1, ZeRO stage 2, and CPU offload enabled. I assume T5-3b and MT5-xl should be of the same order of magnitude, and since I can train T5-3b this way, I think this should not happen.
  4. I also tried training MT5-large, simply replacing mt5-xl with mt5-large under the same conditions as in 3, and I got the overflow problem. This does not surprise me, because MT5-large does not seem to have its FP16 issue fixed yet. In short, I want to know whether there is a problem with my setup or whether this is simply the current state of things. If it is because MT5-large has not been fixed yet, does Hugging Face have any plans to fix it?
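
For reference, the traceback above comes from freeze_embeds in examples/seq2seq/utils.py: it appears to branch on model.config.model_type, and only "t5" is treated as a T5-style model, so MT5 (model_type "mt5") falls through to the BART-style branch that accesses model.model.shared, an attribute MT5ForConditionalGeneration does not have. Below is a minimal workaround sketch that freezes the embeddings directly before training; freeze_embeds_mt5 is a hypothetical helper written here for illustration, not part of the library, and it assumes the T5/MT5 module layout (model.shared plus encoder/decoder embed_tokens).

# Minimal workaround sketch, assuming the T5/MT5 module layout.
# freeze_embeds_mt5 is a hypothetical helper, not part of transformers.
from transformers import MT5ForConditionalGeneration

def freeze_params(module):
    # Same idea as freeze_params in examples/seq2seq/utils.py:
    # stop gradient updates for every parameter of the module.
    for p in module.parameters():
        p.requires_grad = False

def freeze_embeds_mt5(model):
    # MT5ForConditionalGeneration exposes the shared embedding as model.shared
    # (not model.model.shared, which is the BART-style layout), so freeze it
    # and each stack's embed_tokens directly.
    freeze_params(model.shared)
    for stack in (model.encoder, model.decoder):
        freeze_params(stack.embed_tokens)

model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
freeze_embeds_mt5(model)
assert not any(p.requires_grad for p in model.shared.parameters())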

Expected behavior

  1. Why can't MT5-xl be trained on 4*3090? Or what should I do? (A rough parameter-count estimate is sketched below.)
  2. Can MT5-large be used with FP16 (mainly with DeepSpeed)? If not, is there any plan to fix it?
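
As a rough sanity check on question 1: from the config printed in the log (d_model=2048, d_ff=5120, 24 encoder and 24 decoder layers, 32 heads with d_kv=64, vocab_size=250112, untied word embeddings, gated-gelu FFN), MT5-xl should be noticeably larger than T5-3b, mostly because of the 250k vocabulary and the untied lm_head. Here is a back-of-the-envelope estimate, assuming the standard T5 v1.1/MT5 layer layout and ignoring biases and layer norms; this is my own arithmetic, not an official figure.

# Rough parameter count for MT5-xl from the config values in the log above.
# Assumes the T5 v1.1/MT5 layout (gated-gelu FFN, untied lm_head); layer norms
# and relative-attention biases are ignored, so this is only an approximation.
d_model, d_ff, d_kv, heads = 2048, 5120, 64, 32
enc_layers, dec_layers, vocab = 24, 24, 250112

attn = 4 * d_model * (heads * d_kv)      # q, k, v, o projections
ffn = 3 * d_model * d_ff                 # wi_0, wi_1 (gated-gelu) and wo
enc = enc_layers * (attn + ffn)          # self-attention + FFN per encoder layer
dec = dec_layers * (2 * attn + ffn)      # self- and cross-attention + FFN
emb = 2 * vocab * d_model                # shared embedding + untied lm_head

print(f"~{(enc + dec + emb) / 1e9:.2f}B parameters")  # ~3.74B

If this arithmetic is right, MT5-xl has roughly 3.7B parameters versus roughly 2.85B for t5-3b, with about 1B of them in the embedding and lm_head matrices alone, which may at least partly explain why a setup that fits T5-3b runs out of memory for MT5-xl.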
