[trainer] seq2seq doesn't handle mt5 correctly #9865

@mxa4646

Description

Environment info

  • transformers version: 4.2.2
  • Platform: Linux-5.4.0-58-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.7
  • PyTorch version (GPU?): 1.7.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help

@stas00, @patrickvonplaten, @patil-suraj

Information

Model I am using (MT5-xl, MT5-large):

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on are:

  • an official GLUE/SQUaD task: (official example scripts task)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. The script I used is examples/seq2seq/finetune_trainer.py, which I originally used to reproduce the training of T5-3b on a single 3090. My process is the same as in #8771, and it can reproduce the training of T5-3b (whether on a single card or on 2/4 cards).
  2. Here is the problem: when I try to train MT5-xl, --freeze_embeds seems to trigger a bug (see the sketch after this list for the failing code path and a possible workaround). I used 4*3090, and my script is:
export BS=1; PYTHONPATH=../../src; USE_TF=0;
/usr/bin/time -v deepspeed --num_gpus=4 ./finetune_trainer.py --model_name_or_path /<my_model_dir>/models/mt5/xl/v0 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --predict_with_generate --eval_steps 25000 --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16

Here is my report:

[2021-01-27 14:59:52,982] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-01-27 14:59:57,024] [INFO] [runner.py:358:main] cmd = /<my_dir>/miniconda3/envs/nlp/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 ./finetune_trainer.py --model_name_or_path /<my_model_dir>/models/mt5/xl/v0 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --predict_with_generate --eval_steps 25000 --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16
[2021-01-27 14:59:57,793] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2021-01-27 14:59:57,793] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=4, node_rank=0
[2021-01-27 14:59:57,793] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2021-01-27 14:59:57,793] [INFO] [launch.py:100:main] dist_world_size=4
[2021-01-27 14:59:57,793] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2021-01-27 15:00:01,106] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
[2021-01-27 15:00:01,340] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
[2021-01-27 15:00:01,672] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
[2021-01-27 15:00:01,870] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
01/27/2021 15:00:05 - WARNING - __main__ -   Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
01/27/2021 15:00:05 - WARNING - __main__ -   Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: True
01/27/2021 15:00:05 - WARNING - __main__ -   Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: True
01/27/2021 15:00:05 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='output_dir', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=True, evaluation_strategy=<EvaluationStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=1, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_steps=5, logging_dir='runs/Jan27_15-00-01_user-SYS-4029GP-TRT', logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level='O1', fp16_backend='auto', local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=25000, dataloader_num_workers=0, past_index=-1, run_name='output_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed='ds_config.json', label_smoothing_factor=0.1, adafactor=False, sortish_sampler=True, predict_with_generate=True)
01/27/2021 15:00:05 - WARNING - __main__ -   Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: True
[INFO|configuration_utils.py:443] 2021-01-27 15:00:05,352 >> loading configuration file /<my_model_dir>/models/mt5/xl/v0/config.json
[INFO|configuration_utils.py:481] 2021-01-27 15:00:05,353 >> Model config MT5Config {
  "_name_or_path": "/home/patrick/t5/mt5-xl",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 5120,
  "d_kv": 64,
  "d_model": 2048,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "mt5",
  "num_decoder_layers": 24,
  "num_heads": 32,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.2.1",
  "use_cache": true,
  "vocab_size": 250112
}

[INFO|configuration_utils.py:443] 2021-01-27 15:00:05,353 >> loading configuration file /<my_model_dir>/models/mt5/xl/v0/config.json
[INFO|configuration_utils.py:481] 2021-01-27 15:00:05,354 >> Model config MT5Config {
  "_name_or_path": "/home/patrick/t5/mt5-xl",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 5120,
  "d_kv": 64,
  "d_model": 2048,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "mt5",
  "num_decoder_layers": 24,
  "num_heads": 32,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.2.1",
  "use_cache": true,
  "vocab_size": 250112
}

[INFO|tokenization_utils_base.py:1685] 2021-01-27 15:00:05,354 >> Model name '/<my_model_dir>/models/mt5/xl/v0' not found in model shortcut name list (t5-small, t5-base, t5-large, t5-3b, t5-11b). Assuming '/<my_model_dir>/models/mt5/xl/v0' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1718] 2021-01-27 15:00:05,354 >> Didn't find file /<my_model_dir>/models/mt5/xl/v0/tokenizer.json. We won't load it.
[INFO|tokenization_utils_base.py:1718] 2021-01-27 15:00:05,355 >> Didn't find file /<my_model_dir>/models/mt5/xl/v0/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1718] 2021-01-27 15:00:05,355 >> Didn't find file /<my_model_dir>/models/mt5/xl/v0/special_tokens_map.json. We won't load it.
[INFO|tokenization_utils_base.py:1718] 2021-01-27 15:00:05,355 >> Didn't find file /<my_model_dir>/models/mt5/xl/v0/tokenizer_config.json. We won't load it.
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file /<my_model_dir>/models/mt5/xl/v0/spiece.model
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file None
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file None
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file None
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file None
[INFO|modeling_utils.py:1025] 2021-01-27 15:00:06,472 >> loading weights file /<my_model_dir>/models/mt5/xl/v0/pytorch_model.bin
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 230, in main
    freeze_embeds(model)
  File "/<my_dir>/transformers/examples/seq2seq/utils.py", line 567, in freeze_embeds
[INFO|modeling_utils.py:1143] 2021-01-27 15:05:03,683 >> All model checkpoint weights were used when initializing MT5ForConditionalGeneration.

[INFO|modeling_utils.py:1152] 2021-01-27 15:05:03,683 >> All the weights of MT5ForConditionalGeneration were initialized from the model checkpoint at /<my_model_dir>/models/mt5/xl/v0.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MT5ForConditionalGeneration for predictions without further training.
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 230, in main
    freeze_embeds(model)
  File "/<my_dir>/transformers/examples/seq2seq/utils.py", line 567, in freeze_embeds
    freeze_params(model.model.shared)
  File "/<my_dir>/miniconda3/envs/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    freeze_params(model.model.shared)
  File "/<my_dir>/miniconda3/envs/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'MT5ForConditionalGeneration' object has no attribute 'model'
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'MT5ForConditionalGeneration' object has no attribute 'model'
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 230, in main
    freeze_embeds(model)
  File "/<my_dir>/transformers/examples/seq2seq/utils.py", line 567, in freeze_embeds
    freeze_params(model.model.shared)
  File "/<my_dir>/miniconda3/envs/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'MT5ForConditionalGeneration' object has no attribute 'model'
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 230, in main
    freeze_embeds(model)
  File "/<my_dir>/transformers/examples/seq2seq/utils.py", line 567, in freeze_embeds
    freeze_params(model.model.shared)
  File "/<my_dir>/miniconda3/envs/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'MT5ForConditionalGeneration' object has no attribute 'model'
	Command being timed: "deepspeed --num_gpus=4 ./finetune_trainer.py --model_name_or_path /<my_model_dir>/models/mt5/xl/v0 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --predict_with_generate --eval_steps 25000 --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16"
	User time (seconds): 348.34
	System time (seconds): 177.55
	Percent of CPU this job got: 166%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 5:15.88
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 33558800
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1
	Minor (reclaiming a frame) page faults: 67111048
	Voluntary context switches: 132337
	Involuntary context switches: 6635761
	Swaps: 0
	File system inputs: 29248712
	File system outputs: 32
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
  3. So I removed --freeze_embeds and tried to train MT5-xl again, but I got CUDA out of memory. My device is 4*24GB 3090s, with BS=1, ZeRO stage 2, and CPU offload enabled. I assume T5-3b and MT5-xl should be of the same order of magnitude, and since I can train T5-3b this way, I think this should not happen.
  4. I also tried training MT5-large, simply replacing mt5-xl with mt5-large under the same conditions as in 3, and I got the overflow problem. This does not surprise me, because MT5-large does not seem to have its FP16 issue fixed yet. In short, I want to know whether there is a problem with my setup or whether this is simply the current state of things. If it is because MT5-large has not been fixed yet, does Hugging Face have any plans to fix it?
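
For reference, the traceback above comes from freeze_embeds in examples/seq2seq/utils.py: it appears to branch on model.config.model_type, and only "t5" is treated as a T5-style model, so MT5 (model_type "mt5") falls through to the BART-style branch that accesses model.model.shared, an attribute MT5ForConditionalGeneration does not have. Below is a minimal workaround sketch that freezes the embeddings directly before training; freeze_embeds_mt5 is a hypothetical helper written here for illustration, not part of the library, and it assumes the T5/MT5 module layout (model.shared plus encoder/decoder embed_tokens).

# Minimal workaround sketch, assuming the T5/MT5 module layout.
# freeze_embeds_mt5 is a hypothetical helper, not part of transformers.
from transformers import MT5ForConditionalGeneration

def freeze_params(module):
    # Same idea as freeze_params in examples/seq2seq/utils.py:
    # stop gradient updates for every parameter of the module.
    for p in module.parameters():
        p.requires_grad = False

def freeze_embeds_mt5(model):
    # MT5ForConditionalGeneration exposes the shared embedding as model.shared
    # (not model.model.shared, which is the BART-style layout), so freeze it
    # and each stack's embed_tokens directly.
    freeze_params(model.shared)
    for stack in (model.encoder, model.decoder):
        freeze_params(stack.embed_tokens)

model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
freeze_embeds_mt5(model)
assert not any(p.requires_grad for p in model.shared.parameters())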

Expected behavior

  1. Why can't MT5-xl be trained on 4*3090? Or what should I do? (A rough parameter-count estimate is sketched below.)
  2. Can MT5-large be used with FP16 (mainly with DeepSpeed)? If not, is there any plan to fix it?
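
As a rough sanity check on question 1: from the config printed in the log (d_model=2048, d_ff=5120, 24 encoder and 24 decoder layers, 32 heads with d_kv=64, vocab_size=250112, untied word embeddings, gated-gelu FFN), MT5-xl should be noticeably larger than T5-3b, mostly because of the 250k vocabulary and the untied lm_head. Here is a back-of-the-envelope estimate, assuming the standard T5 v1.1/MT5 layer layout and ignoring biases and layer norms; this is my own arithmetic, not an official figure.

# Rough parameter count for MT5-xl from the config values in the log above.
# Assumes the T5 v1.1/MT5 layout (gated-gelu FFN, untied lm_head); layer norms
# and relative-attention biases are ignored, so this is only an approximation.
d_model, d_ff, d_kv, heads = 2048, 5120, 64, 32
enc_layers, dec_layers, vocab = 24, 24, 250112

attn = 4 * d_model * (heads * d_kv)      # q, k, v, o projections
ffn = 3 * d_model * d_ff                 # wi_0, wi_1 (gated-gelu) and wo
enc = enc_layers * (attn + ffn)          # self-attention + FFN per encoder layer
dec = dec_layers * (2 * attn + ffn)      # self- and cross-attention + FFN
emb = 2 * vocab * d_model                # shared embedding + untied lm_head

print(f"~{(enc + dec + emb) / 1e9:.2f}B parameters")  # ~3.74B

If this arithmetic is right, MT5-xl has roughly 3.7B parameters versus roughly 2.85B for t5-3b, with about 1B of them in the embedding and lm_head matrices alone, which may at least partly explain why a setup that fits T5-3b runs out of memory for MT5-xl.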
