[2021-01-27 14:59:52,982] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-01-27 14:59:57,024] [INFO] [runner.py:358:main] cmd = /<my_dir>/miniconda3/envs/nlp/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 ./finetune_trainer.py --model_name_or_path /<my_model_dir>/models/mt5/xl/v0 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --predict_with_generate --eval_steps 25000 --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16
[2021-01-27 14:59:57,793] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2021-01-27 14:59:57,793] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=4, node_rank=0
[2021-01-27 14:59:57,793] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2021-01-27 14:59:57,793] [INFO] [launch.py:100:main] dist_world_size=4
[2021-01-27 14:59:57,793] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2021-01-27 15:00:01,106] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
[2021-01-27 15:00:01,340] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
[2021-01-27 15:00:01,672] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
[2021-01-27 15:00:01,870] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
01/27/2021 15:00:05 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
01/27/2021 15:00:05 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: True
01/27/2021 15:00:05 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: True
01/27/2021 15:00:05 - INFO - __main__ - Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='output_dir', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=True, evaluation_strategy=<EvaluationStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=1, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_steps=5, logging_dir='runs/Jan27_15-00-01_user-SYS-4029GP-TRT', logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level='O1', fp16_backend='auto', local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=25000, dataloader_num_workers=0, past_index=-1, run_name='output_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed='ds_config.json', label_smoothing_factor=0.1, adafactor=False, sortish_sampler=True, predict_with_generate=True)
01/27/2021 15:00:05 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: True
[INFO|configuration_utils.py:443] 2021-01-27 15:00:05,352 >> loading configuration file /<my_model_dir>/models/mt5/xl/v0/config.json
[INFO|configuration_utils.py:481] 2021-01-27 15:00:05,353 >> Model config MT5Config {
"_name_or_path": "/home/patrick/t5/mt5-xl",
"architectures": [
"T5ForConditionalGeneration"
],
"d_ff": 5120,
"d_kv": 64,
"d_model": 2048,
"decoder_start_token_id": 0,
"dropout_rate": 0.1,
"eos_token_id": 1,
"feed_forward_proj": "gated-gelu",
"initializer_factor": 1.0,
"is_encoder_decoder": true,
"layer_norm_epsilon": 1e-06,
"model_type": "mt5",
"num_decoder_layers": 24,
"num_heads": 32,
"num_layers": 24,
"output_past": true,
"pad_token_id": 0,
"relative_attention_num_buckets": 32,
"tie_word_embeddings": false,
"tokenizer_class": "T5Tokenizer",
"transformers_version": "4.2.1",
"use_cache": true,
"vocab_size": 250112
}
[INFO|configuration_utils.py:443] 2021-01-27 15:00:05,353 >> loading configuration file /<my_model_dir>/models/mt5/xl/v0/config.json
[INFO|configuration_utils.py:481] 2021-01-27 15:00:05,354 >> Model config MT5Config {...} [identical to the config block above]
[INFO|tokenization_utils_base.py:1685] 2021-01-27 15:00:05,354 >> Model name '/<my_model_dir>/models/mt5/xl/v0' not found in model shortcut name list (t5-small, t5-base, t5-large, t5-3b, t5-11b). Assuming '/<my_model_dir>/models/mt5/xl/v0' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1718] 2021-01-27 15:00:05,354 >> Didn't find file /<my_model_dir>/models/mt5/xl/v0/tokenizer.json. We won't load it.
[INFO|tokenization_utils_base.py:1718] 2021-01-27 15:00:05,355 >> Didn't find file /<my_model_dir>/models/mt5/xl/v0/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1718] 2021-01-27 15:00:05,355 >> Didn't find file /<my_model_dir>/models/mt5/xl/v0/special_tokens_map.json. We won't load it.
[INFO|tokenization_utils_base.py:1718] 2021-01-27 15:00:05,355 >> Didn't find file /<my_model_dir>/models/mt5/xl/v0/tokenizer_config.json. We won't load it.
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file /<my_model_dir>/models/mt5/xl/v0/spiece.model
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file None
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file None
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file None
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file None
[INFO|modeling_utils.py:1025] 2021-01-27 15:00:06,472 >> loading weights file /<my_model_dir>/models/mt5/xl/v0/pytorch_model.bin
[INFO|modeling_utils.py:1143] 2021-01-27 15:05:03,683 >> All model checkpoint weights were used when initializing MT5ForConditionalGeneration.
[INFO|modeling_utils.py:1152] 2021-01-27 15:05:03,683 >> All the weights of MT5ForConditionalGeneration were initialized from the model checkpoint at /<my_model_dir>/models/mt5/xl/v0.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MT5ForConditionalGeneration for predictions without further training.
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 230, in main
    freeze_embeds(model)
  File "/<my_dir>/transformers/examples/seq2seq/utils.py", line 567, in freeze_embeds
    freeze_params(model.model.shared)
  File "/<my_dir>/miniconda3/envs/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'MT5ForConditionalGeneration' object has no attribute 'model'
[the identical traceback is raised by each of the four ranks; their interleaved output has been collapsed here]
Command being timed: "deepspeed --num_gpus=4 ./finetune_trainer.py --model_name_or_path /<my_model_dir>/models/mt5/xl/v0 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --predict_with_generate --eval_steps 25000 --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16"
User time (seconds): 348.34
System time (seconds): 177.55
Percent of CPU this job got: 166%
Elapsed (wall clock) time (h:mm:ss or m:ss): 5:15.88
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 33558800
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 67111048
Voluntary context switches: 132337
Involuntary context switches: 6635761
Swaps: 0
File system inputs: 29248712
File system outputs: 32
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Environment info
transformers version: 4.2.2

Who can help
@stas00, @patrickvonplaten, @patil-suraj
Information
Model I am using: MT5-xl, MT5-large
The problem arises when using: the official example scripts (examples/seq2seq/finetune_trainer.py with the `--deepspeed` option)
The task I am working on is: translation (WMT en-ro, `--task translation_en_to_ro`)
To reproduce
Steps to reproduce the behavior:
examples/seq2seq/finetune_trainer.py, which was originally used to reproduce the training of T5-3b on a single 3090. The procedure is the same as in #8771, and it does reproduce the training of T5-3b (on a single card as well as on 2 or 4 cards). `--freeze_embeds` seems to trigger the bug. I used 4x3090; my script and the resulting report are shown above.
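The traceback shows `freeze_embeds` in examples/seq2seq/utils.py reaching for `model.model.shared`, which is the BART-style layout; MT5ForConditionalGeneration, like T5, keeps its shared embedding at `model.shared`, and its `model_type` is "mt5" (see the config above), so a dispatch that only recognizes "t5" falls through to the BART branch. Below is a minimal duck-typed sketch of a fix; the names `freeze_params`/`freeze_embeds` match the helpers in the traceback, but the bodies are an illustration, not the repository's actual code:

```python
# Sketch of a more robust freeze_embeds for examples/seq2seq/utils.py.
# Assumption: T5/MT5-style models expose `shared` directly, while BART-style
# models nest everything under `model.model`; duck-typing on the attribute
# handles "mt5" the same way as "t5" without listing model types.
import torch.nn as nn

def freeze_params(module: nn.Module) -> None:
    """Disable gradient updates for every parameter of `module`."""
    for p in module.parameters():
        p.requires_grad = False

def freeze_embeds(model: nn.Module) -> None:
    """Freeze token (and, where present, positional) embeddings."""
    if hasattr(model, "shared"):
        # T5/MT5: the shared embedding lives on the top-level module.
        freeze_params(model.shared)
        for stack in (model.encoder, model.decoder):
            freeze_params(stack.embed_tokens)
    else:
        # BART-style: embeddings live under model.model.
        freeze_params(model.model.shared)
        for stack in (model.model.encoder, model.model.decoder):
            freeze_params(stack.embed_tokens)
            freeze_params(stack.embed_positions)
```

Checking `hasattr(model, "shared")` rather than `config.model_type == "t5"` also keeps other T5 variants from hitting the same error.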
I then removed `--freeze_embeds` and tried to train MT5-xl again, but I got CUDA out of memory. My setup is 4x 24GB 3090s, with batch size 1, ZeRO stage 2, and CPU offload enabled. T5-3b and MT5-xl should be in the same order of magnitude in size, and T5-3b trains fine in this setup, so I don't think this should happen.

Expected behavior

MT5-xl should fine-tune under DeepSpeed just as T5-3b does, both with and without `--freeze_embeds`.
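For completeness, a ds_config.json along these lines matches the settings described above (fp16, ZeRO stage 2 with CPU offload, micro-batch size 1 per GPU). This is a plausible sketch in the style of the transformers DeepSpeed examples of that era, since the actual file is not included in the report:

```json
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8
  },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1
}
```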