Information
Model I am using (Bert, XLNet ...): T5/BART/mBART/Marian (the reproduction below uses sshleifer/tiny-mbart)
The problem arises when using: the official example script examples/seq2seq/finetune_trainer.py, run with --sharded_ddp together with --freeze_embeds

The task I am working on is: translation, using the official WMT English-Romanian dataset
To reproduce
Steps to reproduce the behavior:
Run

python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/finetune_trainer.py \
--model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir \
~/Downloads/wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 \
--logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 \
--num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 4 --sortish_sampler \
--src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 \
--n_train 500 --sharded_ddp
will fail with

Traceback (most recent call last):
  File "examples/seq2seq/finetune_trainer.py", line 379, in <module>
    main()
  File "examples/seq2seq/finetune_trainer.py", line 316, in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
  File "/home/sgugger/git/transformers/src/transformers/trainer.py", line 821, in train
    self.optimizer.step()
  File "/home/sgugger/.pyenv/versions/base/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/sgugger/git/fairscale/fairscale/optim/oss.py", line 210, in step
    self._broadcast_params()
  File "/home/sgugger/git/fairscale/fairscale/optim/oss.py", line 522, in _broadcast_params
    if self.should_bucket_param[param]:
KeyError: Parameter containing:
tensor([[-0.0296,  0.0038],
        [ 0.0000,  0.0000],
        [ 0.0298,  0.0385],
        ...,
        [-0.0161, -0.0024],
        [ 0.0022, -0.0576],
        [ 0.0053,  0.0256]], device='cuda:1')
Running the same command with --fp16 also fails.
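To make the KeyError easier to read: should_bucket_param is a mapping keyed by torch.nn.Parameter objects, and the error means _broadcast_params reached a parameter that the mapping was never populated for, plausibly one of the parameters that --freeze_embeds left with requires_grad=False. A minimal, self-contained sketch of that failure shape (an illustrative reduction, not fairscale's actual code):

```python
from torch import nn

# Illustrative reduction of the suspected failure shape, not fairscale code:
# a per-parameter table is built over trainable parameters only, while a
# later broadcast-style loop walks every parameter, frozen ones included.
model = nn.Linear(2, 2)
model.bias.requires_grad = False  # stand-in for a parameter frozen by --freeze_embeds

# Table keyed by Parameter, populated only where requires_grad is True.
should_bucket_param = {p: True for p in model.parameters() if p.requires_grad}

for p in model.parameters():
    # The frozen bias is missing from the table, so this raises
    # KeyError: Parameter containing: tensor(...), matching the traceback above.
    if should_bucket_param[p]:
        pass
```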
Expected behavior
The script should run to completion.
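For context on the suspected trigger: --freeze_embeds in the seq2seq examples boils down to switching requires_grad off on the model's embedding parameters, roughly like the helper below (a paraphrase for illustration, not the exact source):

```python
from torch import nn

def freeze_params(module: nn.Module) -> None:
    # Rough paraphrase of the seq2seq example helper: mark every parameter
    # of the given module as frozen so the optimizer skips it.
    for p in module.parameters():
        p.requires_grad = False
```

If that is indeed the trigger, the command should run once either --freeze_embeds or --sharded_ddp is dropped, which would confirm it is the combination of the two that fails.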