
Sharded DDP training fails with seq2seq models #9156

@sgugger


Information

Model I am using (Bert, XLNet ...): T5/BART/mBART/Marian

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: seq2seq
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

Run

python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/finetune_trainer.py \
--model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir \
~/Downloads/wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 \
--logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 \
--num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 4 --sortish_sampler \
--src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 \
--n_train 500 --sharded_ddp

will fail with

Traceback (most recent call last):
  File "examples/seq2seq/finetune_trainer.py", line 379, in <module>
    main()
  File "examples/seq2seq/finetune_trainer.py", line 316, in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
  File "/home/sgugger/git/transformers/src/transformers/trainer.py", line 821, in train
    self.optimizer.step()
  File "/home/sgugger/.pyenv/versions/base/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/sgugger/git/fairscale/fairscale/optim/oss.py", line 210, in step
    self._broadcast_params()
  File "/home/sgugger/git/fairscale/fairscale/optim/oss.py", line 522, in _broadcast_params
    if self.should_bucket_param[param]:
KeyError: Parameter containing:
tensor([[-0.0296,  0.0038],
        [ 0.0000,  0.0000],
        [ 0.0298,  0.0385],
        ...,
        [-0.0161, -0.0024],
        [ 0.0022, -0.0576],
        [ 0.0053,  0.0256]], device='cuda:1')
0%|   

Using FP16 also fails.
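
For readers unfamiliar with this kind of failure, here is a minimal pure-Python sketch of one way a lookup like `self.should_bucket_param[param]` can raise a KeyError: the mapping is built from a subset of the parameters but later indexed with every parameter. This is an illustration only, not fairscale's code. The `Param` class, `build_bucket_map`, `broadcast_all`, and the assumption that frozen parameters (for example those frozen by `--freeze_embeds`) are the ones left out of the mapping are all hypothetical; the actual cause has not been established here.

```python
# Hypothetical sketch: NOT fairscale's actual implementation.
class Param:
    """Stand-in for a model parameter; hashed by object identity."""

    def __init__(self, name, requires_grad=True):
        self.name = name
        self.requires_grad = requires_grad

    def __repr__(self):
        return f"Param({self.name!r}, requires_grad={self.requires_grad})"


def build_bucket_map(params):
    """Assumption: a bucketing decision is recorded only for trainable params."""
    return {p: True for p in params if p.requires_grad}


def broadcast_all(params, should_bucket_param):
    """Index the map with every parameter, trainable or not."""
    return [should_bucket_param[p] for p in params]


# A frozen parameter (hypothetically, an embedding frozen before training)
# next to a regular trainable one.
embed = Param("shared_embedding", requires_grad=False)
dense = Param("dense.weight")
params = [embed, dense]

should_bucket_param = build_bucket_map(params)

try:
    broadcast_all(params, should_bucket_param)
    failed_on = None
except KeyError as err:
    # The missing key is the parameter that was never registered.
    failed_on = err.args[0]

print("lookup failed on:", failed_on)
```

If something like this is what happens, checking whether the run still fails without `--freeze_embeds` would help narrow it down.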

Expected behavior

The script should run to completion.
