Sharded DDP training fails with seq2seq models

## Information

Model I am using (Bert, XLNet ...): T5/BART/mBART/Marian

The problem arises when using:
* [x] the official example scripts: (give details below)
* [ ] my own modified scripts: (give details below)

The task I am working on is:
* [x] an official GLUE/SQuAD task: seq2seq
* [ ] my own task or dataset: (give details below)

## To reproduce

Steps to reproduce the behavior:

Running
```
python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/finetune_trainer.py \
--model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir \
~/Downloads/wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 \
--logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 \
--num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 4 --sortish_sampler \
--src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 \
--n_train 500 --sharded_ddp
```
will fail with
```
Traceback (most recent call last):
  File "examples/seq2seq/finetune_trainer.py", line 379, in <module>
    main()
  File "examples/seq2seq/finetune_trainer.py", line 316, in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
  File "/home/sgugger/git/transformers/src/transformers/trainer.py", line 821, in train
    self.optimizer.step()
  File "/home/sgugger/.pyenv/versions/base/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/sgugger/git/fairscale/fairscale/optim/oss.py", line 210, in step
    self._broadcast_params()
  File "/home/sgugger/git/fairscale/fairscale/optim/oss.py", line 522, in _broadcast_params
    if self.should_bucket_param[param]:
KeyError: Parameter containing:
tensor([[-0.0296,  0.0038],
        [ 0.0000,  0.0000],
        [ 0.0298,  0.0385],
        ...,
        [-0.0161, -0.0024],
        [ 0.0022, -0.0576],
        [ 0.0053,  0.0256]], device='cuda:1')
0%|   
```
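
The traceback ends in a dictionary lookup keyed by the parameter itself, so the `KeyError` means `_broadcast_params` reached a parameter that was never entered into `should_bucket_param`. The sketch below only illustrates that failure shape in plain Python. `FakeParam`, `build_bucket_map` and `find_unregistered` are hypothetical stand-ins, not fairscale code, and the idea that frozen parameters (for example via `--freeze_embeds`) are the ones left out is an unconfirmed assumption.

```python
class FakeParam:
    """Hypothetical stand-in for a torch Parameter (hashed by identity)."""

    def __init__(self, name, requires_grad=True):
        self.name = name
        self.requires_grad = requires_grad


def build_bucket_map(params):
    # ASSUMPTION for illustration only: the map is built from trainable
    # parameters, so frozen ones never get an entry.
    return {p: False for p in params if p.requires_grad}


def find_unregistered(params, should_bucket_param):
    # Mimics iterating over *all* parameters and looking each one up.
    # Collects the names that would raise KeyError instead of crashing.
    missing = []
    for p in params:
        try:
            should_bucket_param[p]
        except KeyError:
            missing.append(p.name)
    return missing


params = [FakeParam("embed_tokens", requires_grad=False), FakeParam("decoder")]
missing = find_unregistered(params, build_bucket_map(params))
print(missing)
```

If that assumption holds, rerunning the command above without `--freeze_embeds` would be a quick way to check whether frozen parameters are involved.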

Using FP16 also fails.

## Expected behavior

The script should run to completion.


Sharded DDP training fails with seq2seq models #9156
