
Fix data parallelism in Trainer #9566

Merged
sgugger merged 2 commits into master from fix_trainer_model_parallel on Jan 13, 2021
Conversation

@sgugger (Collaborator) commented on Jan 13, 2021

What does this PR do?

A bug in data parallelism was introduced in #9451 (mostly because of some surprising behavior of dataclasses in Python): data was no longer parallelized across GPUs, and the effective batch size ended up divided by the number of GPUs.
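
For illustration, here is a minimal, hypothetical sketch of that kind of failure mode. The names (`per_device_train_batch_size`, `_n_gpu`, `train_batch_size`) mirror `TrainingArguments` conventions, but this is not the actual code from #9451; it only shows how a dataclass field that is never populated can collapse the batch-size arithmetic to a single device.

```python
# A minimal, hypothetical sketch of the failure mode; the names mirror
# TrainingArguments conventions, but this is not the code from #9451.
from dataclasses import dataclass, field

import torch


@dataclass
class Args:
    per_device_train_batch_size: int = 8
    # Dataclass __init__ (re)sets every field to its default, so a sentinel
    # like this silently survives unless device setup explicitly runs.
    _n_gpu: int = field(init=False, default=-1)

    def _setup_devices(self) -> None:
        # The only place the real GPU count gets filled in.
        self._n_gpu = torch.cuda.device_count()

    @property
    def n_gpu(self) -> int:
        # BUG (illustrative): reads the field without ensuring
        # _setup_devices() ever ran, so it returns the -1 sentinel.
        return self._n_gpu

    @property
    def train_batch_size(self) -> int:
        # Trainer-style arithmetic: total batch = per-device size * GPUs.
        # With n_gpu stuck at -1, max(1, n_gpu) == 1, so a 4-GPU machine
        # processes 8 samples per step instead of 32.
        return self.per_device_train_batch_size * max(1, self.n_gpu)


args = Args()
assert args.train_batch_size == 8  # GPUs are effectively ignored
```

In a sketch like this, the cure is for the `n_gpu` accessor to trigger `_setup_devices()` before reading the field, so the GPU count is always populated by the time the batch size is computed.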

This PR fixes that and, to make sure it didn't break the behavior introduced in #9451 for model parallelism, adds a multi-GPU test (passing locally) ensuring data is not parallelized when the model is parallelized; see the sketch below.
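
As a hedged illustration (not the actual test added by this PR), a multi-GPU regression test of this shape could look like the following. `DummyModel`, `DummyDataset`, and the output directory name are placeholders, and `is_parallelizable`/`model_parallel` are assumed to be the attributes the Trainer inspects to decide a model is already parallelized.

```python
# Hypothetical sketch of a multi-GPU regression test; not the test from
# this PR. DummyModel/DummyDataset stand in for the suite's helpers.
import torch
from torch import nn
from torch.utils.data import Dataset

from transformers import Trainer, TrainingArguments
from transformers.testing_utils import require_torch_multi_gpu


class DummyDataset(Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, i):
        return {"x": torch.rand(4), "labels": torch.rand(1)}


class DummyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 1)

    def forward(self, x, labels=None):
        preds = self.linear(x)
        if labels is None:
            return (preds,)
        return (nn.functional.mse_loss(preds, labels), preds)


@require_torch_multi_gpu
def test_no_data_parallelism_when_model_parallel():
    model = DummyModel()
    # Flags the Trainer checks to decide the model is already parallelized.
    model.is_parallelizable = True
    model.model_parallel = True
    args = TrainingArguments("test_output", per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args, train_dataset=DummyDataset())
    # With model parallelism active, data parallelism must stay off: the
    # train batch size is the plain per-device size, not size * n_gpu.
    assert trainer.args.train_batch_size == 16
```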

sgugger requested a review from LysandreJik on January 13, 2021, 14:34
@LysandreJik (Member) left a comment


Great, LGTM! Thanks @sgugger!

Review thread on src/transformers/training_args.py (outdated)
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
sgugger merged commit 04dc65e into master on January 13, 2021
sgugger deleted the fix_trainer_model_parallel branch on January 13, 2021, 14:54
