
Fix data parallelism in Trainer #9566

Merged
sgugger merged 2 commits into master from fix_trainer_model_parallel on Jan 13, 2021
Conversation

@sgugger (Collaborator) commented on Jan 13, 2021

What does this PR do?

A bug in data parallelism was introduced in #9451 (mostly because of some surprising behavior of dataclasses in Python): data was no longer parallelized across GPUs, and the effective batch size ended up divided by the number of GPUs.
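
For illustration, here is a minimal, hypothetical sketch of that kind of failure mode. The names (`per_device_train_batch_size`, `_n_gpu`, `train_batch_size`) mirror `TrainingArguments` conventions, but this is not the actual code from #9451; it only shows how a dataclass field that is never populated can collapse the batch-size arithmetic to a single device.

```python
# A minimal, hypothetical sketch of the failure mode; the names mirror
# TrainingArguments conventions, but this is not the code from #9451.
from dataclasses import dataclass, field

import torch


@dataclass
class Args:
    per_device_train_batch_size: int = 8
    # Dataclass __init__ (re)sets every field to its default, so a sentinel
    # like this silently survives unless device setup explicitly runs.
    _n_gpu: int = field(init=False, default=-1)

    def _setup_devices(self) -> None:
        # The only place the real GPU count gets filled in.
        self._n_gpu = torch.cuda.device_count()

    @property
    def n_gpu(self) -> int:
        # BUG (illustrative): reads the field without ensuring
        # _setup_devices() ever ran, so it returns the -1 sentinel.
        return self._n_gpu

    @property
    def train_batch_size(self) -> int:
        # Trainer-style arithmetic: total batch = per-device size * GPUs.
        # With n_gpu stuck at -1, max(1, n_gpu) == 1, so a 4-GPU machine
        # processes 8 samples per step instead of 32.
        return self.per_device_train_batch_size * max(1, self.n_gpu)


args = Args()
assert args.train_batch_size == 8  # GPUs are effectively ignored
```

In a sketch like this, the cure is for the `n_gpu` accessor to trigger `_setup_devices()` before reading the field, so the GPU count is always populated by the time the batch size is computed.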

This PR fixes that and, to make sure it didn't break the behavior introduced in #9451 for model parallelism, adds a multi-GPU test (passing locally) ensuring data is not parallelized when the model is parallelized; see the sketch below.
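
As a hedged illustration (not the actual test added by this PR), a multi-GPU regression test of this shape could look like the following. `DummyModel`, `DummyDataset`, and the output directory name are placeholders, and `is_parallelizable`/`model_parallel` are assumed to be the attributes the Trainer inspects to decide a model is already parallelized.

```python
# Hypothetical sketch of a multi-GPU regression test; not the test from
# this PR. DummyModel/DummyDataset stand in for the suite's helpers.
import torch
from torch import nn
from torch.utils.data import Dataset

from transformers import Trainer, TrainingArguments
from transformers.testing_utils import require_torch_multi_gpu


class DummyDataset(Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, i):
        return {"x": torch.rand(4), "labels": torch.rand(1)}


class DummyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 1)

    def forward(self, x, labels=None):
        preds = self.linear(x)
        if labels is None:
            return (preds,)
        return (nn.functional.mse_loss(preds, labels), preds)


@require_torch_multi_gpu
def test_no_data_parallelism_when_model_parallel():
    model = DummyModel()
    # Flags the Trainer checks to decide the model is already parallelized.
    model.is_parallelizable = True
    model.model_parallel = True
    args = TrainingArguments("test_output", per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args, train_dataset=DummyDataset())
    # With model parallelism active, data parallelism must stay off: the
    # train batch size is the plain per-device size, not size * n_gpu.
    assert trainer.args.train_batch_size == 16
```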

sgugger requested a review from LysandreJik on January 13, 2021, 14:34
@LysandreJik (Member) left a comment


Great, LGTM! Thanks @sgugger!

Review thread on src/transformers/training_args.py (outdated)
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
sgugger merged commit 04dc65e into master on January 13, 2021
sgugger deleted the fix_trainer_model_parallel branch on January 13, 2021, 14:54
