
Fix Trainer with a parallel model#9578

Merged
LysandreJik merged 2 commits into master from fix_trainer_model_parallel on Jan 14, 2021

Conversation

@sgugger (Collaborator) commented on Jan 13, 2021

What does this PR do?

The test introduced in #9566 wasn't actually working, as the default batch size is 8, not 16, so the problem was still there. The reason is that _setup_devices in TrainingArguments is a cached_property, so its result is computed once and for all at init. I had to change the behavior slightly, but it should be okay since it's a private method.

Fixes #9577 (the model was getting wrapped in DataParallel because the value of self.args.n_gpu was not updated).
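The caching behavior at the heart of the bug can be illustrated with a minimal sketch using Python's functools.cached_property (the class and attribute names here are illustrative, not the transformers implementation):

```python
# Minimal sketch (hypothetical names) of why a cached property does not
# reflect state changes made after its first use.
from functools import cached_property

class ArgsSketch:
    def __init__(self, n_devices):
        self.n_devices = n_devices

    @cached_property
    def setup(self):
        # Computed on first access, then cached on the instance.
        return f"configured for {self.n_devices} device(s)"

args = ArgsSketch(2)
print(args.setup)   # computes and caches the result
args.n_devices = 1  # later change is invisible to the cached value
print(args.setup)   # still reports 2 device(s)
```

A later change only becomes visible if the cached entry is invalidated (e.g. `del args.setup`), which is why state that must stay current cannot simply live inside the cached computation.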


if is_torch_available() and self.device.type != "cuda" and self.fp16:
    raise ValueError("Mixed precision training with AMP or APEX (`--fp16`) can only be used on CUDA devices.")
self._n_gpu = torch.cuda.device_count()
@sgugger (Collaborator, Author):
Removing this from here; it is going to be completely set up in _setup_devices.
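The pattern the comment describes can be sketched as follows (illustrative names only, not the actual transformers code): computing _n_gpu inside the device-setup routine keeps it consistent with the rest of the device state, rather than freezing it at init.

```python
# Hypothetical sketch of the "set _n_gpu inside _setup_devices" pattern.
# device_count_fn stands in for torch.cuda.device_count so the sketch
# runs without torch; all names are illustrative.
class TrainingArgsSketch:
    def __init__(self, device_count_fn):
        self._device_count_fn = device_count_fn
        self._n_gpu = -1  # sentinel: devices not set up yet

    def _setup_devices(self):
        # The single place where device state, including _n_gpu, is decided.
        self._n_gpu = self._device_count_fn()
        return "cuda" if self._n_gpu > 0 else "cpu"

    @property
    def n_gpu(self):
        # Ensure setup has run before reporting the GPU count.
        if self._n_gpu == -1:
            self._setup_devices()
        return self._n_gpu

args = TrainingArgsSketch(lambda: 2)
print(args.n_gpu)  # 2
```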

Comment thread tests/test_trainer.py
model.is_parallelizable = True
model.model_parallel = True
trainer = Trainer(model=model, train_dataset=RegressionDataset(), eval_dataset=RegressionDataset())
args = TrainingArguments("./regression", per_device_train_batch_size=16, per_device_eval_batch_size=16)
@sgugger (Collaborator, Author):
Make sure the test uses batch sizes of 16.

Comment thread tests/test_trainer.py
trainer = Trainer(model, args, train_dataset=RegressionDataset(), eval_dataset=RegressionDataset())
# Check the Trainer was fooled
self.assertTrue(trainer.is_model_parallel)
self.assertEqual(trainer.args.n_gpu, 1)
@sgugger (Collaborator, Author):
This was still set to 2 before the fix, so this checks it is indeed 1 now.
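The decision this test exercises can be sketched schematically (should_wrap_in_data_parallel is a hypothetical helper, not part of the transformers API): a model flagged as parallelizable and already model-parallel should not be wrapped in DataParallel, even when several GPUs are visible.

```python
# Schematic sketch of the check; not the actual Trainer implementation.
def should_wrap_in_data_parallel(model, n_gpu):
    is_model_parallel = getattr(model, "is_parallelizable", False) and getattr(
        model, "model_parallel", False
    )
    # Wrap only when several GPUs are visible and the model is not
    # already spread across devices by itself.
    return n_gpu > 1 and not is_model_parallel

class ParallelModelStub:
    is_parallelizable = True
    model_parallel = True

class PlainModelStub:
    pass

print(should_wrap_in_data_parallel(ParallelModelStub(), 2))  # False
print(should_wrap_in_data_parallel(PlainModelStub(), 2))     # True
```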

Comment thread src/transformers/training_args.py
@LysandreJik (Member) left a comment:

LGTM, thanks @sgugger

@LysandreJik LysandreJik merged commit 5e1bea4 into master Jan 14, 2021
@LysandreJik LysandreJik deleted the fix_trainer_model_parallel branch January 14, 2021 08:23
LysandreJik pushed a commit that referenced this pull request Jan 14, 2021
* Fix Trainer with a parallel model

* More clean up

Successfully merging this pull request may close these issues.

Trainer is using DataParallel on parallelized models
