
[trainer] remove --model_parallel #9451

Merged
sgugger merged 12 commits into huggingface:master from stas00:revert-is_parallel-check
Jan 11, 2021

Conversation

Contributor

@stas00 stas00 commented Jan 7, 2021

Per @sgugger's request removing --model_parallel in trainer, as it was never tested or made to work with the trainer.

We will get back to it in the future.

This PR doesn't introduce breaking changes, since --model_parallel never worked (well other than in my MP PRs that have been parked for now, since they are very inefficient and we are looking for a better approach, rather than waste time on sorting those out).

@LysandreJik, @sgugger

Member

@LysandreJik LysandreJik left a comment


Indeed, LGTM! We should have been more attentive during the review, but no harm done.

@sgugger for info, this was removed here: 9f675b0#diff-ed55888e6665791fe92cc8fc0c499da54f4ace6738551cd9a2591881cda076deL245-L248

Collaborator

sgugger commented Jan 7, 2021

Thanks for putting it back. Since we're in a PR on this test alone, can we "fix" it to ignore the args.model_parallel argument? This argument will be removed/renamed (I'd prefer the first option, as it's not useful) since people are confusing it with something that will enable DataParallel. The test can be replaced by model.is_parallelizable and model.parallel, I believe, with the current API.

Contributor Author

stas00 commented Jan 7, 2021

Two things:

  1. You must be referring to self.model_parallel? But it will always be False unless model.parallelize() is called!

    So while you can rename the argument, you can't remove it: the user needs to activate this explicitly, and the trainer then must activate MP with model.parallelize().

    Wrt DataParallel: why are we turning it on automatically in the first place? Why not make it manual and call it --data_parallel - no more confusion. Loud and clear:

    • --model_parallel
    • --data_parallel
  2. As we discovered last night, the current trainer doesn't work at all with --model_parallel - see [trainer] deepspeed integration #9211 (comment). There is no activation of that parallel mode - nobody calls model.parallelize(), so it's very broken.

I changed this code last night to:

        if self.args.model_parallel:
            if model.is_parallelizable:
                model.parallelize()
            else:
                raise ValueError(
                    f"{model.__class__.__name__} implementation currently doesn't support model parallelism, therefore --model_parallel cl arg cannot be used"
                )

and it doesn't work when I try:

rm -r output_dir; CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=../../src USE_TF=0 ./finetune_trainer.py --model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_train --evaluation_strategy=steps --fp16 --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size 4 --per_device_train_batch_size 4 --predict_with_generate --eval_steps 25000 --save_steps 25000 --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 1 --n_train 2 --n_val 2 --n_test 2 --do_predict --model_parallel

It doesn't look like it ever worked...

i.e. MP works when set up manually but doesn't work in the trainer.

p.s. I tagged you on that discussion - not sure if you saw it.

Collaborator

sgugger commented Jan 7, 2021

i.e. MP works when set up manually but doesn't work in the trainer.
As we discovered last night current trainer doesn't work at all with --model_parallel - see #9211 (comment) there is no activation of that parallel mode - nobody calls model.parallelize() so it's very broken

That's not a discovery on my side, that is exactly why I keep saying that the argument --model_parallel should be removed. It doesn't actually do anything and is confusing for the user. The call to model.parallelize() can always be done outside of Trainer IMO, which is why the test can be changed as suggested. We can think of integrating it inside the Trainer later, when the API is stable and actually used, for now I don't see the point of adding this.

Wrt DataParallel. Why are we turning it on automatically in first place? Why not make it manual and call it --data_parallel

That would be a big breaking change in the API, and beginners actually want to have the parallelism work out of the box when they have several GPUs, so I don't see why change something that works.
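The workflow sgugger describes - the user calling model.parallelize() outside the Trainer, with the Trainer only inspecting the flags - can be sketched with a hypothetical stand-in class (illustrative only, not the real transformers API):

```python
class ToyParallelizableModel:
    """Hypothetical stand-in mimicking the is_parallelizable/model_parallel protocol."""

    is_parallelizable = True  # class-level flag: this architecture supports MP

    def __init__(self):
        self.model_parallel = False  # stays False until parallelize() is called

    def parallelize(self):
        # A real model would shard its layers across GPUs here.
        self.model_parallel = True


# The user activates model parallelism explicitly, *before* handing the model
# to the Trainer; the Trainer never calls parallelize() itself.
model = ToyParallelizableModel()
model.parallelize()
print(model.model_parallel)  # → True
```

This is exactly why self.model_parallel is always False unless the user opts in: the activation step lives outside the training loop.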

Contributor Author

stas00 commented Jan 7, 2021

The call to model.parallelize() can always be done outside of Trainer IMO, which is why the test can be changed as suggested.

It doesn't work

Wrt DataParallel. Why are we turning it on automatically in first place? Why not make it manual and call it --data_parallel

That would be a big breaking change in the API, and beginners actually want to have the parallelism work out of the box when they have several GPUs, so I don't see why change something that works.

OK, then shouldn't the flag be there, with the default set to on? Surely a user should be able to opt out of DP, and that's not possible at the moment.

Contributor Author

stas00 commented Jan 7, 2021

OK, so I did remove --model_parallel - no problem in trainer.py, since I used model.is_parallelizable and model.parallel instead - and I now understand that the point is that the user has to activate model.parallelize() themselves before passing the model to the trainer - i.e. no example scripts will support MP at the moment.

The problem is training_args.py - how do I deal with:

        if not self.model_parallel:
            train_batch_size = per_device_batch_size * max(1, self.n_gpu)
        else:
            train_batch_size = per_device_batch_size

self is args here, and there is no trainer object. Suggestions?
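One way to frame the problem: the decision only needs a boolean, not the trainer object. A sketch of the batch-size logic as a standalone helper, with the MP decision passed in explicitly (a hypothetical helper for illustration, not the fix that actually landed):

```python
def compute_train_batch_size(per_device_batch_size: int, n_gpu: int,
                             is_model_parallel: bool) -> int:
    # Under DataParallel the per-device batch is replicated on every GPU,
    # so the effective train batch scales with n_gpu; under model
    # parallelism the single model is sharded across GPUs, so one batch total.
    if is_model_parallel:
        return per_device_batch_size
    return per_device_batch_size * max(1, n_gpu)


print(compute_train_batch_size(4, 2, False))  # → 8 (DataParallel on 2 GPUs)
print(compute_train_batch_size(4, 2, True))   # → 4 (model sharded across 2 GPUs)
```

The remaining question is exactly the one raised above: where that boolean comes from when only the args object is available.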

But I guess I first need to figure out how to make MP work in the trainer at all; it doesn't look like it was ever tried or tested, as it fails for me.

Contributor Author

stas00 commented Jan 7, 2021

FWIW, --model_parallel works just fine with my Bart MP PR: #9384 (comment) in case someone needs it.

I suspect t5 MP wasn't tested/made to work with the generate tools (beam search, etc.) - edit: it works now in this PR #9323 - but it's super slow in beam search!

Contributor Author

stas00 commented Jan 7, 2021

OK, I committed the bulk of it, and @sgugger will push some magic to deal with training_args.py.

Tests should be failing, I think, until he does that.

@stas00 stas00 changed the title [trainer] fix bad rebase - dropped code [trainer] remove --model_parallel Jan 7, 2021
Contributor Author

stas00 commented Jan 7, 2021

So now I see I can jokingly blame my initial mistake on @sgugger, since he wanted it removed all along: I unconsciously did it during rebasing, and he unconsciously saw it as the right thing to do during the review ;) It's all Freud's fault anyway ;)

Contributor Author

stas00 commented Jan 7, 2021

I added a wrapper first, but it looked out of place, so I introduced and documented a new attribute: self.is_model_parallel - hope it's loud and clear.

Contributor Author

stas00 commented Jan 7, 2021

@sgugger, I must be doing something wrong - the docstring section of important attributes that I started in the model_wrapped PR gets wrapped all funny - so I tried to add bullets, and then it gets all messed up, as it bunches everything into one paragraph. If I add new lines, then make docs fails. Your magic touch is needed. Thank you.

Contributor Author

stas00 commented Jan 8, 2021

And here is why I removed init=False in a7a3921.

The tests were failing with:

TypeError: __init__() got an unexpected keyword argument '_n_gpu'

https://circle-production-customer-artifacts.s3.amazonaws.com/picard/forks/5bdabdd888af1f000130874a/278[…]cc8b6d6c390aab800d0cc1350f731a19529ac82f48

@sgugger sgugger requested a review from LysandreJik January 8, 2021 15:23
Contributor Author

stas00 commented Jan 8, 2021

Thank you for fixing the docs, @sgugger!

Member

@LysandreJik LysandreJik left a comment


Yes, LGTM!

Comment on lines +271 to +274
if hasattr(model, "is_parallelizable") and model.is_parallelizable and model.model_parallel:
self.is_model_parallel = True
else:
self.is_model_parallel = False
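As an aside, the four-line check above could be collapsed into a single boolean expression with getattr, which behaves the same way (a sketch with a hypothetical stub model, not what was merged):

```python
class _Stub:
    # Hypothetical stand-in carrying the two flags the Trainer inspects.
    is_parallelizable = True
    model_parallel = False


model = _Stub()

# getattr with a False default covers the hasattr() case; the short-circuit
# `and` only reads model.model_parallel when is_parallelizable is truthy,
# matching the original if/else exactly.
is_model_parallel = bool(
    getattr(model, "is_parallelizable", False) and model.model_parallel
)
print(is_model_parallel)  # → False (parallelize() was never called)
```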
Member


Cool!

@sgugger sgugger merged commit 33b7422 into huggingface:master Jan 11, 2021
@stas00 stas00 deleted the revert-is_parallel-check branch January 11, 2021 16:39