[OSS] Balance the trainable params only #262
Conversation
…ve little consequences

Thank you very much for fixing that, @blefaudeux
```python
_params_t = Any

class BucketFlush(Enum):
```

dead code removal
```python
if work_handle.callback is not None:
    work_handle.callback()

def _handle_trailing_buckets(self, flush_type: BucketFlush) -> None:
```

same, sorry about that
```python
param_lists[rank].append(param)
sizes[rank] += param.numel()

# We're partitioning the optimizer state,
```

This is the real change: the partitioning did not take into account whether the params are trainable, although that is what counts for the optimizer state. The HuggingFace test case was pathological in that respect: there was one big non-trainable parameter (which went to rank 0) and then a lot of cumulatively smaller trainable parameters, which all went to rank 1. This meant that the model was effectively optimized on rank 1 only, defeating the whole purpose of sharding.
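A minimal sketch of the idea described above, assuming a greedy least-loaded assignment: parameters are still handed out to ranks, but only *trainable* parameters count toward each rank's load, since only those carry optimizer state. `FakeParam` and `partition_parameters` are illustrative stand-ins, not fairscale's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class FakeParam:
    """Illustrative stand-in for a torch.nn.Parameter."""
    n: int
    requires_grad: bool = True

    def numel(self) -> int:
        return self.n

def partition_parameters(params, world_size):
    param_lists = [[] for _ in range(world_size)]
    sizes = [0] * world_size  # trainable element count per rank
    for param in params:
        # Assign to the currently least-loaded rank
        rank = sizes.index(min(sizes))
        param_lists[rank].append(param)
        # Frozen params don't add to the rank's optimizer-state load
        if param.requires_grad:
            sizes[rank] += param.numel()
    return param_lists
```

With one large frozen parameter and many small trainable ones, the trainable load now ends up balanced across ranks instead of piling onto a single rank.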
Pinging for reviews, if you don't mind, I would love master to work for HuggingFace
```python
params.append(torch.rand(size, 1))

# Make sure that the params are trainable, enforces size-based partitioning
for p in params:
```

Do we need to add a test case where some params are not trainable, too?
```python
o = optim.OSS(params, lr=0.1)
assert len(o.param_groups) == 1
o.add_param_group({"params": [torch.rand(3, 1)]})
```

@min-xu-ai this parameter is non-trainable actually, so there's a mix of both
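As a side note on the mix above: this is standard PyTorch behavior, where a plain `torch.rand` tensor is non-trainable by default, so the freshly added param group contains a frozen parameter unless `requires_grad` is enabled explicitly. A quick sketch:

```python
import torch

# Plain tensors are non-trainable by default
t = torch.rand(3, 1)
assert t.requires_grad is False

# Opting in makes the tensor trainable (and thus optimizer-state-bearing)
p = torch.rand(3, 1).requires_grad_()
assert p.requires_grad is True
```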
The CI issue seems unrelated: the pipe benchmark runs on a host with an old CUDA. I'm missing some context but could have a look later. cc @msbaines just in case
min-xu-ai left a comment

Really nice. Thanks for adding comments and tests.
@blefaudeux, when you feel all the important changes have been merged into master, please ping me so that I can re-test with transformers. Then it'd be great to make a new fairscale release on PyPI, and we can announce it as working with transformers. I have a brief doc ready to merge once the above is settled: huggingface/transformers#9208 (in case you'd like to add anything there, please don't hesitate to suggest). Then we can announce that transformers has the Sharded ZeRO features from fairscale integrated. Yay!
Hi @stas00, should be good to go! The current CircleCI issue is unrelated: the default torch install via pip is incompatible with the CUDA provided on those machines. The two PRs required for HuggingFace (which addressed genuine issues) have landed. Please keep me posted if you encounter any issues, and thanks for the great work around that! Looking forward to the announcement, I'll have a look :) (spotty availability right now, but doing my best)
That's great! I rebuilt and retested and everything looks good. Would it be possible to make a new release on PyPI first, and then we are good to announce? Thank you very much!
Before submitting
What does this PR do?
Fixes #261 and huggingface/transformers#9156. I suspect that this was also partially to blame for the iGPT training runs.
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Yes 🙃 HuggingFace / ShardedDDP / AMP works, at least on the dummy example shared.