
[OSS] Getting rid of the "should bucket" hash table, just use a list +non-trainable param fix#259

Merged
blefaudeux merged 5 commits into master from oss_model_change
Dec 19, 2020

Conversation

@blefaudeux
Contributor

@blefaudeux blefaudeux commented Dec 17, 2020

Properly handle all params, with or without requires_grad; HuggingFace usage triggered a bug which was always there. Remove the hash table used to store the bucketing strategy (a bit risky if the tensors are moved around) and replace it with a simple list. Make sure that we update the strategy if a new parameter group is added.
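The risk mentioned above can be illustrated with a small, self-contained sketch (plain Python stand-ins, not the actual fairscale code): a dict keyed on tensor objects silently loses entries when tensors are replaced, e.g. by a `.to(device)` call, whereas a positional list stays aligned with the parameter order.

```python
class FakeTensor:
    """Stand-in for torch.Tensor; like torch, .to() may return a new object."""

    def __init__(self, name, device="cpu"):
        self.name, self.device = name, device

    def to(self, device):
        return FakeTensor(self.name, device)  # a *different* object

params = [FakeTensor("weight"), FakeTensor("bias")]

# Hash-table strategy: keyed on object identity.
should_bucket = {p: True for p in params}

params = [p.to("cuda") for p in params]  # tensors moved -> new objects

# Every lookup now misses: the moved tensors are no longer the dict keys.
misses = [p for p in params if p not in should_bucket]

# List strategy: entry i corresponds to param i, whatever the object is.
should_bucket_list = [True, True]
decisions = list(zip(params, should_bucket_list))
```

Keying on position rather than object identity also makes it cheap to extend the strategy when a parameter group is appended later.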

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos or doc improvements)
  • Did you read the contributor guideline?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes huggingface/transformers#9156 for the multi-node case, i.e. with bucketing involved; tested locally by commenting out the no-bucketing statement.
Fixes #258
Makes sure that this case is caught by the unit test. I'm really surprised that this was not covered before, but I did check that master fails on it.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Well, yes 🙃 the fp16 issue is not solved though

cc @stas00

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 17, 2020
@stas00
Contributor

stas00 commented Dec 17, 2020

Thank you very much for the fix, @blefaudeux!

@blefaudeux
Copy link
Copy Markdown
Contributor Author

Ping for reviews: if you don't mind, I would love for master to work for HuggingFace.

Contributor

@min-xu-ai min-xu-ai left a comment


Looks sane to me. Again, it seems the changes are centered around having params that don't have requires_grad. Perhaps more explicit test cases would help prevent regression?

model.register_buffer("test_buffer", torch.ones((1)) * rank)  # a buffer, not a parameter: never trainable, must still be handled
model.to(device)

next(model.parameters()).requires_grad = False  # freeze one real parameter to exercise the non-trainable path
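An explicit regression test could also cover the other change from the PR summary: keeping the per-parameter bucketing flags in sync when a parameter group is added. Below is a minimal, torch-free sketch with hypothetical names (`TinySharder` mimics the optimizer's `add_param_group` convention; this is not the fairscale code).

```python
class TinySharder:
    """Toy model of a sharded optimizer that keeps one bucketing flag
    per parameter in a plain positional list (hypothetical, not fairscale)."""

    def __init__(self, param_groups):
        self.param_groups = []
        self.should_bucket = []  # one flag per param, by position
        for group in param_groups:
            self.add_param_group(group)

    def add_param_group(self, group):
        self.param_groups.append(group)
        # Extend the strategy: bucket a param only if it is trainable,
        # so frozen (requires_grad=False) params never enter a bucket.
        self.should_bucket.extend(group["requires_grad"])

# One group with a trainable and a frozen param, then a new group added.
sharder = TinySharder([{"requires_grad": [True, False]}])
sharder.add_param_group({"requires_grad": [True]})
```

The point of the sketch is the invariant: after every `add_param_group`, the flag list and the flattened parameter list have the same length, so no param can slip out of the bucketing decision.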
Contributor


add a comment?

Contributor Author

@blefaudeux blefaudeux Dec 19, 2020


Yes, I was about to write that this is the unit test change which makes sure that this is caught in the future. Will do!

@blefaudeux blefaudeux merged commit ca74ee2 into master Dec 19, 2020
@blefaudeux blefaudeux changed the title [OSS] Getting rid of the "should bucket" hash table, just use a list + robustness related fixes [OSS] Getting rid of the "should bucket" hash table, just use a list +non-trainable param fix Dec 19, 2020
@blefaudeux blefaudeux deleted the oss_model_change branch December 22, 2020 03:48

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[OSS] Params without requires_grad can slip out of the bucketing table, and break
Sharded DDP training fails with seq2seq models

4 participants