
[OSS] Getting rid of the "should bucket" hash table, just use a list +non-trainable param fix#259

Merged
blefaudeux merged 5 commits into master from oss_model_change
Dec 19, 2020

Conversation

@blefaudeux
Contributor

@blefaudeux blefaudeux commented Dec 17, 2020

Properly handle all params, with or without requires_grad; HuggingFace usage triggered a bug which was always there. Remove the hash table used to store the bucketing strategy (a bit risky if the tensors are moved around) and replace it with a simple list. Make sure that we update the strategy if a new parameter group is added.
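The risk mentioned above can be illustrated with a small, self-contained sketch (plain Python stand-ins, not the actual fairscale code): a dict keyed on tensor objects silently loses entries when tensors are replaced, e.g. by a `.to(device)` call, whereas a positional list stays aligned with the parameter order.

```python
class FakeTensor:
    """Stand-in for torch.Tensor; like torch, .to() may return a new object."""

    def __init__(self, name, device="cpu"):
        self.name, self.device = name, device

    def to(self, device):
        return FakeTensor(self.name, device)  # a *different* object

params = [FakeTensor("weight"), FakeTensor("bias")]

# Hash-table strategy: keyed on object identity.
should_bucket = {p: True for p in params}

params = [p.to("cuda") for p in params]  # tensors moved -> new objects

# Every lookup now misses: the moved tensors are no longer the dict keys.
misses = [p for p in params if p not in should_bucket]

# List strategy: entry i corresponds to param i, whatever the object is.
should_bucket_list = [True, True]
decisions = list(zip(params, should_bucket_list))
```

Keying on position rather than object identity also makes it cheap to extend the strategy when a parameter group is appended later.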

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos or doc improvements)
  • Did you read the contributor guideline?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes huggingface/transformers#9156 for the multi-node case, i.e. with bucketing involved; tested locally by commenting out the no-bucketing statement.
Fixes #258
Makes sure that this case is caught by the unit test. I'm really surprised that this was not covered before, but I did check that master fails on it.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Well, yes 🙃 the fp16 issue is not solved though

cc @stas00

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 17, 2020
@stas00
Contributor

stas00 commented Dec 17, 2020

Thank you very much for the fix, @blefaudeux!

@blefaudeux
Copy link
Copy Markdown
Contributor Author

Ping for reviews: if you don't mind, I would love for master to work for HuggingFace.

Contributor

@min-xu-ai min-xu-ai left a comment


Looks sane to me. Again, it seems the changes are centered around having params that don't have requires_grad. Perhaps more explicit test cases would help prevent regression?

model.register_buffer("test_buffer", torch.ones((1)) * rank)  # a buffer, not a parameter: never trainable, must still be handled
model.to(device)

next(model.parameters()).requires_grad = False  # freeze one real parameter to exercise the non-trainable path
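An explicit regression test could also cover the other change from the PR summary: keeping the per-parameter bucketing flags in sync when a parameter group is added. Below is a minimal, torch-free sketch with hypothetical names (`TinySharder` mimics the optimizer's `add_param_group` convention; this is not the fairscale code).

```python
class TinySharder:
    """Toy model of a sharded optimizer that keeps one bucketing flag
    per parameter in a plain positional list (hypothetical, not fairscale)."""

    def __init__(self, param_groups):
        self.param_groups = []
        self.should_bucket = []  # one flag per param, by position
        for group in param_groups:
            self.add_param_group(group)

    def add_param_group(self, group):
        self.param_groups.append(group)
        # Extend the strategy: bucket a param only if it is trainable,
        # so frozen (requires_grad=False) params never enter a bucket.
        self.should_bucket.extend(group["requires_grad"])

# One group with a trainable and a frozen param, then a new group added.
sharder = TinySharder([{"requires_grad": [True, False]}])
sharder.add_param_group({"requires_grad": [True]})
```

The point of the sketch is the invariant: after every `add_param_group`, the flag list and the flattened parameter list have the same length, so no param can slip out of the bucketing decision.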
Contributor


add a comment?

Contributor Author

@blefaudeux blefaudeux Dec 19, 2020


Yes, I was about to write that this is the unit test change which makes sure that this is caught in the future. Will do!

@blefaudeux blefaudeux merged commit ca74ee2 into master Dec 19, 2020
@blefaudeux blefaudeux changed the title [OSS] Getting rid of the "should bucket" hash table, just use a list + robustness related fixes [OSS] Getting rid of the "should bucket" hash table, just use a list +non-trainable param fix Dec 19, 2020
@blefaudeux blefaudeux deleted the oss_model_change branch December 22, 2020 03:48

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[OSS] Params without requires_grad can slip out of the bucketing table, and break
Sharded DDP training fails with seq2seq models

4 participants