
[OSS] Balance the trainable params only#262

Merged
blefaudeux merged 6 commits into master from oss_partition_huggingface
Dec 22, 2020

Conversation

@blefaudeux
Contributor

@blefaudeux blefaudeux commented Dec 17, 2020

Before submitting

  • Was this discussed/approved via a Github issue? (not needed for typos or doc improvements)
  • Did you read the contributor guideline?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #261, also huggingface/transformers#9156. I suspect that this was also partially to blame for the iGPT trainings.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Yes 🙃 HuggingFace / ShardedDDP / AMP works, at least for the dummy example shared

@facebook-github-bot facebook-github-bot added the "CLA Signed" label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Dec 17, 2020
@blefaudeux blefaudeux changed the title fix, one liner [OSS] Balance the optimizable params only, they count for optim state Dec 17, 2020
@blefaudeux blefaudeux changed the title [OSS] Balance the optimizable params only, they count for optim state [OSS] Balance the optimizable params only Dec 17, 2020
@blefaudeux blefaudeux requested a review from msbaines December 17, 2020 08:26
@stas00
Contributor

stas00 commented Dec 17, 2020

Thank you very much for fixing that, @blefaudeux

@blefaudeux blefaudeux changed the title [OSS] Balance the optimizable params only [OSS] Balance the trainable params only Dec 18, 2020
@blefaudeux blefaudeux requested a review from min-xu-ai December 18, 2020 15:43
Comment thread fairscale/optim/oss.py
_params_t = Any


class BucketFlush(Enum):
Contributor Author
dead code removal

Comment thread fairscale/optim/oss.py
if work_handle.callback is not None:
work_handle.callback()

def _handle_trailing_buckets(self, flush_type: BucketFlush) -> None:
Contributor Author
same, sorry about that

Comment thread fairscale/optim/oss.py
param_lists[rank].append(param)
sizes[rank] += param.numel()

# We're partitioning the optimizer state,
Contributor Author

@blefaudeux blefaudeux Dec 18, 2020
this is the real change: the partitioning was not taking into account whether the params are trainable, although that is what counts for the optimizer state. The Huggingface test case was pathological in this regard: there was one big non-trainable parameter (which went to rank 0) and then many cumulatively smaller trainable parameters, which all went to rank 1. This meant that the model was effectively optimized on rank 1 alone, defeating the whole purpose of sharding
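The fix described above can be sketched as a greedy partitioner that counts only trainable parameters toward each rank's load. This is a simplified illustration, not the actual fairscale implementation; `partition_trainable` is a hypothetical helper:

```python
import torch


def partition_trainable(params, world_size):
    """Greedily assign each param to the least-loaded rank, where 'load'
    counts only trainable params: frozen params carry no optimizer state,
    so they no longer skew the balance (illustrative sketch only)."""
    param_lists = [[] for _ in range(world_size)]
    sizes = [0] * world_size
    for param in params:
        rank = sizes.index(min(sizes))  # least-loaded rank so far
        param_lists[rank].append(param)
        if param.requires_grad:  # only trainable params count toward the load
            sizes[rank] += param.numel()
    return param_lists, sizes


# One big frozen param plus many small trainable ones: the frozen param
# lands on rank 0 without forcing all trainable params onto rank 1.
frozen = torch.rand(10_000, 1, requires_grad=False)
trainable = [torch.rand(100, 1, requires_grad=True) for _ in range(8)]
lists, sizes = partition_trainable([frozen] + trainable, world_size=2)
```

With the old size-based counting, the 10,000-element frozen param would have filled rank 0 and pushed every trainable param to rank 1; counting only trainable sizes splits them evenly across both ranks.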

@blefaudeux
Contributor Author

Pinging for reviews, if you don't mind; I would love for master to work for HuggingFace

@blefaudeux blefaudeux requested a review from joshim5 December 19, 2020 04:45
Comment thread tests/optim/test_oss.py Outdated
params.append(torch.rand(size, 1))

# Make sure that the params are trainable, enforces size-based partitioning
for p in params:
Contributor
Do we need to add a test case where some params are not trainable, too?

Comment thread tests/optim/test_oss.py Outdated

o = optim.OSS(params, lr=0.1)
assert len(o.param_groups) == 1
o.add_param_group({"params": [torch.rand(3, 1)]})
Contributor Author
@min-xu-ai this parameter is non-trainable actually, so there's a mix of both
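For context on the remark above: `torch.rand` creates tensors with `requires_grad=False` by default, which is why the bare tensor passed to `add_param_group` in the test counts as non-trainable while the main params are made trainable explicitly. A tiny illustrative check:

```python
import torch

# Trainable params must opt in via requires_grad; a bare torch.rand
# tensor defaults to requires_grad=False, i.e. non-trainable.
trainable = [torch.rand(4, 1).requires_grad_(True) for _ in range(3)]
frozen = torch.rand(3, 1)  # analogous to the tensor added via add_param_group

assert all(p.requires_grad for p in trainable)
assert not frozen.requires_grad
```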

@blefaudeux
Contributor Author

The CI issue seems unrelated: the pipe benchmark runs on a host with an old CUDA. I'm missing some context but could have a look later. cc @msbaines just in case

@blefaudeux blefaudeux requested a review from min-xu-ai December 21, 2020 16:40
Contributor

@min-xu-ai min-xu-ai left a comment
Really nice. Thanks for adding comments and tests.

@blefaudeux blefaudeux merged commit c386e93 into master Dec 22, 2020
@stas00
Contributor

stas00 commented Dec 22, 2020

@blefaudeux, when you feel all the important changes have been merged into master - please ping me so that I could re-test with transformers, and then it'd be great to make a new fairscale release on pypi and then we can announce it as working with transformers.

I have a brief doc ready to merge when the above has come to satisfaction huggingface/transformers#9208 (in case you'd like to add anything there please don't hesitate to suggest) and then we can make an announcement that transformers has Sharded ZeRO features from fairscale integrated. Yay!

@blefaudeux blefaudeux deleted the oss_partition_huggingface branch December 22, 2020 03:48
@blefaudeux
Contributor Author

blefaudeux commented Dec 22, 2020

@blefaudeux, when you feel all the important changes have been merged into master - please ping me so that I could re-test with transformers, and then it'd be great to make a new fairscale release on pypi and then we can announce it as working with transformers.

I have a brief doc ready to merge when the above has come to satisfaction huggingface/transformers#9208 (in case you'd like to add anything there please don't hesitate to suggest) and then we can make an announcement that transformers has Sharded ZeRO features from fairscale integrated. Yay!

Hi @stas00, should be good to go! The current CircleCI issue is unrelated (the default torch install via pip is incompatible with the CUDA provided on these machines); otherwise the two PRs required for HuggingFace (which were related to genuine issues) have landed. Please keep me posted if you encounter any issues, and thanks for the great work around that! Looking forward to the announcement, I'll have a look :) (spotty availability right now, but doing my best)

@stas00
Contributor

stas00 commented Dec 22, 2020

That's great!

I rebuilt and retested and everything looks good.

Would it be possible to make a new release on PyPI first? Then we are good to announce.

Thank you very much!


Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[OSS] Partitioning can fail if very imbalanced parameters

4 participants