
[docs] outline sharded ddp doc #9208

Merged
stas00 merged 5 commits into huggingface:master from stas00:zero-docs
Jan 6, 2021

Conversation

@stas00 (Contributor) commented Dec 19, 2020

This PR provides an initial outline of the HF Trainer integration, starting with ZeRO. Fairscale's sharded optimizer/gradient support is already in place, and DeepSpeed support is coming.

We won't merge this until fairscale has merged all the required fixes and released a new version, but I thought it'd be good to start the doc now so it's ready when fairscale is.

I hope to submit a DeepSpeed integration shortly as well, so we will extend the doc with DeepSpeed info then. Edit: see #9211.

@sgugger (Collaborator) left a comment


Thanks for drafting this!

Six resolved comment threads on docs/source/training.rst (Outdated)
stas00 and others added 2 commits December 20, 2020 10:34
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Comment thread docs/source/training.rst
One of the main benefits of enabling `--sharded_ddp` is that it uses a lot less GPU memory, so you should be able to
train with significantly larger batch sizes on the same hardware (e.g. 3x or bigger).

Eventually more parts will be supported via integrating `DeepSpeed <https://github.com/microsoft/DeepSpeed>`__.
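For context, the `--sharded_ddp` flag discussed above is passed to an example training script launched under `torch.distributed.launch`. A hypothetical invocation might look like the following; the script name, model, and all other arguments are illustrative, and only `--sharded_ddp` is the flag under discussion:

```shell
# Sketch only: launch a 2-GPU training run with fairscale's sharded DDP enabled.
# "finetune_trainer.py" and the other arguments are placeholders, not part of this PR.
python -m torch.distributed.launch --nproc_per_node=2 \
    examples/seq2seq/finetune_trainer.py \
    --model_name_or_path t5-small \
    --output_dir output_dir \
    --do_train \
    --per_device_train_batch_size 16 \
    --sharded_ddp
```

Because the memory savings come from sharding optimizer state and gradients across GPUs, the `--per_device_train_batch_size` value is typically what you would raise once the flag is enabled.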


Just to be clear, fairscale and DeepSpeed do not share code, so one is not a natural follow-up to the other (I'm in no way against that, feel free of course; it's just that somebody reading this could understand it that way). Fairscale's OSS and ShardedDDP are certainly based on the ideas of the ZeRO paper, and that credit is very valid to me; I'm not disputing it, of course.

Collaborator


We can make it clearer that they are both different implementations of the ZeRO paper.

@stas00 (Contributor, Author) commented Dec 22, 2020


@blefaudeux, we want to make sure that:

  1. both fairscale and DeepSpeed get the full awesomeness factor loud and clear - so please do make any suggestions that you see fit - both of your projects are amazing!
  2. the users have an easy way to understand when to use which and how they can evaluate pros and cons - so again any suggestions for clarification are very welcome.

The DeepSpeed integration PR #9211 is coming along nicely, so we will expand this doc section then and make it even clearer and more balanced. This was just the very basic entry point to send users to when they ask what `--sharded_ddp` is about, where to read about it, what nuances to know, etc.



3 participants