[docs] outline sharded ddp doc #9208
Conversation
sgugger
left a comment
Thanks for drafting this!
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
> One of the main benefits of enabling `--sharded_ddp` is that it uses a lot less GPU memory, so you should be able to use significantly larger batch sizes using the same hardware (e.g. 3x or bigger).
>
> Eventually more parts will be supported via integrating `DeepSpeed <https://github.com/microsoft/DeepSpeed>`__.
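To make the quoted doc line concrete, here is a minimal launch sketch. This is purely illustrative: `run_glue.py` and all hyper-parameters are placeholders I picked as an example, and only the `--sharded_ddp` flag itself comes from the doc under review.

```bash
# Illustrative sketch only: run_glue.py and the hyper-parameters below are
# placeholder assumptions, not part of the doc being reviewed.
# --sharded_ddp shards optimizer state and gradients across the GPUs,
# so the per-device batch size can often be raised (e.g. ~3x) vs. plain DDP.
python -m torch.distributed.launch --nproc_per_node 2 run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name mrpc \
  --do_train \
  --sharded_ddp \
  --per_device_train_batch_size 48 \
  --output_dir /tmp/mrpc_sharded
```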
Just to be clear, fairscale and deepspeed do not share code, so one is not a natural follow-up to the other (I'm in no way against that, feel free of course; it's just that somebody reading this could understand it that way). Fairscale's OSS & ShardedDDP are certainly based on the ZeRO paper's ideas, and the credit here is very valid to me; I'm not disputing that, of course.
We can make it clearer that they are both different implementations of the ZeRO paper.
@blefaudeux, we want to make sure that:
- both fairscale and deepspeed get the full awesomeness factor loud and clear - so please do make any suggestions that you see fit - both of your projects are amazing!
- the users have an easy way to understand when to use which and how to evaluate the pros and cons - so again, any suggestions for clarification are very welcome.
The deepspeed integration PR #9211 is coming along nicely, so we will expand this doc section and make it even more clear and balanced - this was just the very basic entry point to send users to when they ask: what `sharded_ddp` is about, where to read about it, what nuances to know, etc.
This PR provides an initial outline of the HF Trainer integration, starting with ZeRO. We already support fairscale's sharded optimizer/gradients, and deepspeed support is coming.
We won't merge this until fairscale has merged all the required fixes and released a new version, but I thought it'd be good to start the doc now so it's ready when fairscale is.
I hope to submit a deepspeed integration shortly as well, so we will extend this doc with deepspeed info then. Edit: #9211.
@sgugger