I've been using DeepSpeed successfully in my large-model training jobs. But this blog post says ZeRO-1 and ZeRO-2 power training of up to 6B and 13B parameter models respectively, without model parallelism.
Where exactly is the code that validates this claim and lets others reproduce it? I want to see what model was used, what the batch size was, what the max sequence length was, etc.
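To be concrete about what I'm asking for: below is a minimal sketch of the kind of ZeRO-2 setup I mean, with placeholder values I made up (the tiny stand-in model, batch size, and optimizer settings are NOT the ones behind the 6B/13B claim). I'd like to see the actual values used for these knobs, plus the model definition and max sequence length.

```python
# Hypothetical sketch only -- all values below are placeholders, not the
# configuration behind the blog post's 6B/13B claim.
import torch
import deepspeed

# Stand-in model; the blog does not say which architecture was used.
model = torch.nn.Linear(1024, 1024)

ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # placeholder; actual batch size unknown
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,  # ZeRO-2: partitions optimizer states and gradients
    },
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# Standard DeepSpeed entry point; returns (engine, optimizer, dataloader, lr_scheduler).
# Note: max sequence length isn't a DeepSpeed config key -- it lives in the
# model/data pipeline, which is exactly the part the blog post doesn't show.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```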