I've been using DeepSpeed successfully in my large-model training jobs. But this blog post says ZeRO-1 and ZeRO-2 power training of up to 6B and 13B parameter models respectively, without model parallelism.
Where exactly is the code that validates this claim and lets others reproduce it? I want to see what model was used, what the batch size was, what the max sequence length was, etc.
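To be concrete about what I'm asking for: below is a minimal sketch of the kind of ZeRO-2 setup I mean, with placeholder values I made up (the tiny stand-in model, batch size, and optimizer settings are NOT the ones behind the 6B/13B claim). I'd like to see the actual values used for these knobs, plus the model definition and max sequence length.

```python
# Hypothetical sketch only -- all values below are placeholders, not the
# configuration behind the blog post's 6B/13B claim.
import torch
import deepspeed

# Stand-in model; the blog does not say which architecture was used.
model = torch.nn.Linear(1024, 1024)

ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # placeholder; actual batch size unknown
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,  # ZeRO-2: partitions optimizer states and gradients
    },
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# Standard DeepSpeed entry point; returns (engine, optimizer, dataloader, lr_scheduler).
# Note: max sequence length isn't a DeepSpeed config key -- it lives in the
# model/data pipeline, which is exactly the part the blog post doesn't show.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```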