Reproduce usability numbers w/o model-parallelism #284

@g-karthik

Description

I've been using DeepSpeed successfully for my large-model training jobs. But this blog post says ZeRO-1 and ZeRO-2 enable training of up to 6B- and 13B-parameter models, respectively, without model parallelism.

Where exactly is the code that validates this claim and enables others to reproduce it? I'd like to see what model was used, what the batch size was, what the max sequence length was, etc.
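For context, the ZeRO stage in question is selected through the DeepSpeed JSON config. A minimal sketch is below; the batch size and bucket size here are placeholder values I chose for illustration, not the settings behind the blog post's claim:

```json
{
  "train_batch_size": 32,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "reduce_scatter": true,
    "allgather_bucket_size": 5e8
  }
}
```

Knowing the full config of this form (plus model architecture and sequence length) used for the 6B/13B runs would make the claim reproducible.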