ZeRO stage 1 has different convergence compared to stage 0 and 2 #757

@ykim362

Description

When I train a transformer architecture (encoder + decoder) with ZeRO stage 1, the loss curve differs from the ones I get with stage 0 and stage 2.
In contrast, the training runs with stage 0 and stage 2 look very similar to each other.

[Image: transformer-zero (loss curves)]

Legend: 23 = stage 0, 24 = stage 1, 25 = stage 2
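
For anyone trying to reproduce the comparison, here is a minimal sketch of how the three runs could be configured so that only the ZeRO stage differs. The issue does not include the actual configuration, so the batch size and fp16 setting below are placeholder assumptions, and `make_zero_config` and `differing_keys` are hypothetical helper names. Only the `zero_optimization.stage` key is taken from the DeepSpeed config format.

```python
import copy
import json

# Placeholder base settings (assumptions; the real values are not given in the issue).
BASE_CONFIG = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
}


def make_zero_config(stage):
    """Return a copy of BASE_CONFIG with ZeRO set to the given stage (0, 1 or 2)."""
    if stage not in (0, 1, 2):
        raise ValueError(f"unsupported ZeRO stage for this comparison: {stage}")
    config = copy.deepcopy(BASE_CONFIG)
    config["zero_optimization"] = {"stage": stage}
    return config


def differing_keys(a, b):
    """List the top-level keys whose values differ between two configs."""
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))


# One config per run in the comparison.
configs = {stage: make_zero_config(stage) for stage in (0, 1, 2)}

if __name__ == "__main__":
    for stage, cfg in configs.items():
        print(f"--- stage {stage} ---")
        print(json.dumps(cfg, indent=2))
```

Keeping everything else identical (including the random seed and data order, which are not shown here) makes it easier to attribute any divergence in the loss curve to the stage setting itself.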
