Reconstruction of fp32 weights on stage3 doesn't work

In #892 @stas00 proposed a new script that consolidates fp32 weights from an fp16 model checkpoint saved during stage 3 training.
Unfortunately, I have found that the t5-11b model can't be consolidated; the script fails with this error:
```
$ ./zero_to_fp32.py global_step3250/ pytorch_model1.bin
Processing zero checkpoint 'global_step3250/'
Detected checkpoint of type zero stage 3, world_size: 1
Traceback (most recent call last):
  File "./zero_to_fp32.py", line 151, in <module>
    convert_zero_chkpt_to_fp32_consolid_state_dict(args.checkpoint_dir, args.output_file)
  File "./zero_to_fp32.py", line 122, in convert_zero_chkpt_to_fp32_consolid_state_dict
    tuple(fp32_flat_groups[i].narrow(0,
  File "./zero_to_fp32.py", line 122, in <genexpr>
    tuple(fp32_flat_groups[i].narrow(0,
RuntimeError: start (32899072) + length (16777216) exceeds dimension size (32899072).
```
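
The numbers in the traceback suggest that the read offset had already reached the end of the flat fp32 buffer (start equals the dimension size) when the next parameter of 16777216 elements was requested. Below is a minimal sketch of that bounds check in plain Python. It is not the script's own code: `narrow_fits` is a hypothetical helper written only to illustrate the failing condition, and the values are copied from the error message.

```python
def narrow_fits(dim_size: int, start: int, length: int) -> bool:
    """Return True if the slice [start, start + length) fits in a dimension of dim_size.

    Hypothetical helper: mirrors the condition reported in the RuntimeError,
    it is not part of zero_to_fp32.py.
    """
    return start >= 0 and length >= 0 and start + length <= dim_size


# Values copied from the traceback.
dim_size = 32899072   # size of the flat fp32 group
start = 32899072      # offset at which the next parameter was requested
length = 16777216     # number of elements in that parameter

fits = narrow_fits(dim_size, start, length)
remaining = dim_size - start  # elements left in the flat buffer at this offset

print(f"slice fits: {fits}, elements remaining: {remaining}")
```

With these values no elements remain in the buffer, so it looks like the flat group holds fewer elements than the parameter shapes being reconstructed from it, though I have not confirmed why.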

Maybe @stas00 could say what the problem is and how it can be fixed?

