```
└──>$ ./zero_to_fp32.py global_step3250/ pytorch_model1.bin
Processing zero checkpoint 'global_step3250/'
Detected checkpoint of type zero stage 3, world_size: 1
Traceback (most recent call last):
  File "./zero_to_fp32.py", line 151, in <module>
    convert_zero_chkpt_to_fp32_consolid_state_dict(args.checkpoint_dir, args.output_file)
  File "./zero_to_fp32.py", line 122, in convert_zero_chkpt_to_fp32_consolid_state_dict
    tuple(fp32_flat_groups[i].narrow(0,
  File "./zero_to_fp32.py", line 122, in <genexpr>
    tuple(fp32_flat_groups[i].narrow(0,
RuntimeError: start (32899072) + length (16777216) exceeds dimension size (32899072).
```
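For context, `Tensor.narrow(dim, start, length)` requires that the slice `[start, start + length)` fit inside the tensor along `dim`. A minimal sketch of that bounds check (a hypothetical helper, not code from `zero_to_fp32.py`), using the numbers from the traceback:

```python
def narrow_bounds_ok(dim_size: int, start: int, length: int) -> bool:
    """Mirror the bounds condition torch.Tensor.narrow enforces:
    the requested slice [start, start + length) must fit in dim_size."""
    return 0 <= start and start + length <= dim_size

# Values from the RuntimeError above:
# start (32899072) + length (16777216) = 49676288 > dim size (32899072),
# so narrow() raises.
print(narrow_bounds_ok(32899072, 32899072, 16777216))
```

So the script is trying to read one more 16777216-element shard past the end of the flattened fp32 group, which suggests its offset bookkeeping doesn't match this checkpoint's layout.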
In #892, @stas00 proposed a new script that can consolidate fp32 weights from an fp16 model checkpoint produced by ZeRO stage 3 training.
Unfortunately, I have found that the t5-11b model can't be consolidated; it fails with the error shown above.
Maybe @stas00 could say what the problem is and how it can be fixed?