Setup: 2 local GPUs, with the same default config I use everywhere.

I have several issues with saving the model:
- the original fp32 model gets saved as an fp16 model — how do I get the fp32 model back? The user may not continue with DeepSpeed and will want to share the model, so it needs to be back in fp32. Perhaps there should be a `save_fp32_model` method, or an option in the checkpoint? And how does resume even work if the fp32 weights aren't getting saved?
- I use `deepspeed.save_checkpoint(output_dir)`. Judging by the filesystem, the first checkpoint does get saved, and then it hangs:
```
File "/mnt/nvme1/code/github/00optimize/DeepSpeed-div-by-zero/deepspeed/runtime/engine.py", line 1456 in _checkpoint_tag_validation
File "/mnt/nvme1/code/github/00optimize/DeepSpeed-div-by-zero/deepspeed/runtime/engine.py", line 1489 in save_checkpoint
File "/mnt/nvme1/code/huggingface/transformers-ds-save-model/src/transformers/trainer.py", line 1212 in _save_checkpoint
File "/mnt/nvme1/code/huggingface/transformers-ds-save-model/src/transformers/trainer.py", line 1185 in _maybe_log_save_evaluate
File "/mnt/nvme1/code/huggingface/transformers-ds-save-model/src/transformers/trainer.py", line 1094 in train
```
(Thanks to @jeffra for the tip on py-spy!)
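For context, from the traceback it looks like tag validation is an agreement check: every rank reduces its checkpoint tag to a number and the ranks compare results, so the check itself is a collective operation that all ranks must reach together. Here is a minimal plain-Python sketch of that idea — all names (`tag_hash`, `validate_tags`) are illustrative, not DeepSpeed's actual internals:

```python
# Sketch of a cross-rank tag agreement check: each rank hashes its tag,
# and the hashes are compared (what an all_reduce MIN/MAX pair would do).
import hashlib

def tag_hash(tag: str) -> int:
    # Reduce the tag to an integer so it could travel through an all_reduce.
    return int(hashlib.sha256(tag.encode()).hexdigest(), 16) % (2**31)

def validate_tags(tags_from_all_ranks, mode="fail"):
    """Return True if all ranks agree on the tag.

    mode: "warn" | "ignore" | "fail" (mirroring the tag_validation options).
    """
    hashes = [tag_hash(t) for t in tags_from_all_ranks]
    agree = min(hashes) == max(hashes)
    if not agree and mode == "fail":
        raise ValueError("checkpoint tag mismatch across ranks")
    if not agree and mode == "warn":
        print("warning: checkpoint tag mismatch across ranks")
    return agree

print(validate_tags(["global_step100", "global_step100"]))                 # True
print(validate_tags(["global_step100", "global_step101"], mode="ignore"))  # False
```

If this reading is right, the check hanging (rather than failing) would mean the ranks never all reached the collective, not that the tags disagreed.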
I obviously tried to disable the check and went on to discover the undocumented config option:

```json
"checkpoint": {
    "tag_validation": "ignore"
}
```

which I had to reverse-engineer. Perhaps it could be documented? (`warn`/`ignore`/`fail` are the 3 options.)
But I also didn't use any tags in the first place...

I also tried saving only from the rank 0 process, to no avail.

So after adding that config entry, it now gets stuck in:
```
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2427 in barrier
File "/mnt/nvme1/code/github/00optimize/DeepSpeed-div-by-zero/deepspeed/runtime/engine.py", line 1524 in _create_zero_checkpoint_files
File "/mnt/nvme1/code/github/00optimize/DeepSpeed-div-by-zero/deepspeed/runtime/engine.py", line 1496 in save_checkpoint
File "/mnt/nvme1/code/huggingface/transformers-ds-save-model/src/transformers/trainer.py", line 1212 in _save_checkpoint
File "/mnt/nvme1/code/huggingface/transformers-ds-save-model/src/transformers/trainer.py", line 1185 in _maybe_log_save_evaluate
File "/mnt/nvme1/code/huggingface/transformers-ds-save-model/src/transformers/trainer.py", line 1094 in train
```
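The stack now ends in `torch.distributed` `barrier`, which would explain the rank-0-only attempt hanging: a barrier is a collective and only releases once *every* rank reaches it, so if the other ranks never call `save_checkpoint`, the calling rank waits forever. A minimal sketch using plain threads as stand-ins for ranks (with a timeout added so the example terminates instead of actually deadlocking):

```python
# Why a rank-0-only save deadlocks at a barrier: a barrier only releases
# once *every* participant reaches it. Threads stand in for ranks here.
import threading

WORLD_SIZE = 2
barrier = threading.Barrier(WORLD_SIZE)
result = {}

def rank(rank_id: int, calls_save: bool):
    if calls_save:
        try:
            # A real torch.distributed.barrier() would block forever here;
            # the timeout is only so this sketch finishes.
            barrier.wait(timeout=0.5)
            result[rank_id] = "passed barrier"
        except threading.BrokenBarrierError:
            result[rank_id] = "deadlocked (barrier never completed)"
    else:
        result[rank_id] = "skipped save_checkpoint"

# Only "rank 0" calls save_checkpoint -> its barrier never completes.
threads = [threading.Thread(target=rank, args=(0, True)),
           threading.Thread(target=rank, args=(1, False))]
for t in threads: t.start()
for t in threads: t.join()
print(result)
```

If that is the mechanism, then `save_checkpoint` has to be called on all ranks even when only rank 0 ends up writing the model files — but that is exactly what my normal (non-rank-0-only) run does, and it hangs too.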
If I remove the `save_checkpoint` call, the program doesn't hang and completes just fine.

If I skip the intermediate checkpoints and save only once training is finished, it hangs too (i.e. on the first call).
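Coming back to the fp32 issue above: note that simply casting the saved fp16 weights back to fp32 is not a real fix, because the fp16 round-trip has already discarded mantissa bits — which is why the fp32 master weights would need to be saved explicitly. A quick stdlib-only demonstration, using `struct`'s `'e'` (IEEE 754 binary16) format to emulate an fp16 save:

```python
# Demonstrate that an fp16 round-trip loses precision permanently:
# "load the fp16 checkpoint and upcast" cannot recover the fp32 value.
import struct

def to_fp16_and_back(x: float) -> float:
    # Pack as half precision (what an fp16 checkpoint stores), then unpack.
    return struct.unpack('e', struct.pack('e', x))[0]

w = 0.1234567            # an fp32-ish master weight
w16 = to_fp16_and_back(w)
print(w16)               # ~0.12347 -- the low mantissa bits are gone
print(w16 == w)          # False: the original value is unrecoverable
```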