When I train the model with ZeRO stage 2 and overlap_comm=True on a single P4d instance, I get the following error:
10.0.89.152: File "/usr/local/lib64/python3.7/site-packages/deepspeed/runtime/engine.py", line 850, in backward
10.0.89.152: self.allreduce_gradients()
10.0.89.152: File "/usr/local/lib64/python3.7/site-packages/deepspeed/runtime/engine.py", line 770, in allreduce_gradients
10.0.89.152: self.optimizer.overlapping_partition_gradients_reduce_epilogue()
10.0.89.152: File "/usr/local/lib64/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 581, in overlapping_partition_gradients_reduce_epilogue
10.0.89.152: self.independent_gradient_partition_epilogue()
10.0.89.152: File "/usr/local/lib64/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 470, in independent_gradient_partition_epilogue
10.0.89.152: self.reduce_ipg_grads()
10.0.89.152: File "/usr/local/lib64/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 954, in reduce_ipg_grads
10.0.89.152: elements_per_buffer=self.elements_in_ipg_bucket)
10.0.89.152: File "/usr/local/lib64/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1123, in buffered_reduce_fallback
10.0.89.152: split_buckets = split_half_float_double(grads)
10.0.89.152: File "/usr/local/lib64/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 40, in split_half_float_double
10.0.89.152: bucket = [t for t in tensors if t.type() == dtype]
10.0.89.152: File "/usr/local/lib64/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 40, in <listcomp>
10.0.89.152: bucket = [t for t in tensors if t.type() == dtype]
10.0.89.152: AttributeError: 'NoneType' object has no attribute 'type'
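The last frame suggests that at least one entry in the list passed to split_half_float_double is None. The following is a minimal sketch of that failure mode only, not the DeepSpeed code itself: FakeTensor, split_by_type, and split_by_type_skip_none are hypothetical stand-ins, the type names are illustrative, and the assumption that a gradient is None is an inference from the traceback rather than something confirmed in the source.

```python
# Stand-in for a tensor: only the .type() method matters for this sketch.
class FakeTensor:
    def __init__(self, type_name):
        self._type_name = type_name

    def type(self):
        return self._type_name


def split_by_type(tensors, dtypes):
    """Group tensors by .type(), using the same comprehension as the traceback."""
    buckets = []
    for dtype in dtypes:
        bucket = [t for t in tensors if t.type() == dtype]
        if bucket:
            buckets.append(bucket)
    return buckets


def split_by_type_skip_none(tensors, dtypes):
    """Hypothetical guarded variant: drop None entries before grouping."""
    present = [t for t in tensors if t is not None]
    return split_by_type(present, dtypes)


# Illustrative type names; the None entry is the assumed missing gradient.
DTYPES = ["torch.cuda.HalfTensor", "torch.cuda.FloatTensor"]
grads = [
    FakeTensor("torch.cuda.HalfTensor"),
    None,
    FakeTensor("torch.cuda.FloatTensor"),
]

try:
    split_by_type(grads, DTYPES)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
guarded = split_by_type_skip_none(grads, DTYPES)
print(len(guarded))
```

Skipping None entries only hides the symptom in this sketch; whether a None gradient is expected at that point with overlap_comm enabled is the open question.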
The model config is:
"bert_model_config": {
"vocab_size_or_config_json_file": 32003,
"hidden_size": 1024,
"num_hidden_layers": 38,
"num_attention_heads": 16,
"intermediate_size": 4096,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"attention_probs_dropout_prob": 0.1,
"max_position_embeddings": 512,
"initializer_range": 0.02
}
If I turn off overlap_comm, it works.
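For reference, a sketch of the ZeRO fragment of a DeepSpeed config for the two cases described above, written as a Python dict. The helper name zero2_config is hypothetical, and every other setting (batch size, fp16, optimizer, and so on) is omitted because it is not given here.

```python
import json


def zero2_config(overlap_comm):
    """Return only the ZeRO section of a DeepSpeed config (other keys omitted)."""
    return {
        "zero_optimization": {
            "stage": 2,
            "overlap_comm": overlap_comm,
        }
    }


failing = zero2_config(True)   # setting under which the error above is raised
working = zero2_config(False)  # setting under which training runs

print(json.dumps(working, indent=2))
```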