AttributeError: 'NoneType' object has no attribute 'type' with overlap_comm=True and zero 2 #684

@szhengac

When I train the model with ZeRO stage 2 and overlap_comm=True on a single P4d instance, I get the following error:

10.0.89.152: File "/usr/local/lib64/python3.7/site-packages/deepspeed/runtime/engine.py", line 850, in backward
10.0.89.152: self.allreduce_gradients()
10.0.89.152: File "/usr/local/lib64/python3.7/site-packages/deepspeed/runtime/engine.py", line 770, in allreduce_gradients
10.0.89.152: self.optimizer.overlapping_partition_gradients_reduce_epilogue()
10.0.89.152: File "/usr/local/lib64/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 581, in overlapping_partition_gradients_reduce_epilogue
10.0.89.152: self.independent_gradient_partition_epilogue()
10.0.89.152: File "/usr/local/lib64/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 470, in independent_gradient_partition_epilogue
10.0.89.152: self.reduce_ipg_grads()
10.0.89.152: File "/usr/local/lib64/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 954, in reduce_ipg_grads
10.0.89.152: elements_per_buffer=self.elements_in_ipg_bucket)
10.0.89.152: File "/usr/local/lib64/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1123, in buffered_reduce_fallback
10.0.89.152: split_buckets = split_half_float_double(grads)
10.0.89.152: File "/usr/local/lib64/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 40, in split_half_float_double
10.0.89.152: bucket = [t for t in tensors if t.type() == dtype]
10.0.89.152: File "/usr/local/lib64/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 40, in <listcomp>
10.0.89.152: bucket = [t for t in tensors if t.type() == dtype]
10.0.89.152: AttributeError: 'NoneType' object has no attribute 'type'
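
For reference, the last frames of the traceback suggest that the list passed to split_half_float_double contains at least one None entry, so calling t.type() on it fails. The sketch below reproduces that failure mode in isolation. It does not use DeepSpeed or PyTorch: FakeTensor, split_by_type, and split_by_type_skip_none are hypothetical stand-ins written for illustration, and the None filter is only one possible guard, not necessarily how DeepSpeed handles this.

```python
class FakeTensor:
    """Hypothetical stand-in for a tensor; it only mimics .type()."""

    def __init__(self, type_name):
        self._type_name = type_name

    def type(self):
        return self._type_name


# Type names chosen for illustration only.
DTYPES = [
    "torch.cuda.HalfTensor",
    "torch.cuda.FloatTensor",
    "torch.cuda.DoubleTensor",
]


def split_by_type(tensors):
    """Group tensors by type using the same comprehension as in the traceback."""
    buckets = []
    for dtype in DTYPES:
        bucket = [t for t in tensors if t.type() == dtype]  # fails if t is None
        if bucket:
            buckets.append(bucket)
    return buckets


def split_by_type_skip_none(tensors):
    """Same grouping, but drop None entries first (an assumed guard)."""
    present = [t for t in tensors if t is not None]
    return split_by_type(present)


grads = [FakeTensor("torch.cuda.HalfTensor"), None]

try:
    split_by_type(grads)
except AttributeError as err:
    print(f"AttributeError: {err}")

print(len(split_by_type_skip_none(grads)))
```

Whether skipping the None entries is actually safe depends on why a gradient is missing at that point in the reduction, which this sketch does not address.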

The model config is:

    "bert_model_config": {
        "vocab_size_or_config_json_file": 32003,
        "hidden_size": 1024,
        "num_hidden_layers": 38,
        "num_attention_heads": 16,
        "intermediate_size": 4096,
        "hidden_act": "gelu",
        "hidden_dropout_prob": 0.1,
        "attention_probs_dropout_prob": 0.1,
        "max_position_embeddings": 512,
        "initializer_range": 0.02
    }

If I turn off overlap_comm, training runs without this error.
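
As a minimal sketch of that workaround, the ZeRO section of a DeepSpeed config with overlap_comm disabled could look like the following. Only the zero_optimization keys are shown; the rest of the config used for this run (batch size, optimizer, fp16 settings, and so on) is not part of this report and is omitted here, so this fragment alone is not a complete training config.

```python
import json

# Assumed fragment of a DeepSpeed config; only the ZeRO section is shown.
ds_config = {
    "zero_optimization": {
        "stage": 2,             # ZeRO stage 2, as in the failing run
        "overlap_comm": False,  # disabling this avoids the error above
    }
}

# Serialize to JSON, the format DeepSpeed config files use.
print(json.dumps(ds_config, indent=2))
```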
