So after a bit of work we finally got 1-bit Adam working over at https://github.com/EleutherAI/gpt-neox
But it doesn't seem to be compatible with Pipeline Parallelism. My hypothesis is that the `MPI.Request.Waitall` here https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/custom_collectives.py#L99 blocks until every rank reaches that point, but since the pipeline stages don't all execute concurrently, some ranks never reach it and the call errors out.
The actual stack trace I'm getting is:

```
engine = PipelineEngine(args=args,
  File "/src/deepspeed/deepspeed/runtime/pipe/engine.py", line 52, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/src/deepspeed/deepspeed/runtime/engine.py", line 174, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/src/deepspeed/deepspeed/runtime/engine.py", line 572, in _configure_optimizer
    self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
  File "/src/deepspeed/deepspeed/runtime/engine.py", line 628, in _configure_fp16_optimizer
    optimizer = FP16_Optimizer(
  File "/src/deepspeed/deepspeed/runtime/fp16/fused_optimizer.py", line 104, in __init__
    self.initialize_optimizer_states()
  File "/src/deepspeed/deepspeed/runtime/fp16/fused_optimizer.py", line 112, in initialize_optimizer_states
    self.optimizer.step()
  File "/src/deepspeed/deepspeed/runtime/fp16/onebit_adam.py", line 340, in step
    self.Compressed_Allreduce(exp_avg,
  File "/src/deepspeed/deepspeed/runtime/fp16/onebit_adam.py", line 157, in Compressed_Allreduce
    cupy_sign_list_packed, cupy_recvbuf_sign, cupy_worker_scale, cupy_recvbuf_scale = gather_host(rank,
  File "/src/deepspeed/deepspeed/runtime/custom_collectives.py", line 99, in gather_host
    MPI.Request.Waitall(requests)
  File "mpi4py/MPI/Request.pyx", line 124, in mpi4py.MPI.Request.Waitall
mpi4py.MPI.Exception: MPI_ERR_IN_STATUS: error code in status
```
To test my hypothesis I ran a pipeline model with a single stage (so no actual parallelism, but still using the `PipelineModule` / `PipelineEngine` classes), and that works fine.
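To make the hypothesized failure mode concrete, here's a toy sketch using Python threads and a `threading.Barrier` as a stand-in for the blocking collective (this is not mpi4py code, just an analogy): a blocking collective only completes once every participant reaches it, so if one "stage" never calls in, the others block until the operation breaks with an error.

```python
import threading

# Barrier standing in for MPI.Request.Waitall: it only completes
# once both "ranks" have reached it.
barrier = threading.Barrier(parties=2)
results = {}

def stage(rank, reaches_collective):
    if not reaches_collective:
        # This "pipeline stage" never enters the collective,
        # mirroring a stage that isn't scheduled concurrently.
        return
    try:
        barrier.wait(timeout=0.5)
        results[rank] = "ok"
    except threading.BrokenBarrierError:
        # The wait times out because the other rank never arrived,
        # analogous to the MPI_ERR_IN_STATUS failure above.
        results[rank] = "stuck"

t0 = threading.Thread(target=stage, args=(0, True))
t1 = threading.Thread(target=stage, args=(1, False))
t0.start(); t1.start()
t0.join(); t1.join()
print(results)  # rank 0 ends up "stuck"
```

If something like this is what's happening, the fix would presumably need every pipeline rank to either participate in the collective or be excluded from the communicator.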
I wonder if this is something you've encountered, or potentially something that's fixed by #817?
@conglongli?