OneBitAdam Incompatible with Pipeline Parallelism #818

@sdtblck

So after a bit of work we finally got 1-bit Adam working over at https://github.com/EleutherAI/gpt-neox

But it doesn't seem to be compatible with Pipeline Parallelism. My hypothesis is that the Waitall here https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/custom_collectives.py#L99 waits for every rank to reach that point, but since the pipeline stages don't all execute concurrently, some ranks never reach it, and the call errors out.
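To make the hypothesis concrete, here is a minimal sketch of the suspected failure mode. It is only an analogy, not the DeepSpeed or mpi4py code path: threads stand in for ranks, and `threading.Barrier` stands in for a collective that every rank must reach. The names `run_ranks` and `participating` are made up for illustration.

```python
import threading


def run_ranks(world_size, participating, timeout=0.5):
    """Simulate `world_size` ranks; only ranks in `participating` reach the collective.

    Returns a dict mapping rank -> "ok", "error", or "skipped".
    """
    # The barrier expects every rank in the world, like a world-sized collective.
    barrier = threading.Barrier(world_size)
    results = {}

    def rank_fn(rank):
        if rank not in participating:
            # This rank is busy elsewhere (e.g. another pipeline stage)
            # and never enters the collective.
            results[rank] = "skipped"
            return
        try:
            barrier.wait(timeout=timeout)
            results[rank] = "ok"
        except threading.BrokenBarrierError:
            # Raised when the wait times out because some ranks never arrived.
            results[rank] = "error"

    threads = [threading.Thread(target=rank_fn, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_ranks(4, {0, 1, 2, 3}))  # every rank reaches the collective
    print(run_ranks(4, {0, 1}))        # two ranks never reach it
```

If the hypothesis holds, the second call is the pipeline-parallel case: the ranks that do reach the collective fail because the others never show up. The first call is the case where every rank takes part.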

The actual error / stack trace I'm getting is:

    engine = PipelineEngine(args=args,
  File "/src/deepspeed/deepspeed/runtime/pipe/engine.py", line 52, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/src/deepspeed/deepspeed/runtime/engine.py", line 174, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/src/deepspeed/deepspeed/runtime/engine.py", line 572, in _configure_optimizer
    self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
  File "/src/deepspeed/deepspeed/runtime/engine.py", line 628, in _configure_fp16_optimizer
    optimizer = FP16_Optimizer(
  File "/src/deepspeed/deepspeed/runtime/fp16/fused_optimizer.py", line 104, in __init__
    self.initialize_optimizer_states()
  File "/src/deepspeed/deepspeed/runtime/fp16/fused_optimizer.py", line 112, in initialize_optimizer_states
    self.optimizer.step()
  File "/src/deepspeed/deepspeed/runtime/fp16/onebit_adam.py", line 340, in step
    self.Compressed_Allreduce(exp_avg,
  File "/src/deepspeed/deepspeed/runtime/fp16/onebit_adam.py", line 157, in Compressed_Allreduce
    cupy_sign_list_packed, cupy_recvbuf_sign, cupy_worker_scale, cupy_recvbuf_scale = gather_host(rank,
  File "/src/deepspeed/deepspeed/runtime/custom_collectives.py", line 99, in gather_host
    MPI.Request.Waitall(requests)
  File "mpi4py/MPI/Request.pyx", line 124, in mpi4py.MPI.Request.Waitall
mpi4py.MPI.Exception: MPI_ERR_IN_STATUS: error code in status

To test my hypothesis I ran a pipeline model with a single stage (so no actual parallelism, but still using the Pipeline Module / Pipeline Engine classes), and this works fine.

I wonder if this is something you've encountered, or potentially something that's fixed by #817?

@conglongli ?