So after a bit of work we finally got 1-bit Adam working over at https://github.com/EleutherAI/gpt-neox
But it doesn't seem to be compatible with Pipeline Parallelism. My hypothesis is that the `MPI.Request.Waitall` here https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/custom_collectives.py#L99 blocks until every rank reaches that point, but since the pipeline stages don't all execute concurrently, some ranks never reach it and the call errors out.
The actual stack trace I'm getting is:

```
engine = PipelineEngine(args=args,
  File "/src/deepspeed/deepspeed/runtime/pipe/engine.py", line 52, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/src/deepspeed/deepspeed/runtime/engine.py", line 174, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/src/deepspeed/deepspeed/runtime/engine.py", line 572, in _configure_optimizer
    self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
  File "/src/deepspeed/deepspeed/runtime/engine.py", line 628, in _configure_fp16_optimizer
    optimizer = FP16_Optimizer(
  File "/src/deepspeed/deepspeed/runtime/fp16/fused_optimizer.py", line 104, in __init__
    self.initialize_optimizer_states()
  File "/src/deepspeed/deepspeed/runtime/fp16/fused_optimizer.py", line 112, in initialize_optimizer_states
    self.optimizer.step()
  File "/src/deepspeed/deepspeed/runtime/fp16/onebit_adam.py", line 340, in step
    self.Compressed_Allreduce(exp_avg,
  File "/src/deepspeed/deepspeed/runtime/fp16/onebit_adam.py", line 157, in Compressed_Allreduce
    cupy_sign_list_packed, cupy_recvbuf_sign, cupy_worker_scale, cupy_recvbuf_scale = gather_host(rank,
  File "/src/deepspeed/deepspeed/runtime/custom_collectives.py", line 99, in gather_host
    MPI.Request.Waitall(requests)
  File "mpi4py/MPI/Request.pyx", line 124, in mpi4py.MPI.Request.Waitall
mpi4py.MPI.Exception: MPI_ERR_IN_STATUS: error code in status
```
To test my hypothesis I ran a pipeline model with a single stage (so no actual parallelism, but still using the `PipelineModule` / `PipelineEngine` classes), and that works fine.
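To make the hypothesized failure mode concrete, here's a toy sketch using Python threads and a `threading.Barrier` as a stand-in for the blocking collective (this is not mpi4py code, just an analogy): a blocking collective only completes once every participant reaches it, so if one "stage" never calls in, the others block until the operation breaks with an error.

```python
import threading

# Barrier standing in for MPI.Request.Waitall: it only completes
# once both "ranks" have reached it.
barrier = threading.Barrier(parties=2)
results = {}

def stage(rank, reaches_collective):
    if not reaches_collective:
        # This "pipeline stage" never enters the collective,
        # mirroring a stage that isn't scheduled concurrently.
        return
    try:
        barrier.wait(timeout=0.5)
        results[rank] = "ok"
    except threading.BrokenBarrierError:
        # The wait times out because the other rank never arrived,
        # analogous to the MPI_ERR_IN_STATUS failure above.
        results[rank] = "stuck"

t0 = threading.Thread(target=stage, args=(0, True))
t1 = threading.Thread(target=stage, args=(1, False))
t0.start(); t1.start()
t0.join(); t1.join()
print(results)  # rank 0 ends up "stuck"
```

If something like this is what's happening, the fix would presumably need every pipeline rank to either participate in the collective or be excluded from the communicator.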
I wonder if this is something you've encountered, or potentially something that's fixed by #817?
@conglongli?