The docs mention in several places that DeepSpeed can be used with a model that implements its own MP - my guess is that this refers to vertical model slicing as taught in the PyTorch tutorial, that is, groups of layers spread across several GPUs.
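To make sure we are talking about the same thing, here is that pattern in a nutshell, following the toy example from the tutorial (device ids hardcoded for brevity):

```python
import torch.nn as nn

class ToyVerticalMP(nn.Module):
    """Naive vertical MP: the first half of the layers lives on cuda:0,
    the second half on cuda:1."""
    def __init__(self):
        super().__init__()
        self.seq1 = nn.Sequential(nn.Linear(10, 10), nn.ReLU()).to("cuda:0")
        self.seq2 = nn.Sequential(nn.Linear(10, 10), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        # the model manages devices itself, copying activations across GPUs
        x = self.seq1(x.to("cuda:0"))
        return self.seq2(x.to("cuda:1"))
```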
OK, we have implemented vertical-slicing MP for t5, gpt2 and bart in transformers, using a naive slow approach, but I have no idea how to bolt DeepSpeed on top of such a model that spans several GPUs. The typical problem is a conflict between the model doing its own device management - switching its data and inputs to different devices - and any external solution. e.g. I can't do DP or DDP over 2 GPUs which are also being used for vertical MP.
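If I understand DDP's constraints correctly, the clash looks roughly like this (reusing the ToyVerticalMP sketch from above):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

model = ToyVerticalMP()  # already occupies both cuda:0 and cuda:1

# DDP's single-device mode requires all parameters on one device,
# so this raises an error for a model spread over 2 GPUs:
#   ddp = DDP(model, device_ids=[0])
#
# DDP's multi-device mode (device_ids=None) is allowed, but it places one
# replica per process across the module's devices - with only 2 GPUs, both
# already consumed by MP, there are no free devices left for a second
# data-parallel replica:
ddp = DDP(model)  # only helps if each rank owns its own separate GPU pair
```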
I do find it unclear, though, that in some places the documentation talks about allowing one's own MP with no explanation or examples of how exactly that would work. I think it suggests looking at Megatron-LM for such an example, but that's a complex project. What would have helped a lot is a simple example of how to bolt DeepSpeed onto an MP-enabled model with perhaps just a few layers, such as the simple toy MP model from the same PyTorch tutorial.
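For the record, here is what I imagine such a minimal example might look like. Note that ToyMPU is my assumption of the interface that deepspeed.initialize() wants for its mpu argument, deduced from the Megatron-LM integration, and I'm not even sure whether DeepSpeed expects the MP ranks to be separate processes - which is part of my confusion:

```python
import argparse
import torch.distributed as dist
import deepspeed

class ToyMPU:
    """My guess at the "model parallelism unit" object that the mpu argument
    of deepspeed.initialize() expects - method names modeled on Megatron-LM.
    Degenerate topology: all ranks form one model-parallel group, DP = 1."""
    def __init__(self):
        world = list(range(dist.get_world_size()))
        self._mp_group = dist.new_group(world)
        # new_group is a collective: every rank must create every group,
        # keeping only the one it belongs to
        for r in world:
            group = dist.new_group([r])
            if r == dist.get_rank():
                self._dp_group = group

    def get_model_parallel_rank(self): return dist.get_rank()
    def get_model_parallel_world_size(self): return dist.get_world_size()
    def get_model_parallel_group(self): return self._mp_group
    def get_data_parallel_rank(self): return 0
    def get_data_parallel_world_size(self): return 1
    def get_data_parallel_group(self): return self._dp_group

parser = deepspeed.add_config_arguments(argparse.ArgumentParser())
cmd_args = parser.parse_args()

dist.init_process_group("nccl")  # assumes the usual launcher-set env vars
model = ToyVerticalMP()          # the toy 2-GPU model from above
engine, optimizer, _, _ = deepspeed.initialize(
    args=cmd_args,
    model=model,
    model_parameters=model.parameters(),
    mpu=ToyMPU(),  # hand DeepSpeed our custom topology
)
```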
Thank you!
And I also understand that the result of combining one's own vertical MP with DeepSpeed might actually be worse than just using ZeRO partitioning, and the whole setup will probably be much more complicated, but it's hard to appreciate or compare things when there is a barrier to trying the various options.
My gut feeling, which is not yet supported by enough experience, tells me that DeepSpeed's solution to the memory issue completely eliminates the need for vertical MP (almost definitely once stage 3 is implemented). But since our initiative to implement MP in transformers is sort of incomplete - the naive MP doesn't scale - I'm trying to either find a way to make it efficient or perhaps remove it altogether, if DeepSpeed is all one needs in all possible scenarios.
I'm trying to figure out if bolting just PP onto the naive MP will do the trick. But it's tricky, since one needs to "sequentialize" the model's stack for PP to work.
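Concretely, I mean something like this sketch with torch's experimental Pipe - the toy stages flatten trivially, whereas our real models pass around attention masks and multiple tensors, which is exactly what makes the sequentialization tricky:

```python
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe wants the RPC framework initialized, even for a single process
rpc.init_rpc("worker", rank=0, world_size=1)

# the "sequentialization" step: the stack has to be flattened into an
# nn.Sequential, with each segment pre-placed on its device
stage1 = nn.Sequential(nn.Linear(10, 10), nn.ReLU()).to("cuda:0")
stage2 = nn.Sequential(nn.Linear(10, 10), nn.ReLU()).to("cuda:1")
pipe = Pipe(nn.Sequential(stage1, stage2), chunks=8)

# chunks=8 splits each batch into micro-batches so both GPUs work
# concurrently instead of idling in turn, as they do with naive MP
out = pipe(torch.randn(32, 10, device="cuda:0")).local_value()
```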