Good morning,
I'm opening this issue because I have a question about multi-GPU memory usage.
I successfully trained my 24-layer Reformer, with 768 units per layer, using DeepSpeed on 2xP100.
I tried the same setup on 8xV100, and GPU 0 crashes with an out-of-memory (OOM) error.
My understanding is that GPU 0 has to aggregate the results computed by the other GPUs (acting as a parameter server, I guess), which means that, in addition to the model and its own batches, GPU 0 also has to hold the parameter data to aggregate in memory.
Is that correct? Is it possible to mitigate this effect somehow?
The only way I found to make the 8xV100 setup work was to reduce the units per layer from 768 to 512, which is counterintuitive: with 8xV100 I would expect to train the same architecture as on 2xP100, just roughly four times faster.
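For reference, my DeepSpeed config looks roughly like the sketch below (the specific values here are illustrative, not my exact settings; I mention them only so you can see which features I have enabled):

```json
{
  "train_batch_size": 16,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 0
  }
}
```

I have not experimented with the `zero_optimization` stages, so if partitioning the optimizer state or gradients across GPUs is the intended way to avoid this imbalance on GPU 0, a pointer in that direction would already help.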
Can you please shed some light on this?
Thank you very much in advance for your time!
Cal