Good morning,
I'm opening this issue because I have a question about multi-GPU memory usage.
I successfully trained my 24-layer Reformer, with 768 units per layer, using DeepSpeed on 2xP100.
I tried the same setup on 8xV100, and GPU 0 crashes with an out-of-memory (OOM) error.
My understanding is that GPU 0 has to aggregate the results computed by the other GPUs (acting as a parameter server, I guess), which means that, in addition to the model and its own batches, GPU 0 also has to hold the parameter data to aggregate in memory.
Is that correct? Is it possible to mitigate this effect somehow?
The only way I found to make the 8xV100 setup work was to reduce the units per layer from 768 to 512, which is counterintuitive: with 8xV100 I would expect to train the same architecture as on 2xP100, just roughly four times faster.
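For reference, my DeepSpeed config looks roughly like the sketch below (the specific values here are illustrative, not my exact settings; I mention them only so you can see which features I have enabled):

```json
{
  "train_batch_size": 16,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 0
  }
}
```

I have not experimented with the `zero_optimization` stages, so if partitioning the optimizer state or gradients across GPUs is the intended way to avoid this imbalance on GPU 0, a pointer in that direction would already help.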
Can you please shed some light on this?
Thank you very much in advance for your time!
Cal