DeepSpeed GPU 0 parameter server OOM with 8xV100 #154

@CalogeroZarbo

Description

Good morning,
I'm opening this issue because I have a question.

I successfully trained my 24-layer Reformer, with 768 nodes per layer, using DeepSpeed on 2xP100.
I tried the same setup on 8xV100, and GPU 0 crashes with an OOM error.
My understanding is that GPU 0 has to handle the results computed by the other GPUs as well (acting as a parameter server, I guess), so in addition to the model and its own batches, GPU 0 also keeps the aggregated parameter data in memory.
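For context, here is a rough back-of-the-envelope estimate of the memory involved (the layer count and width are from my setup above; the per-layer parameter formula is the standard transformer-style approximation of 12 * d_model^2, not an exact Reformer count):

```python
# Rough memory estimate for a 24-layer model with d_model = 768.
# 12 * d_model^2 per layer is the usual transformer approximation
# (attention projections + feed-forward); Reformer's exact count differs.
layers, d_model = 24, 768
params = 12 * d_model * d_model * layers      # ~170M parameters
weight_bytes = params * 4                     # fp32 weights
adam_bytes = params * 8                       # Adam m and v states, fp32
total_gb = (weight_bytes + adam_bytes) / 1e9
print(f"params: {params / 1e6:.0f}M, weights + optimizer states: {total_gb:.1f} GB")
```

Even a couple of GB of replicated weights and optimizer states matters once one rank additionally holds aggregation buffers for the other seven.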

Is that correct? If so, is there a way to mitigate this effect?
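One thing I came across (this is my assumption; I haven't verified it solves this) is DeepSpeed's ZeRO optimizer-state partitioning, enabled in the DeepSpeed JSON config (the batch size here is just a placeholder):

```json
{
  "train_batch_size": 8,
  "zero_optimization": {
    "stage": 2
  }
}
```

As I understand it, stage 2 partitions the optimizer states and gradients across the data-parallel ranks instead of replicating them, which sounds like exactly the kind of rank-0 memory pressure I'm describing. Would that be the right direction?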

The only way I found to make 8xV100 work was to reduce the nodes from 768 to 512, which is counterintuitive: with 8xV100 I should be able to train the same architecture as on 2xP100, just four times faster.

Can you please shed some light on this?

Thank you very much in advance for your time!
Cal
