Conversation


@stas00 stas00 commented Apr 14, 2021

This PR extends launcher/runner.py to respect CUDA_VISIBLE_DEVICES on a single node when no explicit resource filters are given.

Until now, running on a specific GPU (e.g. gpu1) required:

deepspeed --include localhost:1 

After this PR one can do:

CUDA_VISIBLE_DEVICES=1 deepspeed 

which is how most of the PyTorch ecosystem works on a single node.

This is important for cases where the launcher command line is not exposed to the user, for example pytest tests. One can now run tests on specific devices with:

CUDA_VISIBLE_DEVICES=1 pytest ....

This support is limited to a single node with no explicit num_gpus or include/exclude rules specified.
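The gating described above can be sketched as follows. This is a hypothetical helper for illustration only; the function and argument names do not match DeepSpeed's actual runner.py internals.

```python
import os

def cuda_visible_devices_to_include(args_include=None, args_exclude=None,
                                    args_num_gpus=-1, multi_node=False):
    """Sketch: translate CUDA_VISIBLE_DEVICES into an --include-style
    filter, but only on a single node with no explicit resource filters.

    Hypothetical names; not DeepSpeed's real implementation.
    """
    cvd = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    # Leave any user-supplied filters untouched; only fall back to the
    # environment variable when nothing else constrains the resources.
    if multi_node or args_include or args_exclude or args_num_gpus != -1 or not cvd:
        return args_include
    # CUDA_VISIBLE_DEVICES=1 becomes the equivalent of --include localhost:1
    return f"localhost:{cvd}"
```

With `CUDA_VISIBLE_DEVICES=1` and no flags, this yields `localhost:1`; any explicit flag or a multi-node setup wins over the environment variable.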

@jeffra


jeffra commented Apr 27, 2021

@stas00, one issue I realized with this: I think we also want to guard against the case where the user's hostfile contains more than one node, since it's not clear what the CUDA_VISIBLE_DEVICES variable should mean there.

https://github.com/microsoft/DeepSpeed/blob/14a50c68c33968f9cd7fefd29e8fe72d4be200e8/deepspeed/launcher/runner.py#L312

Essentially, if multi_node_exec is True we will be running on remote nodes, where it's not clear what to do with the local CUDA_VISIBLE_DEVICES setting. Can we copy the ignore print down to this multi-node case?


stas00 commented Apr 27, 2021

Shouldn't the rest of the existing launcher code already detect that condition and bail out?

After all, this PR does a very simple thing: instead of having the user pass --include=localhost:1, the user can now set CUDA_VISIBLE_DEVICES=1, which makes the two ways identical. It unsets CUDA_VISIBLE_DEVICES immediately after that.

So if you had a situation where a user passed --include=localhost:1 with multi_node_exec == True and the launcher didn't assert, then that's a problem with the existing code, no?
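The equivalence being argued here, including the immediate unset so the variable does not leak further, can be sketched like this (hypothetical names, not the actual runner.py code):

```python
import os

def apply_cuda_visible_devices(resource_filters):
    """Sketch: map CUDA_VISIBLE_DEVICES to the equivalent
    --include=localhost:<ids> filter, then unset the variable so the
    rest of the launcher sees exactly what it would have seen had the
    user passed --include directly.

    Hypothetical helper for illustration only.
    """
    # pop() both reads and removes the variable in one step
    cvd = os.environ.pop("CUDA_VISIBLE_DEVICES", None)
    if cvd:
        resource_filters["include"] = f"localhost:{cvd}"
    return resource_filters
```

After this runs, the downstream code path is identical to the explicit --include case, so any existing multi-node guard should still fire.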


stas00 commented Jun 4, 2021

@jeffra, do you have some resources to give feedback so that this feature can be completed? Thank you!


@jeffra jeffra left a comment


Pushed a fix to support multi-node. If you don't define a valid hostfile, then you're running on one node and we respect CUDA_VISIBLE_DEVICES values. If we have a hostfile, then we assume the user uses our flags to control gpu/node counts.
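The merged decision logic reads as a simple predicate. The sketch below uses hypothetical names and a hypothetical default hostfile path; it is not the actual runner.py code.

```python
import os

def should_respect_cuda_visible_devices(hostfile_path="/job/hostfile"):
    """Sketch of the merged behavior: no valid hostfile means we are on
    a single node, so CUDA_VISIBLE_DEVICES is honored; with a hostfile,
    gpu/node counts come from the launcher flags instead.

    Hypothetical helper for illustration only.
    """
    single_node = not os.path.isfile(hostfile_path)
    return single_node and bool(os.environ.get("CUDA_VISIBLE_DEVICES"))
```

This keeps the single-node convenience while leaving multi-node resource selection entirely to the explicit flags.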

@jeffra jeffra enabled auto-merge (squash) November 17, 2021 18:08
@jeffra jeffra disabled auto-merge November 17, 2021 20:21
@jeffra jeffra merged commit e3c2d7b into deepspeedai:master Nov 17, 2021
@stas00 stas00 deleted the respect-CUDA_VISIBLE_DEVICES branch November 17, 2021 20:31