Conversation


@stas00 stas00 commented Apr 14, 2021

This PR extends launcher/runner.py to respect CUDA_VISIBLE_DEVICES on a single node when no explicit resource filters are given.

Until now, running on a specific GPU (e.g. gpu1) required:

deepspeed --include localhost:1 

After this PR one can do:

CUDA_VISIBLE_DEVICES=1 deepspeed 

which is how most of the PyTorch ecosystem works on a single node.

This is important for cases where the launcher command line is not exposed to the user, for example pytest tests. One can now run tests on specific devices with:

CUDA_VISIBLE_DEVICES=1 pytest ....

This support is limited to a single node with no explicit num_gpus or include/exclude rules specified.
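The gating described above can be sketched as follows. This is a hypothetical helper for illustration only; the function and argument names do not match DeepSpeed's actual runner.py internals.

```python
import os

def cuda_visible_devices_to_include(args_include=None, args_exclude=None,
                                    args_num_gpus=-1, multi_node=False):
    """Sketch: translate CUDA_VISIBLE_DEVICES into an --include-style
    filter, but only on a single node with no explicit resource filters.

    Hypothetical names; not DeepSpeed's real implementation.
    """
    cvd = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    # Leave any user-supplied filters untouched; only fall back to the
    # environment variable when nothing else constrains the resources.
    if multi_node or args_include or args_exclude or args_num_gpus != -1 or not cvd:
        return args_include
    # CUDA_VISIBLE_DEVICES=1 becomes the equivalent of --include localhost:1
    return f"localhost:{cvd}"
```

With `CUDA_VISIBLE_DEVICES=1` and no flags, this yields `localhost:1`; any explicit flag or a multi-node setup wins over the environment variable.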

@jeffra


jeffra commented Apr 27, 2021

@stas00, one issue I realized with this: I think we also want to guard against the case where the user's hostfile contains more than one node, since it's not clear what the CUDA_VISIBLE_DEVICES variable should mean there.

https://github.com/microsoft/DeepSpeed/blob/14a50c68c33968f9cd7fefd29e8fe72d4be200e8/deepspeed/launcher/runner.py#L312

Essentially, if multi_node_exec is True we will be running on remote nodes, where it's not clear what to do with the local CUDA_VISIBLE_DEVICES setting. Can we copy the ignore print down to this multi-node case?


stas00 commented Apr 27, 2021

Shouldn't the rest of the existing launcher code already detect that condition and bail out?

After all, this PR does a very simple thing: instead of having the user pass --include=localhost:1, the user can now set CUDA_VISIBLE_DEVICES=1, which makes the two ways identical. It unsets CUDA_VISIBLE_DEVICES immediately after that.

So if you had a situation where a user passed --include=localhost:1 with multi_node_exec == True and the launcher didn't assert, then that's a problem with the existing code, no?
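The equivalence being argued here, including the immediate unset so the variable does not leak further, can be sketched like this (hypothetical names, not the actual runner.py code):

```python
import os

def apply_cuda_visible_devices(resource_filters):
    """Sketch: map CUDA_VISIBLE_DEVICES to the equivalent
    --include=localhost:<ids> filter, then unset the variable so the
    rest of the launcher sees exactly what it would have seen had the
    user passed --include directly.

    Hypothetical helper for illustration only.
    """
    # pop() both reads and removes the variable in one step
    cvd = os.environ.pop("CUDA_VISIBLE_DEVICES", None)
    if cvd:
        resource_filters["include"] = f"localhost:{cvd}"
    return resource_filters
```

After this runs, the downstream code path is identical to the explicit --include case, so any existing multi-node guard should still fire.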


stas00 commented Jun 4, 2021

@jeffra, do you have some resources to give feedback so that this feature can be completed? Thank you!


@jeffra jeffra left a comment


Pushed a fix to support multi-node. If you don't define a valid hostfile, then you're running on one node and we respect CUDA_VISIBLE_DEVICES values. If we have a hostfile, then we assume the user uses our flags to control gpu/node counts.
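The merged decision logic reads as a simple predicate. The sketch below uses hypothetical names and a hypothetical default hostfile path; it is not the actual runner.py code.

```python
import os

def should_respect_cuda_visible_devices(hostfile_path="/job/hostfile"):
    """Sketch of the merged behavior: no valid hostfile means we are on
    a single node, so CUDA_VISIBLE_DEVICES is honored; with a hostfile,
    gpu/node counts come from the launcher flags instead.

    Hypothetical helper for illustration only.
    """
    single_node = not os.path.isfile(hostfile_path)
    return single_node and bool(os.environ.get("CUDA_VISIBLE_DEVICES"))
```

This keeps the single-node convenience while leaving multi-node resource selection entirely to the explicit flags.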

@jeffra jeffra enabled auto-merge (squash) November 17, 2021 18:08
@jeffra jeffra disabled auto-merge November 17, 2021 20:21
@jeffra jeffra merged commit e3c2d7b into deepspeedai:master Nov 17, 2021
@stas00 stas00 deleted the respect-CUDA_VISIBLE_DEVICES branch November 17, 2021 20:31