[launcher/runner] respect CUDA_VISIBLE_DEVICES for a single node #960
@stas00, one issue I realized with this. I think we also want to guard against the case where the user's hostfile contains more than 1 node, since it's then not clear what the CUDA_VISIBLE_DEVICES variable should be. Essentially if
Shouldn't the rest of the existing launcher code already detect that condition and bail out? After all, this PR does a very simple thing, instead of having a user pass: So if you had a situation where a user passed
@jeffra, do you have some resources to give feedback so that this feature can be completed? Thank you!
jeffra left a comment
Pushed a fix to support multi-node. If you don't define a valid hostfile, then you're running on 1 node and we respect the CUDA_VISIBLE_DEVICES value. If we have a hostfile, then we assume the user uses our flags to control GPU/node counts.
This PR extends launcher/runner.py to respect CUDA_VISIBLE_DEVICES for a single node and no explicit resource filters.
Until now, running on a specific GPU (e.g. gpu1) required:
After this PR one can do:
which is how most of the PyTorch ecosystem works on a single node.
This is important for cases where the launcher command line is not exposed to the user, for example pytest tests, so one can now run tests just on specific devices, with:
This support is limited to a single node and no explicit num_gpus or include/exclude rules specified.
@jeffra
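For readers skimming the thread, the rule discussed here (respect CUDA_VISIBLE_DEVICES only when there is no valid hostfile and no explicit num_gpus or include/exclude filter) can be sketched roughly as below. This is an illustration only, not the code in launcher/runner.py: the function name, its parameters, and the use of None for "flag not given" are all hypothetical, and it assumes the variable holds comma-separated integer indices rather than GPU UUIDs.

```python
import os
from typing import List, Mapping, Optional


def devices_from_env(
    hostfile_resources: Optional[Mapping[str, int]],
    num_gpus: Optional[int] = None,
    include: str = "",
    exclude: str = "",
    env: Optional[Mapping[str, str]] = None,
) -> Optional[List[int]]:
    """Hypothetical helper: return the GPU ids listed in CUDA_VISIBLE_DEVICES,
    or None when the variable should be ignored."""
    env = os.environ if env is None else env

    # A valid hostfile means the launcher flags control GPU/node counts.
    if hostfile_resources:
        return None

    # Explicit resource filters take precedence over the environment variable.
    if num_gpus is not None or include or exclude:
        return None

    raw = env.get("CUDA_VISIBLE_DEVICES", "").strip()
    if not raw:
        return None

    # Assumption: plain integer indices such as "0,2"; UUID-style entries
    # would make int() raise ValueError.
    return [int(tok) for tok in raw.split(",") if tok.strip()]
```

With no hostfile and no filters, `devices_from_env(None, env={"CUDA_VISIBLE_DEVICES": "1"})` yields `[1]`; passing a non-empty hostfile mapping or any of the filter arguments makes it return `None`, leaving the decision to the launcher flags.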