When running something like `deepspeed --include worker-1 test.py` from worker-0, we currently run test.py only on worker-0 rather than on worker-1. This is due to this line of code in our runner:
https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/launcher/runner.py#L307
We currently skip the pdsh launch whenever the number of workers is 1. However, we do not check that this single worker is the local worker from which the deepspeed launcher is being invoked, so a job targeted at one remote worker is run locally instead.
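
One possible shape for the missing check is sketched below. This is not the actual runner code: `needs_remote_launch` and its parameters are hypothetical names used only for illustration. It also assumes that the worker names from the hostfile can be compared directly with the local hostname, which may not hold if the hostfile uses aliases or IP addresses.

```python
import socket
from typing import Iterable, Optional


def needs_remote_launch(active_workers: Iterable[str],
                        local_hostname: Optional[str] = None) -> bool:
    """Decide whether the job must be dispatched to remote hosts (e.g. via pdsh).

    Hypothetical helper, not part of DeepSpeed. Returns False only when
    exactly one worker is selected AND that worker is the local machine.
    """
    workers = list(active_workers)
    if not workers:
        raise ValueError("no active workers selected")

    # More than one worker always requires a multi-node launch.
    if len(workers) > 1:
        return True

    # Assumption: hostfile names match socket.gethostname() on each node.
    if local_hostname is None:
        local_hostname = socket.gethostname()

    # A single worker only runs locally if it is this machine.
    return workers[0] != local_hostname
```

With this check, `needs_remote_launch(["worker-1"], "worker-0")` would be true, so the launcher would go through pdsh rather than silently running the script on worker-0.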