This repository was archived by the owner on Jan 6, 2023. It is now read-only.
What does it currently say?
The current doc has no explicit instructions on setting NCCL_BLOCKING_WAIT, which is essential for scale-downs when nccl is the process group backend: it ensures that workers do not block forever in an nccl kernel waiting for others. See #115 for more context.
What should it say?
Mention that:
1. NCCL_BLOCKING_WAIT=1 should be set as an environment variable.
2. When item 1 is in effect, the timeout parameter in dist.init_process_group() (passed as a datetime.timedelta) becomes the time after which the nccl watchdog (in pytorch) times out when nccl kernels do not return promptly. The default is 30 minutes, which may be too long for scale-down events. Document that the user should set this parameter to whatever makes sense for their application; the right value depends on the frequency of scale-down events and the "size" of the application's nccl operations. Setting it too small will produce false positives, where normal long-running nccl kernels are timed out.
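The two points above could be illustrated in the doc with something like the following minimal sketch. The helper names and the 300-second default are hypothetical and chosen only for illustration; the sketch also assumes the launcher has already populated the usual rendezvous environment variables (rank, world size, master address) for the default env:// init method.

```python
import os
from datetime import timedelta


def nccl_blocking_wait_settings(timeout_seconds: float) -> timedelta:
    """Enable NCCL blocking wait and return the timeout as a timedelta.

    Hypothetical helper, not part of torchelastic or PyTorch.
    """
    # Set the variable before the process group is created.
    os.environ["NCCL_BLOCKING_WAIT"] = "1"
    return timedelta(seconds=timeout_seconds)


def init_nccl_process_group(timeout_seconds: float = 300.0) -> None:
    """Create the default process group on the nccl backend.

    The 300 s default is an assumption for illustration only; pick a value
    based on how often scale-downs happen and how long the application's
    nccl operations normally run.
    """
    # Imported lazily so the helper above can be used without PyTorch.
    import torch.distributed as dist

    timeout = nccl_blocking_wait_settings(timeout_seconds)
    # With NCCL_BLOCKING_WAIT=1, an nccl operation that exceeds `timeout`
    # raises an error instead of blocking the worker forever.
    dist.init_process_group(backend="nccl", timeout=timeout)
```

A worker would call `init_nccl_process_group()` once at startup, before creating any model or issuing any collective. A timeout shorter than the longest legitimate collective would cause the false positives described above, so the value is worth measuring rather than guessing.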
Why?
Without NCCL_BLOCKING_WAIT set, the application cannot scale down properly.