This repository was archived by the owner on Jan 6, 2023. It is now read-only.

Improve docs on writing training scripts compatible with scale down #116

@kiukchung

Description


📚 Documentation

Link

What does it currently say?

The current doc does not have explicit instructions on setting NCCL_BLOCKING_WAIT, which is essential for scale-downs with nccl as the process group backend since it ensures that workers do not block forever waiting for others inside an nccl kernel. See #115 for more context.

What should it say?

Mention that:

  1. NCCL_BLOCKING_WAIT=1 should be set as an environment variable.
  2. When (1) is true, the timeout parameter in dist.init_process_group() becomes the time (in seconds) after which the nccl watchdog (in pytorch) times out nccl kernels that do not return promptly. The default is 30 min, which may be too long for scale-down events. Document that the user should set this parameter to whatever makes sense for their application; it will be a function of the frequency of scale-down events and the "size" of the application's nccl operations. Setting this value too small will result in false positives, where normal long-running nccl kernels are timed out.
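A minimal sketch of the two settings above. The NCCL_BLOCKING_WAIT env var and the timeout parameter of dist.init_process_group() are real PyTorch knobs; the helper names and the 120-second value are illustrative only and should be tuned per application:

```python
import os
from datetime import timedelta

import torch.distributed as dist


def configure_nccl_blocking_wait() -> None:
    # Must be set before init_process_group(); makes nccl calls blocking
    # so the watchdog can time them out instead of hanging forever.
    os.environ["NCCL_BLOCKING_WAIT"] = "1"


def init_for_scale_down(timeout_s: int = 120) -> None:
    # The 120s value is illustrative: choose it based on how often
    # scale-down events occur and how long your largest nccl collectives
    # run. Too small a value times out healthy long-running kernels.
    configure_nccl_blocking_wait()
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(seconds=timeout_s),  # default is 30 min
    )


if __name__ == "__main__":
    init_for_scale_down()
```

With these two settings, a worker stuck in an nccl kernel after a peer departs fails within timeout_s instead of hanging for the 30-minute default, allowing torchelastic to restart the group at the smaller world size.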

Why?

Not setting NCCL_BLOCKING_WAIT results in the application not being able to scale down properly.
