This repository was archived by the owner on Jan 6, 2023. It is now read-only.

Improve docs on writing training scripts compatible with scale down #116

@kiukchung

Description


📚 Documentation

Link

What does it currently say?

The current doc does not have explicit instructions on setting NCCL_BLOCKING_WAIT, which is essential for scale-downs with nccl as the process group backend since it ensures that workers do not block forever waiting for others inside an nccl kernel. See #115 for more context.

What should it say?

Mention that:

  1. NCCL_BLOCKING_WAIT=1 should be set as an environment variable.
  2. When (1) is true, the timeout parameter in dist.init_process_group() becomes the time (in seconds) after which the nccl watchdog (in pytorch) times out nccl kernels that do not return promptly. The default is 30 min, which may be too long for scale-down events. Document that the user should set this parameter to whatever makes sense for their application; it will be a function of the frequency of scale-down events and the "size" of the application's nccl operations. Setting this value too small will result in false positives, where normal long-running nccl kernels are timed out.
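A minimal sketch of the two settings above. The NCCL_BLOCKING_WAIT env var and the timeout parameter of dist.init_process_group() are real PyTorch knobs; the helper names and the 120-second value are illustrative only and should be tuned per application:

```python
import os
from datetime import timedelta

import torch.distributed as dist


def configure_nccl_blocking_wait() -> None:
    # Must be set before init_process_group(); makes nccl calls blocking
    # so the watchdog can time them out instead of hanging forever.
    os.environ["NCCL_BLOCKING_WAIT"] = "1"


def init_for_scale_down(timeout_s: int = 120) -> None:
    # The 120s value is illustrative: choose it based on how often
    # scale-down events occur and how long your largest nccl collectives
    # run. Too small a value times out healthy long-running kernels.
    configure_nccl_blocking_wait()
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(seconds=timeout_s),  # default is 30 min
    )


if __name__ == "__main__":
    init_for_scale_down()
```

With these two settings, a worker stuck in an nccl kernel after a peer departs fails within timeout_s instead of hanging for the 30-minute default, allowing torchelastic to restart the group at the smaller world size.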

Why?

Not setting NCCL_BLOCKING_WAIT results in the application not being able to scale down properly.
