slow ranks search improvements #99

dmonakhov · 2021-12-15T18:07:13Z

No description provided.

This allows us to simulate slow ranks and deadlocks. New options: -S/--slowrank <rank> and -D/--slowrank_delay <usec>.
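A minimal sketch of the idea in Python, not the patch's actual code: only the rank selected with -S/--slowrank sleeps for the -D/--slowrank_delay interval, and every other rank proceeds immediately. The function name `apply_slow_rank_delay` and the injectable `sleep` parameter are hypothetical, and where exactly the delay is applied in the test loop is an assumption.

```python
import time


def apply_slow_rank_delay(rank, slow_rank, delay_usec, sleep=time.sleep):
    """Sleep for delay_usec microseconds if this rank is the designated slow rank.

    Returns the delay applied in seconds (0.0 for every other rank).
    Hypothetical helper: the names are illustrative, not taken from the patch.
    """
    if slow_rank is None or rank != slow_rank or delay_usec <= 0:
        return 0.0
    delay_sec = delay_usec / 1_000_000  # option value is given in microseconds
    sleep(delay_sec)
    return delay_sec
```

Under this model, a delay longer than the communication timeout would make the slow rank look like a hung rank to its peers.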

Currently sendrecv can send data only to local peers. Let's introduce a distance metric for peers, so that different cycles can be tested. For example, ./sendrecv -r -1 will iterate over all possible distances, so all NxN communication routes are tested in only N iterations. This is a good diagnostic tool for various network issues.
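To illustrate why N iterations are enough, here is a small Python sketch. It assumes that at distance d each rank sends to (rank + d) mod N and receives from (rank - d) mod N, and that distance 0 (self) counts as one of the N distances. Both are assumptions about the patch, and the function names are hypothetical.

```python
def peers_at_distance(rank, nranks, distance):
    """Return (send_to, recv_from) for one rank at the given ring distance."""
    send_to = (rank + distance) % nranks
    recv_from = (rank - distance) % nranks
    return send_to, recv_from


def routes_for_all_distances(nranks):
    """Collect every (sender, receiver) pair exercised over distances 0..nranks-1."""
    routes = set()
    for distance in range(nranks):
        for rank in range(nranks):
            send_to, _ = peers_at_distance(rank, nranks, distance)
            routes.add((rank, send_to))
    return routes
```

For N ranks, each distance contributes N distinct routes, so the N distances together cover all N x N ordered pairs.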

Currently the only way to iterate over different roots is one by one, which is not very useful. This patch allows skipping some ranks: a negative number is interpreted as a step size. For example, my hosts have 8 GPUs each, so iterating over ranks {0,8,16,...} emulates all possible host orders: ./reduce_perf -r -8
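The root selection described here can be modelled in a few lines of Python. This is a sketch under the assumption that a non-negative value selects a single root and a negative value -k selects every k-th rank starting from 0; `roots_to_test` is a hypothetical name, not a function from the patch.

```python
def roots_to_test(root_arg, nranks):
    """Expand the -r argument into the list of root ranks to iterate over."""
    if root_arg >= 0:
        return [root_arg]  # a single, explicitly chosen root
    step = -root_arg  # negative value: its magnitude is the step size
    return list(range(0, nranks, step))
```

With 4 hosts of 8 GPUs each (32 ranks), `roots_to_test(-8, 32)` yields one root per host, and `roots_to_test(-1, 32)` degenerates to the one-by-one iteration.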

Communication timeouts are vital building blocks of reliable distributed algorithms. If one of the ranks crashes or deadlocks, the whole test will hang forever; this is expected behaviour because of the FLP impossibility result[1]. NCCL has no built-in communication timeout support because it is a general-purpose library, so timeouts should be implemented at the application level. Set the default communication timeout to 1800 sec (30 min); the user may change it via the NCCL_TESTS_COMM_TIMEOUT env variable.

Footnotes:
[1] https://en.wikipedia.org/wiki/Consensus_(computer_science)#The_FLP_impossibility_result_for_asynchronous_deterministic_consensus
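An application-level timeout of this kind usually boils down to polling for completion against a deadline. The Python sketch below shows that shape only. It assumes NCCL_TESTS_COMM_TIMEOUT is given in seconds and that invalid values fall back to the default; the helper names, the polling interval, and the `is_done` callback are all hypothetical.

```python
import os
import time

DEFAULT_COMM_TIMEOUT_SEC = 1800  # 30 minutes, the default stated in the PR


def comm_timeout_sec(env=None):
    """Read the timeout from NCCL_TESTS_COMM_TIMEOUT (assumed to be in seconds)."""
    env = os.environ if env is None else env
    raw = env.get("NCCL_TESTS_COMM_TIMEOUT")
    if raw is None:
        return DEFAULT_COMM_TIMEOUT_SEC
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_COMM_TIMEOUT_SEC  # assumption: ignore unparsable values
    return value if value > 0 else DEFAULT_COMM_TIMEOUT_SEC


def wait_for_completion(is_done, timeout_sec, clock=time.monotonic,
                        sleep=time.sleep, poll_interval_sec=0.01):
    """Poll is_done() until it returns True or the deadline passes.

    Returns True on completion and False on timeout, leaving it to the
    caller to decide how to abort.
    """
    deadline = clock() + timeout_sec
    while not is_done():
        if clock() >= deadline:
            return False
        sleep(poll_interval_sec)
    return True
```

In a real test, `is_done` would wrap a non-blocking completion check on the communication stream, and a False result would lead to aborting the run instead of hanging forever.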

Dmitry Monakhov added 4 commits December 15, 2021 14:03

add slow rank simulation options

a2b7115

This allows us to simulate slow ranks and deadlocks. New options: -S/--slowrank <rank> and -D/--slowrank_delay <usec>.
