@dmonakhov

No description provided.

Dmitry Monakhov added 4 commits December 15, 2021 14:03
This allows us to simulate slow ranks and deadlocks.
New options:
    -S/--slowrank <rank>
    -D/--slowrank_delay <usec>
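A minimal sketch of how such a delay could be injected, not the code from this patch. The helper names, and the convention that a negative slow rank means "disabled", are assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper: delay (in microseconds) that `rank` should inject
// before starting a collective. Assumption: slowRank < 0 means "disabled".
long slowRankDelayUsec(int rank, int slowRank, long delayUsec) {
  assert(delayUsec >= 0);
  return (slowRank >= 0 && rank == slowRank) ? delayUsec : 0;
}

// Sleep on the slow rank only; every other rank proceeds immediately.
void maybeDelay(int rank, int slowRank, long delayUsec) {
  long d = slowRankDelayUsec(rank, slowRank, delayUsec);
  if (d > 0) std::this_thread::sleep_for(std::chrono::microseconds(d));
}
```

A very large delay on one rank makes the other ranks wait for it, which approximates a deadlock from their point of view.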
Currently sendrecv can send data only to local peers. Let's introduce a distance metric
for peers, so one can test different cycles.

For example, ./sendrecv -r -1 will iterate over all possible distances,
so all NxN communication routes will be tested in only N iterations.
This is a good diagnostic tool for various network issues.
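The coverage claim can be checked with a small sketch. It assumes that at distance d each rank sends to (rank + d) mod N and receives from (rank - d) mod N, and that distance 0 (a rank sending to itself) is included. That peer formula and the function names are illustrative assumptions, not taken from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>

// Assumed convention: at distance d, a rank sends to (rank + d) mod n
// and receives from (rank - d) mod n.
int sendPeer(int rank, int d, int n) {
  assert(n > 0);
  return ((rank + d) % n + n) % n;
}

int recvPeer(int rank, int d, int n) {
  assert(n > 0);
  return ((rank - d) % n + n) % n;
}

// Count the distinct (source, destination) routes exercised when the
// distance iterates over 0..n-1, i.e. in n iterations.
std::size_t routesCovered(int n) {
  std::set<std::pair<int, int>> routes;
  for (int d = 0; d < n; ++d)
    for (int r = 0; r < n; ++r)
      routes.insert({r, sendPeer(r, d, n)});
  return routes.size();
}
```

Under these assumptions every iteration exercises N routes at once, so N iterations reach all N*N ordered pairs.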
Currently the only way to iterate over different roots is to iterate one by one, which is not useful.
This patch allows skipping some ranks: a negative number is interpreted as a step size.
For example:
My hosts have 8 GPUs each, so iterating over ranks {0,8,16,...} emulates all possible host orders:
./reduce_perf -r -8
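A sketch of how a negative root argument could be expanded into a list of roots. The function name and the handling of non-negative values (a single fixed root) are assumptions for illustration only.

```cpp
#include <cassert>
#include <vector>

// Hypothetical interpretation of the root argument:
//   rootArg >= 0 : test that single root
//   rootArg <  0 : test roots 0, k, 2k, ... below nRanks, where k = -rootArg
std::vector<int> rootsToTest(int rootArg, int nRanks) {
  assert(nRanks > 0);
  std::vector<int> roots;
  if (rootArg >= 0) {
    roots.push_back(rootArg);
    return roots;
  }
  const int step = -rootArg;
  for (int r = 0; r < nRanks; r += step) roots.push_back(r);
  return roots;
}
```

With 4 hosts of 8 GPUs each (32 ranks), a step of 8 selects one root per host: 0, 8, 16, 24.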
Communication timeouts are vital building blocks of reliable distributed algorithms.
If one of the ranks crashes or deadlocks, the whole test will hang forever; this is
expected behaviour because of the FLP impossibility result [1]. NCCL has no built-in
communication timeout support because it is a general-purpose library.
Timeouts should be implemented at the application level. Set the default communication
timeout to 1800 sec (30 min); the user may change it via the NCCL_TESTS_COMM_TIMEOUT env variable.
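An application-level timeout of this kind can be sketched as a polling loop with a deadline. Only the NCCL_TESTS_COMM_TIMEOUT variable name and the 1800 sec default come from the commit message; the helper names, the fallback to the default on invalid values, and the polling predicate are assumptions.

```cpp
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <functional>
#include <thread>

// Parse a timeout in seconds; fall back to the default when the value is
// missing, empty, non-numeric, or not positive (assumed policy).
long parseTimeoutSec(const char* text, long defaultSec = 1800) {
  if (text == nullptr || *text == '\0') return defaultSec;
  char* end = nullptr;
  long v = std::strtol(text, &end, 10);
  return (*end == '\0' && v > 0) ? v : defaultSec;
}

// Read the timeout from the environment variable named in the commit message.
long commTimeoutSec() {
  return parseTimeoutSec(std::getenv("NCCL_TESTS_COMM_TIMEOUT"));
}

// Poll `isDone` until it returns true or `timeout` elapses.
// Returns true on completion, false on timeout.
bool waitWithTimeout(const std::function<bool()>& isDone,
                     std::chrono::milliseconds timeout) {
  assert(timeout.count() >= 0);
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!isDone()) {
    if (std::chrono::steady_clock::now() >= deadline) return false;
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  return true;
}
```

In a real test the predicate would check whether the pending communication has completed, and a false return would let the application abort with an error instead of hanging.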


Footnotes:
[1] https://en.wikipedia.org/wiki/Consensus_(computer_science)#The_FLP_impossibility_result_for_asynchronous_deterministic_consensus