fix: nvls all reduce correction factor #239
Open
I was running `nccl-tests` on a single H100 server (8xH100 SXM) and saw a bus BW of 480 GByte/s even though the line rate is 450 GByte/s. I was confused, so I looked further into how bus BW is calculated, and it seems to be calculated incorrectly for in-network reduction algorithms.

According to #212 (comment), the actual correction factor should be `bus_bw = algo_bw * (n-1)/(n+1)` instead of `bus_bw = algo_bw * 2(n-1)/n`.

This PR is probably not mergeable, since `NCCL_ALGO` can be auto-picked or set in `/etc/nccl.conf`, and there doesn't seem to be an API for seeing which algorithm NCCL has chosen. The correction factors for `CollnetDirect` and `CollnetChain` on the IB network probably need to be updated too.

I just wanted to put it here in case anyone else in the community is confused about how bus BW could come out at about 106% of the peak theoretical line rate.
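As a rough sanity check, here is a small Python sketch of what the two factors do to the 480 GByte/s figure on 8 ranks. It assumes the two formulas quoted in this description are correct, and the function names are made up for illustration; none of this is nccl-tests code.

```python
# Sketch only: the formulas are the ones quoted in this PR description,
# and the helper names are hypothetical (not part of nccl-tests).

def ring_busbw_factor(n: int) -> float:
    """Factor currently applied to all-reduce: 2(n-1)/n."""
    return 2 * (n - 1) / n


def nvls_busbw_factor(n: int) -> float:
    """Factor proposed for NVLS all-reduce: (n-1)/(n+1)."""
    return (n - 1) / (n + 1)


def rescale_busbw(reported_busbw: float, n: int) -> float:
    """Undo the ring factor to recover algo_bw, then apply the NVLS factor."""
    algo_bw = reported_busbw / ring_busbw_factor(n)
    return algo_bw * nvls_busbw_factor(n)


n_ranks = 8
reported = 480.0  # GByte/s, the bus BW reported with the current factor

print(f"ring factor: {ring_busbw_factor(n_ranks):.3f}")
print(f"nvls factor: {nvls_busbw_factor(n_ranks):.3f}")
print(f"algo_bw:     {reported / ring_busbw_factor(n_ranks):.1f} GByte/s")
print(f"rescaled:    {rescale_busbw(reported, n_ranks):.1f} GByte/s")
```

With these inputs the current factor is 1.75 and the proposed one is 7/9, so the same measurement would be reported as roughly 213 GByte/s instead of 480 GByte/s.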
Command
NCCL_ALGO=NVLS ./build/all_reduce_perf -b 8K -e 8G -f 2 -g 8
Before
After
Factor vs number of ranks
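For anyone who wants the numbers rather than a plot, a minimal Python sketch that tabulates both factors for a few rank counts. It assumes the formulas `2(n-1)/n` (current) and `(n-1)/(n+1)` (proposed for NVLS) from the description; `factor_table` is a hypothetical helper, not something in nccl-tests.

```python
# Sketch only: tabulates the two correction factors quoted in this PR.
# `factor_table` is a hypothetical helper name.

def factor_table(ranks):
    """Return (n, current_factor, proposed_nvls_factor) for each rank count."""
    rows = []
    for n in ranks:
        current = 2 * (n - 1) / n       # factor applied today
        proposed = (n - 1) / (n + 1)    # factor proposed for NVLS
        rows.append((n, current, proposed))
    return rows


print(f"{'ranks':>5} {'2(n-1)/n':>10} {'(n-1)/(n+1)':>12}")
for n, current, proposed in factor_table([2, 4, 8, 16]):
    print(f"{n:>5} {current:>10.3f} {proposed:>12.3f}")
```

The current factor approaches 2 as the number of ranks grows, while the proposed one approaches 1.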
NVLS read/write