Skip to content

Conversation

@x41lakazam
Copy link

@x41lakazam x41lakazam commented Apr 4, 2024

In a bisection, each rank is paired with one other rank called 'peer', they both send and receive N messages, this number is bound to the agg_iters parameter.

The selection of this 'peer' rank is defined by the getPeer function:
https://github.com/x41lakazam/nccl-tests/blob/bisection_test/src/bisection.cu#L19

@sjeaugey
Copy link
Member

sjeaugey commented Apr 4, 2024

Can't you already do that running the sendrecv_perf test, and setting NCCL_TESTS_SPLIT_MASK=(nranks/2)-1?

Sure, that only works with nranks being a power of two; maybe your code is more generic.

@x41lakazam
Copy link
Author

Being able to run this test when nranks is not a power of two is actually important to us

@x41lakazam x41lakazam force-pushed the bisection_test branch 3 times, most recently from 52ebfc6 to 42a079f Compare April 14, 2024 06:47
Add bisection to makefile

Add bisection doc in performance.md
Add bisection to makefile

Add bisection doc in performance.md
@x41lakazam
Copy link
Author

x41lakazam commented Apr 14, 2024

@AddyLaddy @sjeaugey

The code is ready to merge from our side
Please let me know what you think about it, thanks

@sjeaugey
Copy link
Member

sjeaugey commented Mar 4, 2025

@x41lakazam Given the new NCCL_TESTS_SPLIT option allowing to split ranks in any way and no longer tied to powers of two, could you describe a use case where this allows to test performance in a new/better way?

@AddyLaddy
Copy link
Collaborator

Oh, I wanted the cross rail change removed. Why did you restore it?

@x41lakazam
Copy link
Author

@sjeaugey I didn't notice this change, does it permit bisection on any number of ranks ?

@sjeaugey
Copy link
Member

sjeaugey commented Mar 6, 2025

Yes, if you set NCCL_TESTS_SPLIT="MOD X" with X=N/2 then it should be a bisection test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants