@xlliu-scitix
Collaborator

No description provided.

jbachan and others added 30 commits December 18, 2024 11:20
…ffer sizes to be multiples of 16 bytes.

This ensures non-pow2 ranks have buffer addresses aligned suitably for performance.
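
To illustrate why a 16-byte multiple helps, here is a minimal Python sketch; it is not the nccl-tests code. It assumes each rank's slice starts at `rank * per_rank_bytes` from a 16-byte-aligned base and that sizes are rounded up. `align_up` and `rank_offsets` are hypothetical helper names.

```python
ALIGN = 16  # bytes, taken from the commit message above


def align_up(nbytes: int, align: int = ALIGN) -> int:
    """Round nbytes up to the next multiple of align (rounding up is an assumption)."""
    return (nbytes + align - 1) // align * align


def rank_offsets(per_rank_bytes: int, nranks: int) -> list[int]:
    """Byte offset of each rank's slice inside one contiguous buffer."""
    return [rank * per_rank_bytes for rank in range(nranks)]


# With 3 ranks (not a power of two) and 20-byte slices, offsets drift off alignment.
unaligned = rank_offsets(20, 3)
# After rounding the slice size up to a multiple of 16, every offset stays aligned.
aligned = rank_offsets(align_up(20), 3)
```

With the rounded size every entry of `aligned` is divisible by 16, while `unaligned` contains offsets such as 20 that are not.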
Add support for Blackwell sm100 and sm120 from CUDA 12.8

Add support for Hopper sm90 from CUDA 12.0
`NCCL_TESTS_SPLIT` serves as a new way of computing the color used to split communicators.

It is overridden by `NCCL_TESTS_SPLIT_MASK` when both are set.

Examples:

NCCL_TESTS_SPLIT_MASK="0x7" # color = rank & 0x7. What we do today to run on a DGX with one GPU per node.
NCCL_TESTS_SPLIT="AND 0x7"  # color = rank & 0x7. New way to run on one GPU per node on a DGX, equivalent to NCCL_TESTS_SPLIT_MASK=0x7
NCCL_TESTS_SPLIT="MOD 72"   # color = rank % 72.  One GPU per NVLink domain on an NVL72 system.
NCCL_TESTS_SPLIT="DIV 72"   # color = rank / 72.  Intra NVLink domain on NVL72.

You can also use the short forms "%", "&", "|" and "/".
Extra spaces are ignored automatically.
The operator names are not case sensitive.

The following are all equivalent:

NCCL_TESTS_SPLIT="&0x7"
NCCL_TESTS_SPLIT="&0b111"
NCCL_TESTS_SPLIT="AND 7"
NCCL_TESTS_SPLIT="and 0x7"
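
The rules above (operator plus operand, short forms, ignored spaces, case insensitivity) can be sketched in Python as follows. This is an illustration under stated assumptions, not the parser used by the tests: `parse_split` is a hypothetical name, and "OR" as the long form of "|" is assumed because only the symbol is listed.

```python
import re

# Long and short operator spellings. "OR" is an assumed long form of "|".
_OPS = {
    "AND": lambda rank, n: rank & n,
    "&": lambda rank, n: rank & n,
    "OR": lambda rank, n: rank | n,
    "|": lambda rank, n: rank | n,
    "MOD": lambda rank, n: rank % n,
    "%": lambda rank, n: rank % n,
    "DIV": lambda rank, n: rank // n,
    "/": lambda rank, n: rank // n,
}

# Operator followed by a hex, binary, or decimal literal (already upper-cased).
_SPEC = re.compile(r"(AND|OR|MOD|DIV|[&|%/])(0X[0-9A-F]+|0B[01]+|[0-9]+)")


def parse_split(spec: str):
    """Return a function mapping rank -> color for a split specification."""
    compact = "".join(spec.split()).upper()  # drop all spaces, ignore case
    match = _SPEC.fullmatch(compact)
    if match is None:
        raise ValueError(f"unrecognized split specification: {spec!r}")
    op, literal = match.groups()
    if literal.startswith("0X"):
        operand = int(literal[2:], 16)
    elif literal.startswith("0B"):
        operand = int(literal[2:], 2)
    else:
        operand = int(literal, 10)
    if op in ("MOD", "%", "DIV", "/") and operand == 0:
        raise ValueError("MOD and DIV need a non-zero operand")
    return lambda rank: _OPS[op](rank, operand)


color_of = parse_split("and 0x7")
```

Here `color_of(rank)` computes `rank & 0x7`, and `parse_split("&0b111")` or `parse_split("AND 7")` would yield the same colors.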
Added new datatypes: f8e4m3, f8e5m2

Only supported on H100+ architectures and NCCL versions >= 2.24.0
Build option DSO=1 generates libverifiable.so which can be
used to reduce the combined binary size.

Build option NAME_SUFFIX can be used to add a suffix to all
generated binaries, e.g. NAME_SUFFIX=_mpi

Added new make target: clean_intermediates
From NCCL 2.27.x we can now use the Symmetric Memory APIs (-R 2)
One thing missing from the stdout of each performance test is
the name of the test that is actually being run.

This patch adds 2 new messages to the stdout. At the beginning
of the execution of a test (e.g. sendrecv_perf) we will now
see this message:

  Collective test starting: sendrecv_perf

And at the end, we will now see this:

  Collective test concluded: sendrecv_perf

This is needed when running several tests consecutively and we're
trying to parse the stdout to collect the results.

For example, using a Python script to parse the stdout, one could
retrieve the results for each test and plot them on a graph. This
patch makes it easier to implement such a script.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
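
As a rough idea of the kind of script the message above has in mind, the following Python sketch groups stdout lines by test using the two marker lines. Only the marker strings come from the commit message; `split_by_test` is a hypothetical helper, and the lines between the markers are kept unparsed.

```python
import re

# Marker lines quoted in the commit message; leading indentation is tolerated.
_START = re.compile(r"^\s*Collective test starting: (\S+)\s*$")
_END = re.compile(r"^\s*Collective test concluded: (\S+)\s*$")


def split_by_test(stdout: str) -> dict[str, list[str]]:
    """Map each test name to the stdout lines printed between its two markers."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in stdout.splitlines():
        start = _START.match(line)
        end = _END.match(line)
        if start:
            current = start.group(1)
            sections.setdefault(current, [])
        elif end and end.group(1) == current:
            current = None  # lines outside a test are ignored
        elif current is not None:
            sections[current].append(line)
    return sections
```

A plotting script could then parse the lines stored under each key, for example `sections["sendrecv_perf"]`, without having to guess where one test ends and the next begins.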
Move comments to separate lines
Also, don't allow minBytes > maxBytes
Print the name of the program being executed before and after test output
Loops between minBytes and maxBytes doubling size each time

Reduced default warmup iteration count to 1 (was 5)
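
The minBytes/maxBytes sweep and the minBytes > maxBytes check mentioned above can be sketched in Python as follows. This is an illustration only: `message_sizes` is a hypothetical name, and rejecting a start below 1 is an added assumption so that doubling always terminates.

```python
def message_sizes(min_bytes: int, max_bytes: int):
    """Yield min_bytes, 2*min_bytes, 4*min_bytes, ... while not exceeding max_bytes."""
    if min_bytes > max_bytes:
        raise ValueError("minBytes must not exceed maxBytes")
    if min_bytes < 1:
        # Assumption for this sketch: doubling zero would never advance.
        raise ValueError("minBytes must be at least 1 when doubling")
    size = min_bytes
    while size <= max_bytes:
        yield size
        size *= 2
```

Because this is a generator, the checks run when iteration starts, for example in `list(message_sizes(8, 64))`.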
Added Device API infrastructure and example kernels
Two new command line arguments:

  -D <num> device kernel implementation to use <0/1/2/3/4>
  -V <num> number of CTAs to launch device kernels with

Added new CTA Policy command line option:

  -x <policy> set the CTA Policy <0/1/2>
Fix compilation failure on ctaPolicy with NCCL <= 2.26.
Fix compilation failure on local_register with NCCL <= 2.18.
Fix ctaPolicy behavior if the tests are compiled with NCCL <= 2.26
but run with NCCL >= 2.27.
The CUDA error message "Test CUDA failure util.cu:706 'invalid device ordinal'"
is not very helpful. Test for this condition explicitly and guide the user.
This adds support for writing structured information about the run to a JSON file.

Enable with -J <filename>.json

If the target JSON filename already exists then an incrementing numeric suffix will be
added to create <filename>.json.<n>
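
The filename rule above could look like the following Python sketch. It is not the code from the patch: `next_free_path` is a hypothetical name, and starting the suffix at 1 is an assumption, since the message does not say where counting begins.

```python
import os


def next_free_path(path: str) -> str:
    """Return path if unused, else the first unused '<path>.<n>'."""
    if not os.path.exists(path):
        return path
    n = 1  # assumed starting suffix
    while os.path.exists(f"{path}.{n}"):
        n += 1
    return f"{path}.{n}"
```

Called with `results.json`, this returns `results.json` on the first run and `results.json.1`, `results.json.2`, and so on once earlier files exist.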
- add GIN-only A2A kernel implementation
- add hybrid LSA+GIN A2A kernel implementation
- update perf test cases to expose a function for setting
  devCommRequirements for each device implementation and
  simplify devCommCreate code path to use this directly instead
  of complex fallback logic
- add missing call to devCommDestroy
8 participants