Sync/upstream 20251216 #2

xlliu-scitix · 2025-12-21T06:20:55Z

No description provided.

`NCCL_TESTS_SPLIT` serves as a new way of computing the color for splitting communicators. It will be overridden by `NCCL_TESTS_SPLIT_MASK`.

Examples:

- NCCL_TESTS_SPLIT_MASK="0x7" # color = rank & 0x7. What we do today to run on a DGX with one GPU per node.
- NCCL_TESTS_SPLIT="AND 0x7" # color = rank & 0x7. New way to run on one GPU per node on a DGX, equivalent to NCCL_TESTS_SPLIT_MASK=0x7
- NCCL_TESTS_SPLIT="MOD 72" # color = rank % 72. One GPU per NVLink domain on an NVL72 system.
- NCCL_TESTS_SPLIT="DIV 72" # color = rank / 72. Intra NVLink domain on NVL72.

You can also use "%", "&", "|" and "/" for short. Extra spaces in the middle are ignored, and the value is not case sensitive. The following are all equivalent:

- NCCL_TESTS_SPLIT="&0x7"
- NCCL_TESTS_SPLIT="&0b111"
- NCCL_TESTS_SPLIT="AND 7"
- NCCL_TESTS_SPLIT="and 0x7"
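The color rules above can be sketched in a few lines of Python. This is only an illustration of the described semantics, not the nccl-tests parser: the function name `split_color`, the spelled-out keyword `OR`, and the reading of "|" as bitwise OR are assumptions that the commit message does not confirm.

```python
import re

# Hypothetical operator table. AND/MOD/DIV and their short forms come from the
# commit message; "OR" and the meaning of "|" are assumptions.
_OPS = {
    "AND": lambda rank, v: rank & v, "&": lambda rank, v: rank & v,
    "OR": lambda rank, v: rank | v, "|": lambda rank, v: rank | v,
    "MOD": lambda rank, v: rank % v, "%": lambda rank, v: rank % v,
    "DIV": lambda rank, v: rank // v, "/": lambda rank, v: rank // v,
}

_SPEC = re.compile(r"(AND|OR|MOD|DIV|[&|%/])(0X[0-9A-F]+|0B[01]+|[0-9]+)")


def split_color(spec: str, rank: int) -> int:
    """Compute the communicator color for `rank` from a spec such as "MOD 72"."""
    # Drop all whitespace and ignore case, as the description allows.
    text = "".join(spec.split()).upper()
    match = _SPEC.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognized split spec: {spec!r}")
    op, value = match.groups()
    # Base 0 lets int() accept decimal, 0x... and 0b... literals.
    return _OPS[op](rank, int(value, 0))


if __name__ == "__main__":
    for spec in ("AND 0x7", "and 7", "&0b111", "MOD 72", "DIV 72"):
        print(spec, "->", split_color(spec, 100))
```

Ranks that end up with the same color land in the same communicator, so with "MOD 72" ranks 28 and 100 would share one.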

Build option DSO=1 generates libverifiable.so, which can be used to reduce the combined binary size.
Build option NAME_SUFFIX can be used to add a suffix to all generated binaries, e.g. NAME_SUFFIX=_mpi.
Added new make target: clean_intermediates

One thing missing from the stdout of each performance test is the name of the test that is actually being run. This patch adds 2 new messages to the stdout.

At the beginning of the execution of a test (e.g. sendrecv_perf) we will now see this message:

Collective test starting: sendrecv_perf

And at the end, we will now see this:

Collective test concluded: sendrecv_perf

This is needed when running several tests consecutively and we're trying to parse the stdout to collect the results. For example, using a Python script to parse the stdout, one could retrieve the results for each test and plot them on a graph. This patch makes it easier to implement such a script.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
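As a rough sketch of the kind of parsing script this message has in mind, the two marker strings can be used to group captured output by test name. The function name `split_by_test` is hypothetical, and the code only assumes that each marker appears somewhere on its own line; it does not interpret the result rows themselves.

```python
START = "Collective test starting: "
END = "Collective test concluded: "


def split_by_test(stdout_text: str) -> dict:
    """Group output lines by the test that printed them.

    Lines outside a starting/concluded pair are ignored.
    """
    sections = {}
    current = None
    for line in stdout_text.splitlines():
        start_at = line.find(START)
        if start_at != -1:
            # Everything after the marker is taken as the test name.
            current = line[start_at + len(START):].strip()
            sections.setdefault(current, [])
        elif END in line:
            current = None
        elif current is not None:
            sections[current].append(line)
    return sections


if __name__ == "__main__":
    import sys

    for name, lines in split_by_test(sys.stdin.read()).items():
        print(f"{name}: {len(lines)} output lines")
```

Piping the combined stdout of several consecutive runs into this script would give one bucket of lines per test, ready for further parsing or plotting.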

Added Device API infrastructure and example kernels.

Two new command line arguments:

- -D <num> device kernel implementation to use <0/1/2/3/4>
- -V <num> number of CTAs to launch device kernels with

Added new CTA Policy command line option:

- -x <policy> set the CTA Policy <0/1/2>

This adds support for writing structured information about the run to a JSON file. Enable with -J <filename>.json

If the target JSON filename already exists, an incrementing numeric suffix will be added to create <filename>.json.<n>
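The suffix rule can be illustrated with a small Python sketch. It is not the nccl-tests implementation: the helper name `pick_json_path` is hypothetical, and starting the suffix at 1 is an assumption, since the message does not say which value of <n> is tried first.

```python
import os


def pick_json_path(path: str, exists=os.path.exists) -> str:
    """Return `path` if it is free, otherwise the first free `path.<n>`.

    `exists` is injectable so the rule can be exercised without touching disk.
    """
    if not exists(path):
        return path
    n = 1  # assumption: numbering starts at 1
    while exists(f"{path}.{n}"):
        n += 1
    return f"{path}.{n}"


if __name__ == "__main__":
    taken = {"run.json", "run.json.1"}
    print(pick_json_path("run.json", exists=taken.__contains__))
```

The point of the rule is that repeated runs with the same -J argument never overwrite earlier results.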

- add GIN-only A2A kernel implementation
- add hybrid LSA+GIN A2A kernel implementation
- update perf test cases to expose a function for setting devCommRequirements for each device implementation and simplify devCommCreate code path to use this directly instead of complex fallback logic
- add missing call to devCommDestroy

Based on code from yakovdyadkin & Scott Moe in MR 349.

Adds -S 1 option to suffix each performance report line with a timestamp. Format is "%Y-%m-%d %H:%M:%S"

This is especially useful when using the -N 0 option and looking for hangs or failure events.
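To show how such timestamps might help when hunting for hangs, here is a hedged Python sketch that reads the trailing timestamp of each line and reports the largest gap between consecutive stamped lines. Only the timestamp format comes from the message above; the helper names and the layout of the sample lines in the usage section are made up for illustration.

```python
import re
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"
# A timestamp in the stated format at the very end of a line.
_TRAILING_TS = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s*$")


def line_timestamp(line: str):
    """Return the trailing timestamp of a report line, or None if absent."""
    match = _TRAILING_TS.search(line)
    return datetime.strptime(match.group(1), TS_FORMAT) if match else None


def largest_gap_seconds(lines) -> float:
    """Largest time gap, in seconds, between consecutive timestamped lines."""
    stamps = [t for t in map(line_timestamp, lines) if t is not None]
    gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return max(gaps, default=0.0)


if __name__ == "__main__":
    # Hypothetical report lines; real column layout is not shown in this PR.
    report = [
        "8 2 float 10.5 2025-12-21 06:20:55",
        "a line without a timestamp",
        "16 4 float 11.0 2025-12-21 06:22:25",
    ]
    print(largest_gap_seconds(report))
```

An unusually large gap between two report lines is a reasonable place to start looking for a stall.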

jbachan and others added 30 commits December 18, 2024 11:20

Fixes to all tests that divide buffers by nranks so that they trim buffer sizes to be multiples of 16 bytes.

29f4114

This ensures non-pow2 ranks have buffer addresses aligned suitably for performance.
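A small arithmetic sketch of why trimming helps, written in Python for brevity. The helper name `per_rank_bytes` and the exact rounding rule (round the per-rank share down to a multiple of 16) are assumptions about what "trim" means here, not a copy of the test code.

```python
ALIGN = 16  # bytes, from the commit message


def per_rank_bytes(total_bytes: int, nranks: int) -> int:
    """Per-rank share of a buffer, rounded down to a multiple of ALIGN."""
    share = total_bytes // nranks
    return share - share % ALIGN


if __name__ == "__main__":
    total, nranks = 1000, 3  # a non-power-of-two rank count
    share = per_rank_bytes(total, nranks)
    offsets = [rank * share for rank in range(nranks)]
    print(share, offsets)
```

Without trimming, 1000 bytes over 3 ranks gives 333-byte slices whose start offsets (333, 666) are not 16-byte aligned; with a trimmed share of 320 bytes every slice starts on a 16-byte boundary.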

Update CUDA gencodes

cb6a46f

Add support for Blackwell sm100 and sm120 from CUDA 12.8.
Add support for Hopper sm90 from CUDA 12.0.
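The version gating described here can be sketched as follows. This covers only the two thresholds stated in the commit message; the function names are hypothetical, and the real Makefile handles other architectures and conditions that are not shown in this PR.

```python
def gated_archs(cuda_major: int, cuda_minor: int) -> list:
    """SM architectures enabled by the two rules in the commit message."""
    version = (cuda_major, cuda_minor)
    archs = []
    if version >= (12, 0):
        archs.append(90)  # Hopper
    if version >= (12, 8):
        archs += [100, 120]  # Blackwell
    return archs


def gencode_flags(archs) -> list:
    """Render nvcc -gencode flags for the given SM architectures."""
    return [f"-gencode=arch=compute_{a},code=sm_{a}" for a in archs]


if __name__ == "__main__":
    print(" ".join(gencode_flags(gated_archs(12, 8))))
```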

Add NCCL_TESTS_SPLIT documentation in the README

903918f

Add PCI domain and device ID for GPU device BDF display

b4300cc

Add support for FP8 datatypes

501a149

Added new datatypes: f8e4m3, f8e5m2.
Only supported on H100+ architectures and NCCL versions >= 2.24.0.

Re-add sm_70 support for CUDA 12.8+ and 13.0 builds

e041d90

Add support for Symmetric Memory Registration

a5c539e

From NCCL 2.27.x we can now use the Symmetric Memory APIs (-R 2)

Fix formatting errors in README.md

0c60e6a

Need to drop Volta (sm_70) support from CUDA 13.0

8bc16f4

Reinstate Pascal support for CUDA 12.8+ builds

5290298

Wrap ncclCommWindowRegister() calls within ncclGroup

e7c8825

Add Turing (SM75) support to CUDA 13.0 builds

97ee098

Minor fix to Makefile

def2d36

Move comments to separate lines

Add extra reserved space during maxBytes calculation

6edafa0

Also, don't allow minBytes > maxBytes

Merge pull request NVIDIA#316 from martin-belanger/print-program-name

fae7cb4

Print the name of the program being executed before and after test output

Modified warmup to run for more message sizes

f2015cb

Loops between minBytes and maxBytes, doubling size each time.
Reduced default warmup iteration count to 1 (was 5).
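The size sweep described here can be pictured with a short Python sketch. The function name `warmup_sizes` is hypothetical, and the check that rejects minBytes > maxBytes mirrors the rule from an earlier commit in this list rather than the actual test code; rejecting a non-positive minimum is an extra guard added so the doubling loop terminates.

```python
def warmup_sizes(min_bytes: int, max_bytes: int) -> list:
    """Message sizes visited when doubling from min_bytes up to max_bytes."""
    if min_bytes <= 0:
        # Guard added for this sketch: doubling zero would never advance.
        raise ValueError("min_bytes must be positive")
    if min_bytes > max_bytes:
        raise ValueError("min_bytes must not exceed max_bytes")
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= 2
    return sizes


if __name__ == "__main__":
    print(warmup_sizes(8, 100))
```

With one warmup iteration per size, a sweep like this touches each size once before timing starts, instead of repeating a single size several times.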

Update NVCUFLAGS and CXXFLAGS to use -std=c++14

c2cb96f

Fix compilation for old NCCL versions

9a5c154

Fix compilation failure on ctaPolicy with NCCL <= 2.26.
Fix compilation failure on local_register with NCCL <= 2.18.
Fix ctaPolicy behavior if the tests are compiled with NCCL <= 2.26 but run with NCCL >= 2.27.

Check if sufficient GPUs are available

abc4677

The CUDA error message "Test CUDA failure util.cu:706 'invalid device ordinal'" is not very helpful. Test for this condition explicitly and guide the user.

Add PRINT of nccl-tests, NCCL header and library versions

9641693

NCCL_TESTS_VERSION 2.17.4

3744121

add runtime guards for ncclAlltoAll()

f66d20e

add necessary ifdef guards for device API tests

013c49e

NCCL_TESTS_VERSION 2.17.5

0bb567c

AddyLaddy and others added 25 commits October 28, 2025 10:11

Add new report_timestamps option to README.md

e2af90a

NCCL_TESTS_VERSION 2.17.6

da0b547

Remove trailing WS when timestamp option not used

51f2e7e

Add README.md text for -J option

4bc314a

Merge remote-tracking branch 'upstream/master' into sync/upstream-20251216

8ab1ad2

add scripts

351b3c1

update scripts

a922a05

add Dockerfile

4aebbcb

update Dockerfile

7dce185

add github workflow

cc230e6

Merge branch 'sicl' into sync/upstream-20251216

e9e9a27

delete duplicated pre-check.yml

b5c25a8

update Dockerfile

5fe638c

update pre-check.yml

c9ace03

fix bugs

e115bd8

update Dockerfile

d5d08b9

update dockerfile

eb04f21

update dockerfile

2225d3f

update dockerfile

a63433c

update dockerfile

0b7aeaf

update dockerfile

ea941ce

update dockerfile

9b37f54

update dockerfile

db45ae9

update release.yml

587d57c


8 participants
