forked from NVIDIA/nccl-tests
Sync/upstream 20251216 #2
Open: xlliu-scitix wants to merge 55 commits into sicl from sync/upstream-20251216
…ffer sizes to be multiples of 16 bytes. This ensures non-pow2 ranks have buffer addresses aligned suitably for performance.
Add support for Blackwell sm100 and sm120 from CUDA 12.8. Add support for Hopper sm90 from CUDA 12.0.
`NCCL_TESTS_SPLIT` provides a new way of computing the color used to split communicators. It is overridden by `NCCL_TESTS_SPLIT_MASK`.
  Examples:
    NCCL_TESTS_SPLIT_MASK="0x7"  # color = rank & 0x7. What we do today to run on a DGX with one GPU per node.
    NCCL_TESTS_SPLIT="AND 0x7"   # color = rank & 0x7. New way to run on one GPU per node on a DGX, equivalent to NCCL_TESTS_SPLIT_MASK=0x7.
    NCCL_TESTS_SPLIT="MOD 72"    # color = rank % 72. One GPU per NVLink domain on an NVL72 system.
    NCCL_TESTS_SPLIT="DIV 72"    # color = rank / 72. Intra NVLink domain on NVL72.
  The short forms "%", "&", "|", and "/" are also accepted. Extra spaces in the middle are ignored, and the value is not case sensitive.
  The following are all equivalent:
    NCCL_TESTS_SPLIT="%0x7"
    NCCL_TESTS_SPLIT="%0b111"
    NCCL_TESTS_SPLIT="AND 7"
    NCCL_TESTS_SPLIT="and 0x7"
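The color rules described in this commit message can be sketched in Python to see how a given setting maps ranks to colors. This is a rough illustration only, not the nccl-tests parser: the function name `split_color` is hypothetical, "OR" as the long form of "|" is an assumption (the message names only AND, MOD and DIV), and the sketch reads "%" as modulo and "&" as bitwise AND, following the per-line comments in the examples.

```python
import re

# Operator spellings mapped to a canonical symbol.
# "OR" as the word form of "|" is an assumption; the commit message
# only spells out AND, MOD and DIV.
_CANONICAL = {"AND": "&", "OR": "|", "MOD": "%", "DIV": "/"}

_SPEC_RE = re.compile(r"(AND|OR|MOD|DIV|[%&|/])(0X[0-9A-F]+|0B[01]+|[0-9]+)")


def _parse_literal(text):
    """Parse a decimal, 0x-prefixed or 0b-prefixed literal (already uppercased)."""
    if text.startswith("0X"):
        return int(text[2:], 16)
    if text.startswith("0B"):
        return int(text[2:], 2)
    return int(text, 10)


def split_color(rank, spec):
    """Return the split color for `rank` under a spec such as "MOD 72"."""
    # Drop all whitespace and ignore case, as the commit message describes.
    compact = "".join(spec.split()).upper()
    match = _SPEC_RE.fullmatch(compact)
    if match is None:
        raise ValueError(f"unrecognized split spec: {spec!r}")
    op = _CANONICAL.get(match.group(1), match.group(1))
    value = _parse_literal(match.group(2))
    if op == "&":
        return rank & value
    if op == "|":
        return rank | value
    if value == 0:
        raise ValueError("MOD and DIV need a non-zero operand")
    return rank % value if op == "%" else rank // value
```

With this sketch, `split_color(73, "MOD 72")` and `split_color(73, "DIV 72")` both give 1: rank 73 is the second GPU of the second NVLink domain on an NVL72 layout.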
Added new datatypes: f8e4m3, f8e5m2. Only supported on H100+ architectures and NCCL versions >= 2.24.0.
Build option DSO=1 generates libverifiable.so, which can be used to reduce the combined binary size.
  Build option NAME_SUFFIX can be used to add a suffix to all generated binaries, e.g. NAME_SUFFIX=_mpi
  Added new make target: clean_intermediates
From NCCL 2.27.x we can now use the Symmetric Memory APIs (-R 2)
One thing missing from the stdout of each performance test is the name of the test that is actually being run. This patch adds 2 new messages to the stdout.
  At the beginning of the execution of a test (e.g. sendrecv_perf) we will now see this message:
    Collective test starting: sendrecv_perf
  And at the end, we will now see this:
    Collective test concluded: sendrecv_perf
  This is needed when running several tests consecutively and parsing the stdout to collect the results. For example, a Python script could retrieve the results for each test from the stdout and plot them on a graph. This patch makes it easier to implement such a script.
  Signed-off-by: Martin Belanger <martin.belanger@dell.com>
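A script of the kind this commit message has in mind might start by grouping stdout lines by test, using the two marker messages as delimiters. The following is a minimal sketch under assumptions: the helper name `split_by_test` is hypothetical, and the markers are searched for anywhere in a line in case the real output prefixes them with other characters.

```python
import re

# Marker messages quoted in the commit message.
START_RE = re.compile(r"Collective test starting: (\S+)")
END_RE = re.compile(r"Collective test concluded: (\S+)")


def split_by_test(stdout_text):
    """Group stdout lines by the test that produced them.

    Returns a dict mapping test name -> list of lines found between that
    test's "starting" and "concluded" markers. Lines outside any pair of
    markers are dropped.
    """
    sections = {}
    current = None
    for line in stdout_text.splitlines():
        start = START_RE.search(line)
        if start:
            current = start.group(1)
            sections.setdefault(current, [])
        elif END_RE.search(line):
            current = None
        elif current is not None:
            sections[current].append(line)
    return sections
```

Each returned list can then be handed to whatever row parser the script uses for the performance table itself.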
Move comments to separate lines
Also, don't allow minBytes > maxBytes
Print the name of the program being executed before and after test output
Loops between minBytes and maxBytes, doubling size each time. Reduced default warmup iteration count to 1 (was 5).
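The doubling sweep, together with the earlier rule that minBytes may not exceed maxBytes, can be sketched as follows. This is an illustration only: the function name `sweep_sizes` is hypothetical, and rejecting a non-positive minimum is an assumption made so the loop always terminates.

```python
def sweep_sizes(min_bytes, max_bytes):
    """Return the sizes visited when doubling from min_bytes up to max_bytes."""
    if min_bytes <= 0:
        # Assumption: doubling zero would never make progress, so reject it.
        raise ValueError("min_bytes must be positive")
    if min_bytes > max_bytes:
        raise ValueError("min_bytes must not exceed max_bytes")
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= 2
    return sizes
```

Note that when max_bytes is not min_bytes times a power of two, this sketch stops at the last doubled size below it rather than visiting max_bytes itself.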
Added Device API infrastructure and example kernels.
  Two new command line arguments:
    -D <num>     device kernel implementation to use <0/1/2/3/4>
    -V <num>     number of CTAs to launch device kernels with
  Added new CTA Policy command line option:
    -x <policy>  set the CTA Policy <0/1/2>
Fix compilation failure on ctaPolicy with NCCL <= 2.26. Fix compilation failure on local_register with NCCL <= 2.18. Fix ctaPolicy behavior if the tests are compiled with NCCL <= 2.26 but run with NCCL >= 2.27.
The CUDA error message "Test CUDA failure util.cu:706 'invalid device ordinal'" is not very helpful. Test for this condition explicitly and guide the user.
This adds support for writing structured information about the run to a JSON file. Enable with -J <filename>.json
  If the target JSON filename already exists, an incrementing numeric suffix is added to create <filename>.json.<n>
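The incrementing-suffix rule can be sketched like this. It is not the nccl-tests implementation: the helper name `pick_output_path` is hypothetical, and starting the suffix at 1 is an assumption, since the commit message does not state the first value of <n>.

```python
import os


def pick_output_path(filename, exists=os.path.exists):
    """Return `filename` if it is free, else the first free `filename.<n>`.

    `exists` is injectable so the rule can be exercised without touching disk.
    """
    if not exists(filename):
        return filename
    n = 1  # Assumption: the first suffix tried is 1.
    while exists(f"{filename}.{n}"):
        n += 1
    return f"{filename}.{n}"
```

For example, if `out.json` and `out.json.1` are both present, this sketch picks `out.json.2`.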
- add GIN-only A2A kernel implementation
- add hybrid LSA+GIN A2A kernel implementation
- update perf test cases to expose a function for setting devCommRequirements for each device implementation, and simplify the devCommCreate code path to use this directly instead of complex fallback logic
- add missing call to devCommDestroy
Based on code from yakovdyadkin & Scott Moe in MR 349.
  Adds -S 1 option to suffix each performance report line with a timestamp. Format is "%Y-%m-%d %H:%M:%S".
  This is especially useful when using the -N 0 option and looking for hangs or failure events.
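The timestamp format quoted above is a standard strftime pattern, so its output is easy to preview. A small sketch, assuming a single space between the report line and the timestamp (the commit message does not specify the separator) and using a hypothetical helper name `with_timestamp`:

```python
from datetime import datetime

# Format string quoted in the commit message.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def with_timestamp(report_line, now=None):
    """Append a formatted timestamp to a report line.

    The single-space separator is an assumption for illustration.
    """
    stamp = (now or datetime.now()).strftime(TIMESTAMP_FORMAT)
    return f"{report_line} {stamp}"
```

A script watching for hangs could parse the trailing 19 characters of each line back with `datetime.strptime` and the same format string, then flag large gaps between consecutive lines.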
No description provided.