@jasl jasl commented Sep 16, 2025

No description provided.

@AddyLaddy
Collaborator

Thanks for the patch.
Do the nccl-tests not run on those platforms without the change?

@johnnynunez

@AddyLaddy It would be better to add the Blackwell family target to save on binaries. Change
-gencode=arch=compute_120,code=compute_120
to
-gencode=arch=compute_120f,code=sm_120
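
For readers comparing the two flags: in an nvcc `-gencode` flag, a `code=compute_*` target embeds PTX that the driver JIT-compiles at load time, while a `code=sm_*` target embeds a native SASS binary. The Python sketch below only splits a flag into its parts and labels each target. The helper name `parse_gencode` is hypothetical and not part of nvcc or nccl-tests, and the sketch does not check whether a given architecture name is accepted by any particular CUDA toolkit.

```python
import re


def parse_gencode(flag: str) -> dict:
    """Split an nvcc -gencode flag into its virtual arch and code targets.

    Hypothetical helper for illustration only; it does not validate that
    the architecture names exist in any CUDA toolkit.
    """
    m = re.fullmatch(r"-gencode[= ]arch=(\w+),code=(.+)", flag.strip())
    if m is None:
        raise ValueError(f"not a -gencode flag: {flag!r}")
    arch, code = m.group(1), m.group(2)

    targets = {}
    # code= may hold one target or a bracketed/quoted comma-separated list.
    for target in code.strip("[]\"'").split(","):
        target = target.strip()
        if target.startswith("compute_"):
            targets[target] = "PTX"    # JIT-compiled by the driver at load time
        elif target.startswith("sm_"):
            targets[target] = "SASS"   # native binary for that architecture
        else:
            targets[target] = "other"
    return {"arch": arch, "targets": targets}


before = parse_gencode("-gencode=arch=compute_120,code=compute_120")
after = parse_gencode("-gencode=arch=compute_120f,code=sm_120")
print(before)
print(after)
```

Under this reading, the original flag ships only PTX for `compute_120`, and the suggested flag ships a native `sm_120` binary compiled from the `compute_120f` virtual architecture.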

@jasl
Author

jasl commented Sep 17, 2025

> Thanks for the patch. Do the nccl-tests not run on those platforms without the change?

No. Without the patch, the test cannot run:

# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid     72 on  jasl-thor device  0 [0000:01:00] NVIDIA Thor
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    33554432       8388608     float     sum      -1jasl-thor: Test CUDA failure common.cu:164 'no kernel image is available for execution on the device'
 .. jasl-thor pid 72: Test failure all_reduce.cu:52
 .. jasl-thor pid 72: Test failure common.cu:459
 .. jasl-thor pid 72: Test failure common.cu:650
 .. jasl-thor pid 72: Test failure all_reduce.cu:518
 .. jasl-thor pid 72: Test failure common.cu:664
 .. jasl-thor pid 72: Test failure common.cu:1386
 .. jasl-thor pid 72: Test failure common.cu:1050

With the patch, the test runs:

# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid     72 on  jasl-thor device  0 [0000:01:00] NVIDIA Thor
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    33554432       8388608     float     sum      -1    297.1  112.93    0.00      0     0.25  135710.54    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0
#
# Collective test concluded: all_reduce_perf

@johnnynunez Thank you for the tip. I'll update it.

@jasl
Author

jasl commented Sep 17, 2025

I have tested on my Thor and on an x86 machine with an RTX PRO 6000:

=== cnccl-tests/all_reduce_perf ===
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid     72 on  jasl-thor device  0 [0000:01:00] NVIDIA Thor
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    33554432       8388608     float     sum      -1    290.3  115.57    0.00      0     0.46  73431.30    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0
#
# Collective test concluded: all_reduce_perf
=== cnccl-tests/all_reduce_perf ===
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid     29 on jasl-workstation-ubuntu device  0 [0000:11:00] NVIDIA RTX PRO 6000 Blackwell Workstation Edition
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    33554432       8388608     float     sum      -1    10.47  3205.78    0.00      0     0.10  331729.43    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0
#
# Collective test concluded: all_reduce_perf
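
For reference, the out-of-place `algbw` column in the logs above can be reproduced from the `size` and `time` columns, and the `busbw` of 0.00 follows from running on a single GPU. The sketch below assumes the all-reduce bus bandwidth factor of 2(n-1)/n described in the nccl-tests performance notes. The function names are hypothetical, and small differences from the logged values come from the rounded `time` column.

```python
import math


def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: bytes moved divided by elapsed time."""
    return size_bytes / (time_us * 1e-6) / 1e9


def allreduce_busbw_gbps(algbw: float, n_ranks: int) -> float:
    """Bus bandwidth for all-reduce, assuming the 2*(n-1)/n correction factor."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


SIZE = 33554432  # bytes, from the "size" column

# (out-of-place time in us, logged algbw in GB/s) from the three passing runs
runs = [(297.1, 112.93), (290.3, 115.57), (10.47, 3205.78)]

for time_us, logged in runs:
    computed = algbw_gbps(SIZE, time_us)
    # The logged time is rounded, so compare with a loose tolerance.
    print(f"{time_us:>7} us  computed={computed:8.2f}  logged={logged:8.2f}  "
          f"close={math.isclose(computed, logged, rel_tol=1e-3)}")

# With nGpus 1 there is a single rank, so the factor 2*(1-1)/1 is zero.
print(allreduce_busbw_gbps(algbw_gbps(SIZE, 297.1), n_ranks=1))
```

With one rank the correction factor is zero, which is consistent with the reported average bus bandwidth of 0 in every run.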
