@ChihayaK ChihayaK commented Nov 25, 2025

Details

Work item: #2026

What were the changes?
Addresses issue #2026.
Added gfx1151 support by enabling it along the same code path as gfx1100.
Fixed the enqueue.cc variable reference error that caused a crash when the all_reduce_bias operation was called.

Why were the changes made?
To support vLLM inference across two Strix Halo devices.

How was the outcome achieved?
gfx1151 works once it is enabled in the codebase; no further changes were needed. To verify this, I ran rccl-tests with gfx1151 support enabled. With NCCL_DMABUF_ENABLE=1, 11 of 12 tests passed. The only failure was all_reduce_bias_perf, which hit a segmentation fault. Further investigation showed that the function that is supposed to block this operation on unsupported architectures is broken.

vLLM works with RCCL enabled on gfx1151 when tested with Qwen3-4B across two nodes. Pipeline parallelism fails unless CUDA graphs are disabled, but tensor parallelism appears to work without any issue. I also tested the full FP16 Llama 3.3 70B weights in vLLM with tp=2: it runs if CUDA graphs are disabled, but only reaches 1.8 tokens/s. That is consistent with a memory bandwidth of roughly 250 GB/s, taking into account the slow connection between the two nodes. This suggests that the current codebase can be used to support inference across multiple nodes.
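The 1.8 tokens/s figure can be sanity-checked with a rough, bandwidth-bound estimate. The sketch below assumes about 70e9 parameters at 2 bytes each, weights split evenly across the two nodes under tp=2, and one full read of each node's weight shard per decoded token. It ignores KV-cache traffic and inter-node communication, so it gives a ceiling rather than a prediction. The function and parameter names are illustrative only.

```python
def decode_tokens_per_second_upper_bound(
    num_params: float,
    bytes_per_param: float,
    tensor_parallel: int,
    mem_bandwidth_bytes_per_s: float,
) -> float:
    """Bandwidth-bound ceiling on single-stream decode throughput.

    Assumes each decoded token reads one node's full weight shard once.
    """
    shard_bytes = num_params * bytes_per_param / tensor_parallel
    seconds_per_token = shard_bytes / mem_bandwidth_bytes_per_s
    return 1.0 / seconds_per_token


# Assumed inputs: 70e9 params, FP16 (2 bytes), tp=2, ~250 GB/s per node.
bound = decode_tokens_per_second_upper_bound(
    num_params=70e9,
    bytes_per_param=2,
    tensor_parallel=2,
    mem_bandwidth_bytes_per_s=250e9,
)
print(f"{bound:.2f} tokens/s")  # prints "3.57 tokens/s"
```

Under these assumptions the ceiling is about 3.6 tokens/s, so the measured 1.8 tokens/s is roughly half of it. The gap is plausibly explained by per-layer communication over the slow inter-node link and other overheads not modelled here.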

Additional Documentation:
rccl-tests results

======== all_reduce_perf ========
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288079 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  26450 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    14.22    0.00    0.00      0    13.51    0.00    0.00      0
          16             4     float     sum      -1    13.59    0.00    0.00      0    12.60    0.00    0.00      0
          32             8     float     sum      -1    12.16    0.00    0.00      0    13.41    0.00    0.00      0
          64            16     float     sum      -1    13.73    0.00    0.00      0    14.13    0.00    0.00      0
         128            32     float     sum      -1    13.46    0.01    0.01      0    13.31    0.01    0.01      0
         256            64     float     sum      -1    12.76    0.02    0.02      0    13.78    0.02    0.02      0
         512           128     float     sum      -1    13.13    0.04    0.04      0    14.19    0.04    0.04      0
        1024           256     float     sum      -1    14.01    0.07    0.07      0    15.16    0.07    0.07      0
        2048           512     float     sum      -1    19.13    0.11    0.11      0    19.29    0.11    0.11      0
        4096          1024     float     sum      -1    20.29    0.20    0.20      0    20.15    0.20    0.20      0
        8192          2048     float     sum      -1    22.02    0.37    0.37      0    22.02    0.37    0.37      0
       16384          4096     float     sum      -1    24.25    0.68    0.68      0    23.88    0.69    0.69      0
       32768          8192     float     sum      -1    31.03    1.06    1.06      0    29.58    1.11    1.11      0
       65536         16384     float     sum      -1    37.57    1.74    1.74      0    37.25    1.76    1.76      0
      131072         32768     float     sum      -1    50.61    2.59    2.59      0    59.39    2.21    2.21      0
      262144         65536     float     sum      -1    83.17    3.15    3.15      0    97.55    2.69    2.69      0
      524288        131072     float     sum      -1    101.0    5.19    5.19      0    103.7    5.06    5.06      0
     1048576        262144     float     sum      -1    207.0    5.07    5.07      0    214.4    4.89    4.89      0
     2097152        524288     float     sum      -1    364.4    5.76    5.76      0    489.4    4.29    4.29      0
     4194304       1048576     float     sum      -1    681.9    6.15    6.15      0    684.5    6.13    6.13      0
     8388608       2097152     float     sum      -1   1387.5    6.05    6.05      0   1403.8    5.98    5.98      0
    16777216       4194304     float     sum      -1   2640.1    6.35    6.35      0   2624.4    6.39    6.39      0
    33554432       8388608     float     sum      -1   5269.0    6.37    6.37      0   5246.7    6.40    6.40      0
    67108864      16777216     float     sum      -1    10532    6.37    6.37      0    10499    6.39    6.39      0
   134217728      33554432     float     sum      -1    21025    6.38    6.38      0    20913    6.42    6.42      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.49879
#
# Collective test concluded: all_reduce_perf


======== all_gather_perf ========
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288130 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  26523 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float    none      -1     0.19    0.00    0.00      0     0.17    0.00    0.00      0
           0             0     float    none      -1     0.15    0.00    0.00      0     0.12    0.00    0.00      0
          32             4     float    none      -1    11.70    0.00    0.00      0    10.89    0.00    0.00      0
          64             8     float    none      -1    10.51    0.01    0.00      0    10.84    0.01    0.00      0
         128            16     float    none      -1    10.30    0.01    0.01      0    10.47    0.01    0.01      0
         256            32     float    none      -1     9.90    0.03    0.01      0    10.35    0.02    0.01      0
         512            64     float    none      -1     9.87    0.05    0.03      0    10.37    0.05    0.02      0
        1024           128     float    none      -1    15.07    0.07    0.03      0    14.26    0.07    0.04      0
        2048           256     float    none      -1    14.46    0.14    0.07      0    14.71    0.14    0.07      0
        4096           512     float    none      -1    15.56    0.26    0.13      0    15.40    0.27    0.13      0
        8192          1024     float    none      -1    16.58    0.49    0.25      0    15.78    0.52    0.26      0
       16384          2048     float    none      -1    17.79    0.92    0.46      0    17.37    0.94    0.47      0
       32768          4096     float    none      -1    19.35    1.69    0.85      0    19.06    1.72    0.86      0
       65536          8192     float    none      -1    22.27    2.94    1.47      0    21.73    3.02    1.51      0
      131072         16384     float    none      -1    30.03    4.36    2.18      0    27.15    4.83    2.41      0
      262144         32768     float    none      -1    40.55    6.46    3.23      0    39.06    6.71    3.36      0
      524288         65536     float    none      -1    62.06    8.45    4.22      0    61.03    8.59    4.30      0
     1048576        131072     float    none      -1    118.3    8.86    4.43      0    103.0   10.18    5.09      0
     2097152        262144     float    none      -1    200.1   10.48    5.24      0    194.2   10.80    5.40      0
     4194304        524288     float    none      -1    382.1   10.98    5.49      0    372.4   11.26    5.63      0
     8388608       1048576     float    none      -1    720.3   11.65    5.82      0    702.7   11.94    5.97      0
    16777216       2097152     float    none      -1   1430.2   11.73    5.87      0   1384.9   12.11    6.06      0
    33554432       4194304     float    none      -1   2875.2   11.67    5.84      0   2777.6   12.08    6.04      0
    67108864       8388608     float    none      -1   5733.3   11.71    5.85      0   5523.7   12.15    6.07      0
   134217728      16777216     float    none      -1    11452   11.72    5.86      0    11028   12.17    6.09      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.34279
#
# Collective test concluded: all_gather_perf
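
A note on reading the two tables above: rccl-tests follows the nccl-tests convention in which algbw is the message size divided by the elapsed time, and busbw scales algbw by a collective-specific factor (2(n-1)/n for all-reduce, (n-1)/n for all-gather and reduce-scatter, where n is the number of ranks). The sketch below recomputes the last out-of-place row of each table. The helper names are illustrative and not part of rccl-tests.

```python
# busbw correction factors as documented for nccl-tests (n = number of ranks).
BUSBW_FACTOR = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
}


def bandwidths(collective: str, size_bytes: int, time_us: float, nranks: int):
    """Return (algbw, busbw) in GB/s, with 1 GB = 1e9 bytes."""
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    busbw = algbw * BUSBW_FACTOR[collective](nranks)
    return algbw, busbw


# Last out-of-place rows from the logs above (2 ranks, 134217728 bytes).
all_reduce = bandwidths("all_reduce", 134217728, 21025, 2)
all_gather = bandwidths("all_gather", 134217728, 11452, 2)
print(f"all_reduce: algbw={all_reduce[0]:.2f} busbw={all_reduce[1]:.2f}")
print(f"all_gather: algbw={all_gather[0]:.2f} busbw={all_gather[1]:.2f}")
```

With two ranks the all-reduce factor is 1, which is why algbw and busbw are identical in that table, while the all-gather factor is 0.5, which is why its busbw is half of its algbw.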


======== broadcast_perf ========
# Collective test starting: broadcast_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288170 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  26597 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float    none       0     9.26    0.00    0.00      0     7.87    0.00    0.00      0
          16             4     float    none       0     7.65    0.00    0.00      0     7.36    0.00    0.00      0
          32             8     float    none       0     6.88    0.00    0.00      0     7.25    0.00    0.00      0
          64            16     float    none       0     6.88    0.01    0.01      0     8.04    0.01    0.01      0
         128            32     float    none       0     6.89    0.02    0.02      0     7.27    0.02    0.02      0
         256            64     float    none       0     7.10    0.04    0.04      0     7.22    0.04    0.04      0
         512           128     float    none       0     8.88    0.06    0.06      0     8.92    0.06    0.06      0
        1024           256     float    none       0     8.58    0.12    0.12      0     8.84    0.12    0.12      0
        2048           512     float    none       0     8.63    0.24    0.24      0     8.92    0.23    0.23      0
        4096          1024     float    none       0     8.74    0.47    0.47      0     8.65    0.47    0.47      0
        8192          2048     float    none       0     8.96    0.91    0.91      0    19.04    0.43    0.43      0
       16384          4096     float    none       0    10.04    1.63    1.63      0    10.23    1.60    1.60      0
       32768          8192     float    none       0    10.65    3.08    3.08      0    10.37    3.16    3.16      0
       65536         16384     float    none       0    13.21    4.96    4.96      0    13.11    5.00    5.00      0
      131072         32768     float    none       0    24.34    5.39    5.39      0    24.18    5.42    5.42      0
      262144         65536     float    none       0    43.00    6.10    6.10      0    42.86    6.12    6.12      0
      524288        131072     float    none       0    84.62    6.20    6.20      0    93.74    5.59    5.59      0
     1048576        262144     float    none       0    151.9    6.90    6.90      0    149.8    7.00    7.00      0
     2097152        524288     float    none       0    295.7    7.09    7.09      0    313.8    6.68    6.68      0
     4194304       1048576     float    none       0    603.0    6.96    6.96      0    588.2    7.13    7.13      0
     8388608       2097152     float    none       0   1189.1    7.05    7.05      0   1170.7    7.17    7.17      0
    16777216       4194304     float    none       0   2365.1    7.09    7.09      0   2350.3    7.14    7.14      0
    33554432       8388608     float    none       0   4691.4    7.15    7.15      0   4692.0    7.15    7.15      0
    67108864      16777216     float    none       0   9407.6    7.13    7.13      0   9352.9    7.18    7.18      0
   134217728      33554432     float    none       0    18781    7.15    7.15      0    18756    7.16    7.16      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.41227
#
# Collective test concluded: broadcast_perf


======== reduce_scatter_perf ========
# Collective test starting: reduce_scatter_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288220 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  26668 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float     sum      -1     0.15    0.00    0.00      0     0.10    0.00    0.00      0
           0             0     float     sum      -1     0.08    0.00    0.00      0     0.10    0.00    0.00      0
          32             4     float     sum      -1    11.47    0.00    0.00      0    22.10    0.00    0.00      0
          64             8     float     sum      -1    10.51    0.01    0.00      0    10.51    0.01    0.00      0
         128            16     float     sum      -1    10.24    0.01    0.01      0    10.21    0.01    0.01      0
         256            32     float     sum      -1    10.16    0.03    0.01      0    10.19    0.03    0.01      0
         512            64     float     sum      -1     9.81    0.05    0.03      0    10.19    0.05    0.03      0
        1024           128     float     sum      -1    13.89    0.07    0.04      0    14.91    0.07    0.03      0
        2048           256     float     sum      -1    14.20    0.14    0.07      0    14.38    0.14    0.07      0
        4096           512     float     sum      -1    14.90    0.27    0.14      0    14.83    0.28    0.14      0
        8192          1024     float     sum      -1    16.33    0.50    0.25      0    16.12    0.51    0.25      0
       16384          2048     float     sum      -1    17.29    0.95    0.47      0    28.28    0.58    0.29      0
       32768          4096     float     sum      -1    18.63    1.76    0.88      0    18.91    1.73    0.87      0
       65536          8192     float     sum      -1    21.70    3.02    1.51      0    21.60    3.03    1.52      0
      131072         16384     float     sum      -1    28.33    4.63    2.31      0    30.26    4.33    2.17      0
      262144         32768     float     sum      -1    51.46    5.09    2.55      0    40.67    6.45    3.22      0
      524288         65536     float     sum      -1    61.87    8.47    4.24      0    62.12    8.44    4.22      0
     1048576        131072     float     sum      -1    107.8    9.72    4.86      0    107.8    9.72    4.86      0
     2097152        262144     float     sum      -1    199.5   10.51    5.26      0    200.6   10.45    5.23      0
     4194304        524288     float     sum      -1    378.7   11.08    5.54      0    377.6   11.11    5.55      0
     8388608       1048576     float     sum      -1    734.8   11.42    5.71      0    737.7   11.37    5.69      0
    16777216       2097152     float     sum      -1   1453.6   11.54    5.77      0   1448.0   11.59    5.79      0
    33554432       4194304     float     sum      -1   2879.8   11.65    5.83      0   2881.7   11.64    5.82      0
    67108864       8388608     float     sum      -1   5766.4   11.64    5.82      0   5756.0   11.66    5.83      0
   134217728      16777216     float     sum      -1    11509   11.66    5.83      0    11486   11.69    5.84      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.29119
#
# Collective test concluded: reduce_scatter_perf


======== reduce_perf ========
# Collective test starting: reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288260 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  26739 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum       0    33.99    0.00    0.00      0     9.99    0.00    0.00      0
          16             4     float     sum       0    10.28    0.00    0.00      0     9.28    0.00    0.00      0
          32             8     float     sum       0     9.71    0.00    0.00      0     9.73    0.00    0.00      0
          64            16     float     sum       0     9.73    0.01    0.01      0     9.96    0.01    0.01      0
         128            32     float     sum       0     9.32    0.01    0.01      0     9.86    0.01    0.01      0
         256            64     float     sum       0    10.05    0.03    0.03      0     9.83    0.03    0.03      0
         512           128     float     sum       0    11.35    0.05    0.05      0    11.11    0.05    0.05      0
        1024           256     float     sum       0    16.65    0.06    0.06      0    11.57    0.09    0.09      0
        2048           512     float     sum       0    17.25    0.12    0.12      0    11.55    0.18    0.18      0
        4096          1024     float     sum       0    11.86    0.35    0.35      0    11.61    0.35    0.35      0
        8192          2048     float     sum       0    12.27    0.67    0.67      0    12.44    0.66    0.66      0
       16384          4096     float     sum       0    12.23    1.34    1.34      0    12.99    1.26    1.26      0
       32768          8192     float     sum       0    13.31    2.46    2.46      0    13.60    2.41    2.41      0
       65536         16384     float     sum       0    16.16    4.06    4.06      0    16.88    3.88    3.88      0
      131072         32768     float     sum       0    27.38    4.79    4.79      0    27.69    4.73    4.73      0
      262144         65536     float     sum       0    45.20    5.80    5.80      0    50.31    5.21    5.21      0
      524288        131072     float     sum       0    90.78    5.78    5.78      0    84.14    6.23    6.23      0
     1048576        262144     float     sum       0    162.0    6.47    6.47      0    153.6    6.83    6.83      0
     2097152        524288     float     sum       0    302.9    6.92    6.92      0    306.2    6.85    6.85      0
     4194304       1048576     float     sum       0    593.9    7.06    7.06      0    587.2    7.14    7.14      0
     8388608       2097152     float     sum       0   1185.3    7.08    7.08      0   1186.6    7.07    7.07      0
    16777216       4194304     float     sum       0   2366.1    7.09    7.09      0   2355.7    7.12    7.12      0
    33554432       8388608     float     sum       0   4717.9    7.11    7.11      0   4727.0    7.10    7.10      0
    67108864      16777216     float     sum       0   9406.1    7.13    7.13      0   9421.2    7.12    7.12      0
   134217728      33554432     float     sum       0    18746    7.16    7.16      0    18733    7.16    7.16      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.26086
#
# Collective test concluded: reduce_perf


======== alltoall_perf ========
# Collective test starting: alltoall_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288351 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  26810 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float    none      -1     0.06    0.00    0.00      0     0.04    0.00    0.00    N/A
           0             0     float    none      -1     0.04    0.00    0.00      0     0.06    0.00    0.00    N/A
          32             4     float    none      -1    12.95    0.00    0.00      0    13.11    0.00    0.00    N/A
          64             8     float    none      -1    12.66    0.01    0.00      0    13.04    0.00    0.00    N/A
         128            16     float    none      -1    12.35    0.01    0.01      0    13.25    0.01    0.00    N/A
         256            32     float    none      -1    12.34    0.02    0.01      0    13.51    0.02    0.01    N/A
         512            64     float    none      -1    12.46    0.04    0.02      0    13.22    0.04    0.02    N/A
        1024           128     float    none      -1    12.75    0.08    0.04      0    13.34    0.08    0.04    N/A
        2048           256     float    none      -1    12.93    0.16    0.08      0    13.53    0.15    0.08    N/A
        4096           512     float    none      -1    13.38    0.31    0.15      0    14.27    0.29    0.14    N/A
        8192          1024     float    none      -1    14.65    0.56    0.28      0    14.66    0.56    0.28    N/A
       16384          2048     float    none      -1    15.86    1.03    0.52      0    15.98    1.03    0.51    N/A
       32768          4096     float    none      -1    17.48    1.87    0.94      0    17.81    1.84    0.92    N/A
       65536          8192     float    none      -1    21.89    2.99    1.50      0    21.73    3.02    1.51    N/A
      131072         16384     float    none      -1    33.02    3.97    1.98      0    35.10    3.73    1.87    N/A
      262144         32768     float    none      -1    42.91    6.11    3.05      0    46.84    5.60    2.80    N/A
      524288         65536     float    none      -1    63.30    8.28    4.14      0    68.98    7.60    3.80    N/A
     1048576        131072     float    none      -1    113.7    9.23    4.61      0    110.0    9.53    4.76    N/A
     2097152        262144     float    none      -1    191.3   10.96    5.48      0    194.1   10.80    5.40    N/A
     4194304        524288     float    none      -1    356.0   11.78    5.89      0    366.8   11.44    5.72    N/A
     8388608       1048576     float    none      -1    688.3   12.19    6.09      0    689.3   12.17    6.09    N/A
    16777216       2097152     float    none      -1   1350.0   12.43    6.21      0   1348.0   12.45    6.22    N/A
    33554432       4194304     float    none      -1   2665.5   12.59    6.29      0   2675.8   12.54    6.27    N/A
    67108864       8388608     float    none      -1   5305.2   12.65    6.32      0   5309.3   12.64    6.32    N/A
   134217728      16777216     float    none      -1    10580   12.69    6.34      0    10589   12.67    6.34    N/A
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.38156
#
# Collective test concluded: alltoall_perf


======== scatter_perf ========
# Collective test starting: scatter_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288392 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  26882 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float    none       0     0.08    0.00    0.00      0     0.04    0.00    0.00      0
           0             0     float    none       0     0.06    0.00    0.00      0     0.05    0.00    0.00      0
          32             4     float    none       0     9.52    0.00    0.00      0     9.05    0.00    0.00      0
          64             8     float    none       0     8.01    0.01    0.00      0     8.84    0.01    0.00      0
         128            16     float    none       0     7.92    0.02    0.01      0     9.04    0.01    0.01      0
         256            32     float    none       0     9.04    0.03    0.01      0     8.83    0.03    0.01      0
         512            64     float    none       0     7.99    0.06    0.03      0     8.68    0.06    0.03      0
        1024           128     float    none       0     8.10    0.13    0.06      0     8.84    0.12    0.06      0
        2048           256     float    none       0     7.80    0.26    0.13      0     9.58    0.21    0.11      0
        4096           512     float    none       0     8.10    0.51    0.25      0     8.98    0.46    0.23      0
        8192          1024     float    none       0     8.39    0.98    0.49      0     9.45    0.87    0.43      0
       16384          2048     float    none       0     9.47    1.73    0.86      0     9.52    1.72    0.86      0
       32768          4096     float    none       0    11.51    2.85    1.42      0    11.20    2.93    1.46      0
       65536          8192     float    none       0    14.06    4.66    2.33      0    14.03    4.67    2.34      0
      131072         16384     float    none       0    22.19    5.91    2.95      0    22.18    5.91    2.95      0
      262144         32768     float    none       0    31.97    8.20    4.10      0    32.31    8.11    4.06      0
      524288         65536     float    none       0    54.98    9.54    4.77      0    54.39    9.64    4.82      0
     1048576        131072     float    none       0    94.93   11.05    5.52      0    93.75   11.19    5.59      0
     2097152        262144     float    none       0    171.6   12.22    6.11      0    178.5   11.75    5.87      0
     4194304        524288     float    none       0    330.9   12.68    6.34      0    326.3   12.85    6.43      0
     8388608       1048576     float    none       0    653.7   12.83    6.42      0    640.5   13.10    6.55      0
    16777216       2097152     float    none       0   1259.2   13.32    6.66      0   1263.0   13.28    6.64      0
    33554432       4194304     float    none       0   2505.8   13.39    6.70      0   2509.9   13.37    6.68      0
    67108864       8388608     float    none       0   4994.9   13.44    6.72      0   4998.0   13.43    6.71      0
   134217728      16777216     float    none       0   9967.1   13.47    6.73      0   9980.8   13.45    6.72      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.74422
#
# Collective test concluded: scatter_perf


======== gather_perf ========
# Collective test starting: gather_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288442 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  26955 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float    none       0     0.07    0.00    0.00      0     0.06    0.00    0.00      0
           0             0     float    none       0     0.04    0.00    0.00      0     0.04    0.00    0.00      0
          32             4     float    none       0    41.64    0.00    0.00      0    16.50    0.00    0.00      0
          64             8     float    none       0    10.46    0.01    0.00      0    11.43    0.01    0.00      0
         128            16     float    none       0    10.27    0.01    0.01      0     9.59    0.01    0.01      0
         256            32     float    none       0     9.33    0.03    0.01      0    16.87    0.02    0.01      0
         512            64     float    none       0     9.34    0.05    0.03      0    10.16    0.05    0.03      0
        1024           128     float    none       0    10.58    0.10    0.05      0    11.58    0.09    0.04      0
        2048           256     float    none       0    12.44    0.16    0.08      0    11.80    0.17    0.09      0
        4096           512     float    none       0    11.64    0.35    0.18      0    11.82    0.35    0.17      0
        8192          1024     float    none       0    12.11    0.68    0.34      0    11.74    0.70    0.35      0
       16384          2048     float    none       0    12.05    1.36    0.68      0    11.76    1.39    0.70      0
       32768          4096     float    none       0    13.05    2.51    1.26      0    13.19    2.48    1.24      0
       65536          8192     float    none       0    16.33    4.01    2.01      0    16.27    4.03    2.01      0
      131072         16384     float    none       0    27.27    4.81    2.40      0    24.62    5.32    2.66      0
      262144         32768     float    none       0    34.68    7.56    3.78      0    34.25    7.65    3.83      0
      524288         65536     float    none       0    61.76    8.49    4.24      0    64.35    8.15    4.07      0
     1048576        131072     float    none       0    94.78   11.06    5.53      0    95.52   10.98    5.49      0
     2097152        262144     float    none       0    191.1   10.97    5.49      0    175.2   11.97    5.99      0
     4194304        524288     float    none       0    339.1   12.37    6.18      0    340.6   12.32    6.16      0
     8388608       1048576     float    none       0    648.7   12.93    6.47      0    643.5   13.04    6.52      0
    16777216       2097152     float    none       0   1260.1   13.31    6.66      0   1270.5   13.20    6.60      0
    33554432       4194304     float    none       0   2513.5   13.35    6.67      0   2511.8   13.36    6.68      0
    67108864       8388608     float    none       0   5000.6   13.42    6.71      0   5002.9   13.41    6.71      0
   134217728      16777216     float    none       0   9988.5   13.44    6.72      0   9983.1   13.44    6.72      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.63133
#
# Collective test concluded: gather_perf


======== sendrecv_perf ========
# Collective test starting: sendrecv_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288483 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  27028 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    17.36    0.00    0.00      0    16.22    0.00    0.00    N/A
          16             4     float     sum      -1    15.34    0.00    0.00      0    15.64    0.00    0.00    N/A
          32             8     float     sum      -1    15.01    0.00    0.00      0    12.71    0.00    0.00    N/A
          64            16     float     sum      -1    12.67    0.01    0.01      0    12.74    0.01    0.01    N/A
         128            32     float     sum      -1    12.43    0.01    0.01      0    12.61    0.01    0.01    N/A
         256            64     float     sum      -1    13.07    0.02    0.02      0    12.74    0.02    0.02    N/A
         512           128     float     sum      -1    12.77    0.04    0.04      0    13.19    0.04    0.04    N/A
        1024           256     float     sum      -1    13.85    0.07    0.07      0    13.05    0.08    0.08    N/A
        2048           512     float     sum      -1    14.13    0.14    0.14      0    13.36    0.15    0.15    N/A
        4096          1024     float     sum      -1    14.79    0.28    0.28      0    13.96    0.29    0.29    N/A
        8192          2048     float     sum      -1    15.13    0.54    0.54      0    16.02    0.51    0.51    N/A
       16384          4096     float     sum      -1    17.54    0.93    0.93      0    19.29    0.85    0.85    N/A
       32768          8192     float     sum      -1    22.08    1.48    1.48      0    21.19    1.55    1.55    N/A
       65536         16384     float     sum      -1    32.57    2.01    2.01      0    34.12    1.92    1.92    N/A
      131072         32768     float     sum      -1    54.24    2.42    2.42      0    46.41    2.82    2.82    N/A
      262144         65536     float     sum      -1    62.95    4.16    4.16      0    66.57    3.94    3.94    N/A
      524288        131072     float     sum      -1    103.5    5.06    5.06      0    118.7    4.42    4.42    N/A
     1048576        262144     float     sum      -1    187.9    5.58    5.58      0    194.0    5.41    5.41    N/A
     2097152        524288     float     sum      -1    351.9    5.96    5.96      0    362.2    5.79    5.79    N/A
     4194304       1048576     float     sum      -1    688.3    6.09    6.09      0    690.9    6.07    6.07    N/A
     8388608       2097152     float     sum      -1   1349.7    6.21    6.21      0   1355.0    6.19    6.19    N/A
    16777216       4194304     float     sum      -1   2665.0    6.30    6.30      0   2671.1    6.28    6.28    N/A
    33554432       8388608     float     sum      -1   5303.8    6.33    6.33      0   5307.6    6.32    6.32    N/A
    67108864      16777216     float     sum      -1    10578    6.34    6.34      0    10585    6.34    6.34    N/A
   134217728      33554432     float     sum      -1    21120    6.36    6.36      0    21131    6.35    6.35    N/A
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.63449
#
# Collective test concluded: sendrecv_perf


======== alltoallv_perf ========
# Collective test starting: alltoallv_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288534 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  27100 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float    none      -1     0.06    0.00    0.00      0     0.04    0.00    0.00    N/A
           0             0     float    none      -1     0.04    0.00    0.00      0     0.06    0.00    0.00    N/A
          32             4     float    none      -1    13.08    0.00    0.00      0    13.18    0.00    0.00    N/A
          64             8     float    none      -1    12.82    0.00    0.00      0    13.04    0.00    0.00    N/A
         128            16     float    none      -1    12.33    0.01    0.01      0    14.19    0.01    0.00    N/A
         256            32     float    none      -1    12.40    0.02    0.01      0    13.01    0.02    0.01    N/A
         512            64     float    none      -1    12.64    0.04    0.02      0    12.99    0.04    0.02    N/A
        1024           128     float    none      -1    12.76    0.08    0.04      0    13.17    0.08    0.04    N/A
        2048           256     float    none      -1    24.03    0.09    0.04      0    13.89    0.15    0.07    N/A
        4096           512     float    none      -1    13.58    0.30    0.15      0    14.27    0.29    0.14    N/A
        8192          1024     float    none      -1    14.73    0.56    0.28      0    14.82    0.55    0.28    N/A
       16384          2048     float    none      -1    15.92    1.03    0.51      0    15.95    1.03    0.51    N/A
       32768          4096     float    none      -1    17.57    1.87    0.93      0    17.94    1.83    0.91    N/A
       65536          8192     float    none      -1    21.89    2.99    1.50      0    21.59    3.04    1.52    N/A
      131072         16384     float    none      -1    33.69    3.89    1.95      0    34.10    3.84    1.92    N/A
      262144         32768     float    none      -1    53.41    4.91    2.45      0    47.48    5.52    2.76    N/A
      524288         65536     float    none      -1    63.67    8.23    4.12      0    67.96    7.71    3.86    N/A
     1048576        131072     float    none      -1    104.4   10.04    5.02      0    110.0    9.53    4.77    N/A
     2097152        262144     float    none      -1    188.0   11.15    5.58      0    196.2   10.69    5.35    N/A
     4194304        524288     float    none      -1    363.8   11.53    5.76      0    370.2   11.33    5.67    N/A
     8388608       1048576     float    none      -1    678.6   12.36    6.18      0    704.7   11.90    5.95    N/A
    16777216       2097152     float    none      -1   1341.1   12.51    6.25      0   1351.2   12.42    6.21    N/A
    33554432       4194304     float    none      -1   2666.9   12.58    6.29      0   2670.0   12.57    6.28    N/A
    67108864       8388608     float    none      -1   5305.4   12.65    6.32      0   5307.7   12.64    6.32    N/A
   134217728      16777216     float    none      -1    10581   12.68    6.34      0    10575   12.69    6.35    N/A
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.37424
#
# Collective test concluded: alltoallv_perf


======== hypercube_perf ========
# Collective test starting: hypercube_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288575 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  27172 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float    none      -1    10.96    0.00    0.00      0    10.64    0.00    0.00      0
           0             0     float    none      -1    10.34    0.00    0.00      0    10.19    0.00    0.00      0
          32             4     float    none      -1    16.43    0.00    0.00      0    13.15    0.00    0.00      0
          64             8     float    none      -1    16.07    0.00    0.00      0    13.45    0.00    0.00      0
         128            16     float    none      -1    15.83    0.00    0.00      0    12.81    0.00    0.00      0
         256            32     float    none      -1    15.82    0.01    0.01      0    13.18    0.01    0.01      0
         512            64     float    none      -1    15.83    0.02    0.02      0    13.26    0.02    0.02      0
        1024           128     float    none      -1    16.20    0.03    0.03      0    13.37    0.04    0.04      0
        2048           256     float    none      -1    16.37    0.06    0.06      0    13.66    0.07    0.07      0
        4096           512     float    none      -1    17.20    0.12    0.12      0    14.23    0.14    0.14      0
        8192          1024     float    none      -1    18.01    0.23    0.23      0    15.13    0.27    0.27      0
       16384          2048     float    none      -1    19.08    0.43    0.43      0    16.21    0.51    0.51      0
       32768          4096     float    none      -1    23.81    0.69    0.69      0    17.94    0.91    0.91      0
       65536          8192     float    none      -1    24.23    1.35    1.35      0    22.33    1.47    1.47      0
      131072         16384     float    none      -1    36.49    1.80    1.80      0    32.92    1.99    1.99      0
      262144         32768     float    none      -1    48.82    2.68    2.68      0    44.30    2.96    2.96      0
      524288         65536     float    none      -1    69.72    3.76    3.76      0    64.21    4.08    4.08      0
     1048576        131072     float    none      -1    110.5    4.75    4.75      0    104.2    5.03    5.03      0
     2097152        262144     float    none      -1    200.1    5.24    5.24      0    187.6    5.59    5.59      0
     4194304        524288     float    none      -1    365.2    5.74    5.74      0    351.7    5.96    5.96      0
     8388608       1048576     float    none      -1    704.3    5.96    5.96      0    693.4    6.05    6.05      0
    16777216       2097152     float    none      -1   1398.6    6.00    6.00      0   1344.2    6.24    6.24      0
    33554432       4194304     float    none      -1   2755.4    6.09    6.09      0   2670.3    6.28    6.28      0
    67108864       8388608     float    none      -1   5550.7    6.05    6.05      0   5307.5    6.32    6.32      0
   134217728      16777216     float    none      -1    11149    6.02    6.02      0    10581    6.34    6.34      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.34645
#
# Collective test concluded: hypercube_perf


======== all_reduce_bias_perf ========
# Collective test starting: all_reduce_bias_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288637 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  27244 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
SH-1: Test NCCL failure /build/rccl-tests/build/src/hipify/all_reduce_bias.cu.cpp:64 'invalid usage (run with NCCL_DEBUG=WARN for details) / '
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:639
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:870
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/all_reduce_bias.cu.cpp:114
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:1002
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:1737
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:1413
SH-2: Test NCCL failure /build/rccl-tests/build/hipify/all_reduce_bias.cu.cpp:64 'invalid usage (run with NCCL_DEBUG=WARN for details) / '
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:639
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:870
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/all_reduce_bias.cu.cpp:114
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:1002
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:1737
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:1413
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[31958,1],1]
  Exit code:    3
--------------------------------------------------------------------------

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

Comment on lines +518 to +521
WARN("%s: unsupported architecture (%s) for collective %s(%s, %s, %s, %s, Acc=%d, Pipeline=%d).",
     __func__, comm->archName,
     ncclFuncToString(agg.func), ncclAlgoToString(agg.algorithm), ncclProtoToString(agg.protocol),
     ncclDevRedOpToString(agg.opDev.op), ncclDatatypeToString(agg.datatype), (agg.acc != nullptr), agg.pipeline);
Contributor

I'm currently reviewing whether the use of task here was intentional. What result was it giving you, @ChihayaK?

Author

I'm not entirely sure. Normally this path isn't taken, so I can revert the change if the original was intentional. However, using task-> here causes the all_reduce_bias_perf test to crash with a segmentation fault. After changing task-> to agg., the test fails cleanly without segfaulting, like this:

======== all_reduce_bias_perf ========
# Collective test starting: all_reduce_bias_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288637 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  27244 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
SH-1: Test NCCL failure /build/rccl-tests/build/src/hipify/all_reduce_bias.cu.cpp:64 'invalid usage (run with NCCL_DEBUG=WARN for details) / '
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:639
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:870
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/all_reduce_bias.cu.cpp:114
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:1002
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:1737
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:1413
SH-2: Test NCCL failure /build/rccl-tests/build/hipify/all_reduce_bias.cu.cpp:64 'invalid usage (run with NCCL_DEBUG=WARN for details) / '
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:639
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:870
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/all_reduce_bias.cu.cpp:114
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:1002
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:1737
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:1413
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[31958,1],1]
  Exit code:    3
--------------------------------------------------------------------------
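
To illustrate the difference, here is a minimal sketch of the idea behind the change. It assumes the segfault came from task not pointing at a valid object when the warning is formatted, which this thread has not confirmed. Task, Aggregate, and describeUnsupported are hypothetical stand-ins, not RCCL's real types or functions.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-ins for illustration only; not RCCL's real structures.
struct Task { const char* func; };
struct Aggregate {
  const char* func;   // name of the collective
  const void* acc;    // optional accumulator/bias buffer
  int pipeline;       // pipeline setting
};

// Formats the diagnostic using only the aggregate, which is passed by
// reference and is therefore a valid object in this scope. Reading the same
// fields through a Task* that is null or dangling would be undefined
// behaviour and could segfault before the error is ever reported.
std::string describeUnsupported(const char* arch, const Aggregate& agg) {
  char buf[256];
  std::snprintf(buf, sizeof buf,
                "unsupported architecture (%s) for collective %s (Acc=%d, Pipeline=%d)",
                arch, agg.func, agg.acc != nullptr ? 1 : 0, agg.pipeline);
  return std::string(buf);
}
```

With this shape, the unsupported-architecture path can always build its message and return an error to the caller, which matches the clean "invalid usage" failure shown in the log above.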

Fix the enqueue.cc variable reference error that causes a crash when the all_reduce_bias operation is called.