Support Strix Halo gfx1151 #2075

ChihayaK · 2025-11-25T06:53:05Z

Details

Work item: #2026

What were the changes?
Addresses issue #2026.
Added gfx1151 support by enabling it along the same code path as gfx1100.
Fixed the enqueue.cc variable reference error that caused a crash when the all_reduce_bias operation was called.

Why were the changes made?
To support vLLM inference across two Strix Halo devices.

How was the outcome achieved?
No new code path was required; gfx1151 only needed to be enabled in the codebase. To verify, I ran rccl-tests built with gfx1151 support. With NCCL_DMABUF_ENABLE=1, 11 of 12 tests passed. The only failure was all_reduce_bias_perf, which hit a segmentation fault. Further investigation showed that the function meant to block this operation on unsupported architectures is broken.

vLLM works with RCCL enabled on gfx1151 when tested with Qwen3-4B across two nodes. Pipeline parallelism fails unless CUDA graphs are disabled, but tensor parallelism appears to work without any issue. I also tested the full fp16 llama3.3-70b weights in vLLM with tp=2; it runs if CUDA graphs are disabled, but reaches only 1.8 tokens/s, which is in line with roughly 250 GB/s of memory bandwidth plus the slow connection between the two nodes. This suggests that the current codebase can support inference across multiple nodes.
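The throughput figure can be sanity-checked with a rough bandwidth-bound estimate. This is only a sketch under stated assumptions, not the exact calculation behind the number above: it assumes 70e9 parameters at 2 bytes each (fp16), that every decoded token reads all weights held by a rank once, and it ignores KV cache, activations, and communication time. The function name `decode_tokens_per_s` is hypothetical.

```python
def decode_tokens_per_s(num_params, bytes_per_param, mem_bw_gb_s, tp=1):
    """Upper bound on decode speed when weight reads dominate.

    Assumption: each token requires one full pass over the weights
    held by a rank, and ranks read their shards in parallel.
    """
    weight_bytes = num_params * bytes_per_param
    bytes_per_rank = weight_bytes / tp          # tensor parallel splits the weights
    return (mem_bw_gb_s * 1e9) / bytes_per_rank


# Assumed inputs: 70e9 parameters, fp16 (2 bytes), ~250 GB/s memory bandwidth.
single_node = decode_tokens_per_s(70e9, 2, 250, tp=1)
two_way_tp = decode_tokens_per_s(70e9, 2, 250, tp=2)

print(f"tp=1 ceiling: {single_node:.2f} tokens/s")  # 1.79
print(f"tp=2 ceiling: {two_way_tp:.2f} tokens/s")   # 3.57
```

Under these assumptions the measured 1.8 tokens/s is about half of the tp=2 ceiling, which is consistent with the inter-node link taking a large share of each decode step.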

Additional Documentation:
rccl-tests results

======== all_reduce_perf ========
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288079 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  26450 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    14.22    0.00    0.00      0    13.51    0.00    0.00      0
          16             4     float     sum      -1    13.59    0.00    0.00      0    12.60    0.00    0.00      0
          32             8     float     sum      -1    12.16    0.00    0.00      0    13.41    0.00    0.00      0
          64            16     float     sum      -1    13.73    0.00    0.00      0    14.13    0.00    0.00      0
         128            32     float     sum      -1    13.46    0.01    0.01      0    13.31    0.01    0.01      0
         256            64     float     sum      -1    12.76    0.02    0.02      0    13.78    0.02    0.02      0
         512           128     float     sum      -1    13.13    0.04    0.04      0    14.19    0.04    0.04      0
        1024           256     float     sum      -1    14.01    0.07    0.07      0    15.16    0.07    0.07      0
        2048           512     float     sum      -1    19.13    0.11    0.11      0    19.29    0.11    0.11      0
        4096          1024     float     sum      -1    20.29    0.20    0.20      0    20.15    0.20    0.20      0
        8192          2048     float     sum      -1    22.02    0.37    0.37      0    22.02    0.37    0.37      0
       16384          4096     float     sum      -1    24.25    0.68    0.68      0    23.88    0.69    0.69      0
       32768          8192     float     sum      -1    31.03    1.06    1.06      0    29.58    1.11    1.11      0
       65536         16384     float     sum      -1    37.57    1.74    1.74      0    37.25    1.76    1.76      0
      131072         32768     float     sum      -1    50.61    2.59    2.59      0    59.39    2.21    2.21      0
      262144         65536     float     sum      -1    83.17    3.15    3.15      0    97.55    2.69    2.69      0
      524288        131072     float     sum      -1    101.0    5.19    5.19      0    103.7    5.06    5.06      0
     1048576        262144     float     sum      -1    207.0    5.07    5.07      0    214.4    4.89    4.89      0
     2097152        524288     float     sum      -1    364.4    5.76    5.76      0    489.4    4.29    4.29      0
     4194304       1048576     float     sum      -1    681.9    6.15    6.15      0    684.5    6.13    6.13      0
     8388608       2097152     float     sum      -1   1387.5    6.05    6.05      0   1403.8    5.98    5.98      0
    16777216       4194304     float     sum      -1   2640.1    6.35    6.35      0   2624.4    6.39    6.39      0
    33554432       8388608     float     sum      -1   5269.0    6.37    6.37      0   5246.7    6.40    6.40      0
    67108864      16777216     float     sum      -1    10532    6.37    6.37      0    10499    6.39    6.39      0
   134217728      33554432     float     sum      -1    21025    6.38    6.38      0    20913    6.42    6.42      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.49879
#
# Collective test concluded: all_reduce_perf
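
A note on reading the algbw and busbw columns: the sketch below recomputes them from the size and time columns, using the bus-bandwidth correction factors as I understand them from the nccl-tests/rccl-tests performance notes. The helper names are hypothetical. With two ranks the all_reduce factor is 1, so algbw and busbw match, while all_gather, reduce_scatter, alltoall, scatter and gather report busbw at half of algbw.

```python
# Bus-bandwidth correction factor per collective, as a function of rank count n.
BUSBW_FACTOR = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "alltoall": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
    "reduce": lambda n: 1.0,
}


def algbw_gb_s(size_bytes, time_us):
    """Algorithm bandwidth: bytes moved divided by elapsed time, in GB/s."""
    return size_bytes / (time_us * 1e-6) / 1e9


def busbw_gb_s(collective, size_bytes, time_us, n_ranks):
    """Bus bandwidth: algbw scaled by the collective's correction factor."""
    return algbw_gb_s(size_bytes, time_us) * BUSBW_FACTOR[collective](n_ranks)


# Largest out-of-place rows from the all_reduce_perf and all_gather_perf logs.
print(round(busbw_gb_s("all_reduce", 134217728, 21025, 2), 2))  # 6.38
print(round(busbw_gb_s("all_gather", 134217728, 11452, 2), 2))  # 5.86
```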


======== all_gather_perf ========
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288130 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  26523 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float    none      -1     0.19    0.00    0.00      0     0.17    0.00    0.00      0
           0             0     float    none      -1     0.15    0.00    0.00      0     0.12    0.00    0.00      0
          32             4     float    none      -1    11.70    0.00    0.00      0    10.89    0.00    0.00      0
          64             8     float    none      -1    10.51    0.01    0.00      0    10.84    0.01    0.00      0
         128            16     float    none      -1    10.30    0.01    0.01      0    10.47    0.01    0.01      0
         256            32     float    none      -1     9.90    0.03    0.01      0    10.35    0.02    0.01      0
         512            64     float    none      -1     9.87    0.05    0.03      0    10.37    0.05    0.02      0
        1024           128     float    none      -1    15.07    0.07    0.03      0    14.26    0.07    0.04      0
        2048           256     float    none      -1    14.46    0.14    0.07      0    14.71    0.14    0.07      0
        4096           512     float    none      -1    15.56    0.26    0.13      0    15.40    0.27    0.13      0
        8192          1024     float    none      -1    16.58    0.49    0.25      0    15.78    0.52    0.26      0
       16384          2048     float    none      -1    17.79    0.92    0.46      0    17.37    0.94    0.47      0
       32768          4096     float    none      -1    19.35    1.69    0.85      0    19.06    1.72    0.86      0
       65536          8192     float    none      -1    22.27    2.94    1.47      0    21.73    3.02    1.51      0
      131072         16384     float    none      -1    30.03    4.36    2.18      0    27.15    4.83    2.41      0
      262144         32768     float    none      -1    40.55    6.46    3.23      0    39.06    6.71    3.36      0
      524288         65536     float    none      -1    62.06    8.45    4.22      0    61.03    8.59    4.30      0
     1048576        131072     float    none      -1    118.3    8.86    4.43      0    103.0   10.18    5.09      0
     2097152        262144     float    none      -1    200.1   10.48    5.24      0    194.2   10.80    5.40      0
     4194304        524288     float    none      -1    382.1   10.98    5.49      0    372.4   11.26    5.63      0
     8388608       1048576     float    none      -1    720.3   11.65    5.82      0    702.7   11.94    5.97      0
    16777216       2097152     float    none      -1   1430.2   11.73    5.87      0   1384.9   12.11    6.06      0
    33554432       4194304     float    none      -1   2875.2   11.67    5.84      0   2777.6   12.08    6.04      0
    67108864       8388608     float    none      -1   5733.3   11.71    5.85      0   5523.7   12.15    6.07      0
   134217728      16777216     float    none      -1    11452   11.72    5.86      0    11028   12.17    6.09      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.34279
#
# Collective test concluded: all_gather_perf


======== broadcast_perf ========
# Collective test starting: broadcast_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288170 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  26597 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float    none       0     9.26    0.00    0.00      0     7.87    0.00    0.00      0
          16             4     float    none       0     7.65    0.00    0.00      0     7.36    0.00    0.00      0
          32             8     float    none       0     6.88    0.00    0.00      0     7.25    0.00    0.00      0
          64            16     float    none       0     6.88    0.01    0.01      0     8.04    0.01    0.01      0
         128            32     float    none       0     6.89    0.02    0.02      0     7.27    0.02    0.02      0
         256            64     float    none       0     7.10    0.04    0.04      0     7.22    0.04    0.04      0
         512           128     float    none       0     8.88    0.06    0.06      0     8.92    0.06    0.06      0
        1024           256     float    none       0     8.58    0.12    0.12      0     8.84    0.12    0.12      0
        2048           512     float    none       0     8.63    0.24    0.24      0     8.92    0.23    0.23      0
        4096          1024     float    none       0     8.74    0.47    0.47      0     8.65    0.47    0.47      0
        8192          2048     float    none       0     8.96    0.91    0.91      0    19.04    0.43    0.43      0
       16384          4096     float    none       0    10.04    1.63    1.63      0    10.23    1.60    1.60      0
       32768          8192     float    none       0    10.65    3.08    3.08      0    10.37    3.16    3.16      0
       65536         16384     float    none       0    13.21    4.96    4.96      0    13.11    5.00    5.00      0
      131072         32768     float    none       0    24.34    5.39    5.39      0    24.18    5.42    5.42      0
      262144         65536     float    none       0    43.00    6.10    6.10      0    42.86    6.12    6.12      0
      524288        131072     float    none       0    84.62    6.20    6.20      0    93.74    5.59    5.59      0
     1048576        262144     float    none       0    151.9    6.90    6.90      0    149.8    7.00    7.00      0
     2097152        524288     float    none       0    295.7    7.09    7.09      0    313.8    6.68    6.68      0
     4194304       1048576     float    none       0    603.0    6.96    6.96      0    588.2    7.13    7.13      0
     8388608       2097152     float    none       0   1189.1    7.05    7.05      0   1170.7    7.17    7.17      0
    16777216       4194304     float    none       0   2365.1    7.09    7.09      0   2350.3    7.14    7.14      0
    33554432       8388608     float    none       0   4691.4    7.15    7.15      0   4692.0    7.15    7.15      0
    67108864      16777216     float    none       0   9407.6    7.13    7.13      0   9352.9    7.18    7.18      0
   134217728      33554432     float    none       0    18781    7.15    7.15      0    18756    7.16    7.16      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.41227
#
# Collective test concluded: broadcast_perf


======== reduce_scatter_perf ========
# Collective test starting: reduce_scatter_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288220 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  26668 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float     sum      -1     0.15    0.00    0.00      0     0.10    0.00    0.00      0
           0             0     float     sum      -1     0.08    0.00    0.00      0     0.10    0.00    0.00      0
          32             4     float     sum      -1    11.47    0.00    0.00      0    22.10    0.00    0.00      0
          64             8     float     sum      -1    10.51    0.01    0.00      0    10.51    0.01    0.00      0
         128            16     float     sum      -1    10.24    0.01    0.01      0    10.21    0.01    0.01      0
         256            32     float     sum      -1    10.16    0.03    0.01      0    10.19    0.03    0.01      0
         512            64     float     sum      -1     9.81    0.05    0.03      0    10.19    0.05    0.03      0
        1024           128     float     sum      -1    13.89    0.07    0.04      0    14.91    0.07    0.03      0
        2048           256     float     sum      -1    14.20    0.14    0.07      0    14.38    0.14    0.07      0
        4096           512     float     sum      -1    14.90    0.27    0.14      0    14.83    0.28    0.14      0
        8192          1024     float     sum      -1    16.33    0.50    0.25      0    16.12    0.51    0.25      0
       16384          2048     float     sum      -1    17.29    0.95    0.47      0    28.28    0.58    0.29      0
       32768          4096     float     sum      -1    18.63    1.76    0.88      0    18.91    1.73    0.87      0
       65536          8192     float     sum      -1    21.70    3.02    1.51      0    21.60    3.03    1.52      0
      131072         16384     float     sum      -1    28.33    4.63    2.31      0    30.26    4.33    2.17      0
      262144         32768     float     sum      -1    51.46    5.09    2.55      0    40.67    6.45    3.22      0
      524288         65536     float     sum      -1    61.87    8.47    4.24      0    62.12    8.44    4.22      0
     1048576        131072     float     sum      -1    107.8    9.72    4.86      0    107.8    9.72    4.86      0
     2097152        262144     float     sum      -1    199.5   10.51    5.26      0    200.6   10.45    5.23      0
     4194304        524288     float     sum      -1    378.7   11.08    5.54      0    377.6   11.11    5.55      0
     8388608       1048576     float     sum      -1    734.8   11.42    5.71      0    737.7   11.37    5.69      0
    16777216       2097152     float     sum      -1   1453.6   11.54    5.77      0   1448.0   11.59    5.79      0
    33554432       4194304     float     sum      -1   2879.8   11.65    5.83      0   2881.7   11.64    5.82      0
    67108864       8388608     float     sum      -1   5766.4   11.64    5.82      0   5756.0   11.66    5.83      0
   134217728      16777216     float     sum      -1    11509   11.66    5.83      0    11486   11.69    5.84      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.29119
#
# Collective test concluded: reduce_scatter_perf


======== reduce_perf ========
# Collective test starting: reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288260 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  26739 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum       0    33.99    0.00    0.00      0     9.99    0.00    0.00      0
          16             4     float     sum       0    10.28    0.00    0.00      0     9.28    0.00    0.00      0
          32             8     float     sum       0     9.71    0.00    0.00      0     9.73    0.00    0.00      0
          64            16     float     sum       0     9.73    0.01    0.01      0     9.96    0.01    0.01      0
         128            32     float     sum       0     9.32    0.01    0.01      0     9.86    0.01    0.01      0
         256            64     float     sum       0    10.05    0.03    0.03      0     9.83    0.03    0.03      0
         512           128     float     sum       0    11.35    0.05    0.05      0    11.11    0.05    0.05      0
        1024           256     float     sum       0    16.65    0.06    0.06      0    11.57    0.09    0.09      0
        2048           512     float     sum       0    17.25    0.12    0.12      0    11.55    0.18    0.18      0
        4096          1024     float     sum       0    11.86    0.35    0.35      0    11.61    0.35    0.35      0
        8192          2048     float     sum       0    12.27    0.67    0.67      0    12.44    0.66    0.66      0
       16384          4096     float     sum       0    12.23    1.34    1.34      0    12.99    1.26    1.26      0
       32768          8192     float     sum       0    13.31    2.46    2.46      0    13.60    2.41    2.41      0
       65536         16384     float     sum       0    16.16    4.06    4.06      0    16.88    3.88    3.88      0
      131072         32768     float     sum       0    27.38    4.79    4.79      0    27.69    4.73    4.73      0
      262144         65536     float     sum       0    45.20    5.80    5.80      0    50.31    5.21    5.21      0
      524288        131072     float     sum       0    90.78    5.78    5.78      0    84.14    6.23    6.23      0
     1048576        262144     float     sum       0    162.0    6.47    6.47      0    153.6    6.83    6.83      0
     2097152        524288     float     sum       0    302.9    6.92    6.92      0    306.2    6.85    6.85      0
     4194304       1048576     float     sum       0    593.9    7.06    7.06      0    587.2    7.14    7.14      0
     8388608       2097152     float     sum       0   1185.3    7.08    7.08      0   1186.6    7.07    7.07      0
    16777216       4194304     float     sum       0   2366.1    7.09    7.09      0   2355.7    7.12    7.12      0
    33554432       8388608     float     sum       0   4717.9    7.11    7.11      0   4727.0    7.10    7.10      0
    67108864      16777216     float     sum       0   9406.1    7.13    7.13      0   9421.2    7.12    7.12      0
   134217728      33554432     float     sum       0    18746    7.16    7.16      0    18733    7.16    7.16      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.26086
#
# Collective test concluded: reduce_perf


======== alltoall_perf ========
# Collective test starting: alltoall_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288351 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  26810 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float    none      -1     0.06    0.00    0.00      0     0.04    0.00    0.00    N/A
           0             0     float    none      -1     0.04    0.00    0.00      0     0.06    0.00    0.00    N/A
          32             4     float    none      -1    12.95    0.00    0.00      0    13.11    0.00    0.00    N/A
          64             8     float    none      -1    12.66    0.01    0.00      0    13.04    0.00    0.00    N/A
         128            16     float    none      -1    12.35    0.01    0.01      0    13.25    0.01    0.00    N/A
         256            32     float    none      -1    12.34    0.02    0.01      0    13.51    0.02    0.01    N/A
         512            64     float    none      -1    12.46    0.04    0.02      0    13.22    0.04    0.02    N/A
        1024           128     float    none      -1    12.75    0.08    0.04      0    13.34    0.08    0.04    N/A
        2048           256     float    none      -1    12.93    0.16    0.08      0    13.53    0.15    0.08    N/A
        4096           512     float    none      -1    13.38    0.31    0.15      0    14.27    0.29    0.14    N/A
        8192          1024     float    none      -1    14.65    0.56    0.28      0    14.66    0.56    0.28    N/A
       16384          2048     float    none      -1    15.86    1.03    0.52      0    15.98    1.03    0.51    N/A
       32768          4096     float    none      -1    17.48    1.87    0.94      0    17.81    1.84    0.92    N/A
       65536          8192     float    none      -1    21.89    2.99    1.50      0    21.73    3.02    1.51    N/A
      131072         16384     float    none      -1    33.02    3.97    1.98      0    35.10    3.73    1.87    N/A
      262144         32768     float    none      -1    42.91    6.11    3.05      0    46.84    5.60    2.80    N/A
      524288         65536     float    none      -1    63.30    8.28    4.14      0    68.98    7.60    3.80    N/A
     1048576        131072     float    none      -1    113.7    9.23    4.61      0    110.0    9.53    4.76    N/A
     2097152        262144     float    none      -1    191.3   10.96    5.48      0    194.1   10.80    5.40    N/A
     4194304        524288     float    none      -1    356.0   11.78    5.89      0    366.8   11.44    5.72    N/A
     8388608       1048576     float    none      -1    688.3   12.19    6.09      0    689.3   12.17    6.09    N/A
    16777216       2097152     float    none      -1   1350.0   12.43    6.21      0   1348.0   12.45    6.22    N/A
    33554432       4194304     float    none      -1   2665.5   12.59    6.29      0   2675.8   12.54    6.27    N/A
    67108864       8388608     float    none      -1   5305.2   12.65    6.32      0   5309.3   12.64    6.32    N/A
   134217728      16777216     float    none      -1    10580   12.69    6.34      0    10589   12.67    6.34    N/A
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.38156
#
# Collective test concluded: alltoall_perf


======== scatter_perf ========
# Collective test starting: scatter_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288392 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  26882 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float    none       0     0.08    0.00    0.00      0     0.04    0.00    0.00      0
           0             0     float    none       0     0.06    0.00    0.00      0     0.05    0.00    0.00      0
          32             4     float    none       0     9.52    0.00    0.00      0     9.05    0.00    0.00      0
          64             8     float    none       0     8.01    0.01    0.00      0     8.84    0.01    0.00      0
         128            16     float    none       0     7.92    0.02    0.01      0     9.04    0.01    0.01      0
         256            32     float    none       0     9.04    0.03    0.01      0     8.83    0.03    0.01      0
         512            64     float    none       0     7.99    0.06    0.03      0     8.68    0.06    0.03      0
        1024           128     float    none       0     8.10    0.13    0.06      0     8.84    0.12    0.06      0
        2048           256     float    none       0     7.80    0.26    0.13      0     9.58    0.21    0.11      0
        4096           512     float    none       0     8.10    0.51    0.25      0     8.98    0.46    0.23      0
        8192          1024     float    none       0     8.39    0.98    0.49      0     9.45    0.87    0.43      0
       16384          2048     float    none       0     9.47    1.73    0.86      0     9.52    1.72    0.86      0
       32768          4096     float    none       0    11.51    2.85    1.42      0    11.20    2.93    1.46      0
       65536          8192     float    none       0    14.06    4.66    2.33      0    14.03    4.67    2.34      0
      131072         16384     float    none       0    22.19    5.91    2.95      0    22.18    5.91    2.95      0
      262144         32768     float    none       0    31.97    8.20    4.10      0    32.31    8.11    4.06      0
      524288         65536     float    none       0    54.98    9.54    4.77      0    54.39    9.64    4.82      0
     1048576        131072     float    none       0    94.93   11.05    5.52      0    93.75   11.19    5.59      0
     2097152        262144     float    none       0    171.6   12.22    6.11      0    178.5   11.75    5.87      0
     4194304        524288     float    none       0    330.9   12.68    6.34      0    326.3   12.85    6.43      0
     8388608       1048576     float    none       0    653.7   12.83    6.42      0    640.5   13.10    6.55      0
    16777216       2097152     float    none       0   1259.2   13.32    6.66      0   1263.0   13.28    6.64      0
    33554432       4194304     float    none       0   2505.8   13.39    6.70      0   2509.9   13.37    6.68      0
    67108864       8388608     float    none       0   4994.9   13.44    6.72      0   4998.0   13.43    6.71      0
   134217728      16777216     float    none       0   9967.1   13.47    6.73      0   9980.8   13.45    6.72      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.74422
#
# Collective test concluded: scatter_perf


======== gather_perf ========
# Collective test starting: gather_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288442 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  26955 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float    none       0     0.07    0.00    0.00      0     0.06    0.00    0.00      0
           0             0     float    none       0     0.04    0.00    0.00      0     0.04    0.00    0.00      0
          32             4     float    none       0    41.64    0.00    0.00      0    16.50    0.00    0.00      0
          64             8     float    none       0    10.46    0.01    0.00      0    11.43    0.01    0.00      0
         128            16     float    none       0    10.27    0.01    0.01      0     9.59    0.01    0.01      0
         256            32     float    none       0     9.33    0.03    0.01      0    16.87    0.02    0.01      0
         512            64     float    none       0     9.34    0.05    0.03      0    10.16    0.05    0.03      0
        1024           128     float    none       0    10.58    0.10    0.05      0    11.58    0.09    0.04      0
        2048           256     float    none       0    12.44    0.16    0.08      0    11.80    0.17    0.09      0
        4096           512     float    none       0    11.64    0.35    0.18      0    11.82    0.35    0.17      0
        8192          1024     float    none       0    12.11    0.68    0.34      0    11.74    0.70    0.35      0
       16384          2048     float    none       0    12.05    1.36    0.68      0    11.76    1.39    0.70      0
       32768          4096     float    none       0    13.05    2.51    1.26      0    13.19    2.48    1.24      0
       65536          8192     float    none       0    16.33    4.01    2.01      0    16.27    4.03    2.01      0
      131072         16384     float    none       0    27.27    4.81    2.40      0    24.62    5.32    2.66      0
      262144         32768     float    none       0    34.68    7.56    3.78      0    34.25    7.65    3.83      0
      524288         65536     float    none       0    61.76    8.49    4.24      0    64.35    8.15    4.07      0
     1048576        131072     float    none       0    94.78   11.06    5.53      0    95.52   10.98    5.49      0
     2097152        262144     float    none       0    191.1   10.97    5.49      0    175.2   11.97    5.99      0
     4194304        524288     float    none       0    339.1   12.37    6.18      0    340.6   12.32    6.16      0
     8388608       1048576     float    none       0    648.7   12.93    6.47      0    643.5   13.04    6.52      0
    16777216       2097152     float    none       0   1260.1   13.31    6.66      0   1270.5   13.20    6.60      0
    33554432       4194304     float    none       0   2513.5   13.35    6.67      0   2511.8   13.36    6.68      0
    67108864       8388608     float    none       0   5000.6   13.42    6.71      0   5002.9   13.41    6.71      0
   134217728      16777216     float    none       0   9988.5   13.44    6.72      0   9983.1   13.44    6.72      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.63133
#
# Collective test concluded: gather_perf


======== sendrecv_perf ========
# Collective test starting: sendrecv_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288483 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  27028 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    17.36    0.00    0.00      0    16.22    0.00    0.00    N/A
          16             4     float     sum      -1    15.34    0.00    0.00      0    15.64    0.00    0.00    N/A
          32             8     float     sum      -1    15.01    0.00    0.00      0    12.71    0.00    0.00    N/A
          64            16     float     sum      -1    12.67    0.01    0.01      0    12.74    0.01    0.01    N/A
         128            32     float     sum      -1    12.43    0.01    0.01      0    12.61    0.01    0.01    N/A
         256            64     float     sum      -1    13.07    0.02    0.02      0    12.74    0.02    0.02    N/A
         512           128     float     sum      -1    12.77    0.04    0.04      0    13.19    0.04    0.04    N/A
        1024           256     float     sum      -1    13.85    0.07    0.07      0    13.05    0.08    0.08    N/A
        2048           512     float     sum      -1    14.13    0.14    0.14      0    13.36    0.15    0.15    N/A
        4096          1024     float     sum      -1    14.79    0.28    0.28      0    13.96    0.29    0.29    N/A
        8192          2048     float     sum      -1    15.13    0.54    0.54      0    16.02    0.51    0.51    N/A
       16384          4096     float     sum      -1    17.54    0.93    0.93      0    19.29    0.85    0.85    N/A
       32768          8192     float     sum      -1    22.08    1.48    1.48      0    21.19    1.55    1.55    N/A
       65536         16384     float     sum      -1    32.57    2.01    2.01      0    34.12    1.92    1.92    N/A
      131072         32768     float     sum      -1    54.24    2.42    2.42      0    46.41    2.82    2.82    N/A
      262144         65536     float     sum      -1    62.95    4.16    4.16      0    66.57    3.94    3.94    N/A
      524288        131072     float     sum      -1    103.5    5.06    5.06      0    118.7    4.42    4.42    N/A
     1048576        262144     float     sum      -1    187.9    5.58    5.58      0    194.0    5.41    5.41    N/A
     2097152        524288     float     sum      -1    351.9    5.96    5.96      0    362.2    5.79    5.79    N/A
     4194304       1048576     float     sum      -1    688.3    6.09    6.09      0    690.9    6.07    6.07    N/A
     8388608       2097152     float     sum      -1   1349.7    6.21    6.21      0   1355.0    6.19    6.19    N/A
    16777216       4194304     float     sum      -1   2665.0    6.30    6.30      0   2671.1    6.28    6.28    N/A
    33554432       8388608     float     sum      -1   5303.8    6.33    6.33      0   5307.6    6.32    6.32    N/A
    67108864      16777216     float     sum      -1    10578    6.34    6.34      0    10585    6.34    6.34    N/A
   134217728      33554432     float     sum      -1    21120    6.36    6.36      0    21131    6.35    6.35    N/A
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.63449
#
# Collective test concluded: sendrecv_perf


======== alltoallv_perf ========
# Collective test starting: alltoallv_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288534 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  27100 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float    none      -1     0.06    0.00    0.00      0     0.04    0.00    0.00    N/A
           0             0     float    none      -1     0.04    0.00    0.00      0     0.06    0.00    0.00    N/A
          32             4     float    none      -1    13.08    0.00    0.00      0    13.18    0.00    0.00    N/A
          64             8     float    none      -1    12.82    0.00    0.00      0    13.04    0.00    0.00    N/A
         128            16     float    none      -1    12.33    0.01    0.01      0    14.19    0.01    0.00    N/A
         256            32     float    none      -1    12.40    0.02    0.01      0    13.01    0.02    0.01    N/A
         512            64     float    none      -1    12.64    0.04    0.02      0    12.99    0.04    0.02    N/A
        1024           128     float    none      -1    12.76    0.08    0.04      0    13.17    0.08    0.04    N/A
        2048           256     float    none      -1    24.03    0.09    0.04      0    13.89    0.15    0.07    N/A
        4096           512     float    none      -1    13.58    0.30    0.15      0    14.27    0.29    0.14    N/A
        8192          1024     float    none      -1    14.73    0.56    0.28      0    14.82    0.55    0.28    N/A
       16384          2048     float    none      -1    15.92    1.03    0.51      0    15.95    1.03    0.51    N/A
       32768          4096     float    none      -1    17.57    1.87    0.93      0    17.94    1.83    0.91    N/A
       65536          8192     float    none      -1    21.89    2.99    1.50      0    21.59    3.04    1.52    N/A
      131072         16384     float    none      -1    33.69    3.89    1.95      0    34.10    3.84    1.92    N/A
      262144         32768     float    none      -1    53.41    4.91    2.45      0    47.48    5.52    2.76    N/A
      524288         65536     float    none      -1    63.67    8.23    4.12      0    67.96    7.71    3.86    N/A
     1048576        131072     float    none      -1    104.4   10.04    5.02      0    110.0    9.53    4.77    N/A
     2097152        262144     float    none      -1    188.0   11.15    5.58      0    196.2   10.69    5.35    N/A
     4194304        524288     float    none      -1    363.8   11.53    5.76      0    370.2   11.33    5.67    N/A
     8388608       1048576     float    none      -1    678.6   12.36    6.18      0    704.7   11.90    5.95    N/A
    16777216       2097152     float    none      -1   1341.1   12.51    6.25      0   1351.2   12.42    6.21    N/A
    33554432       4194304     float    none      -1   2666.9   12.58    6.29      0   2670.0   12.57    6.28    N/A
    67108864       8388608     float    none      -1   5305.4   12.65    6.32      0   5307.7   12.64    6.32    N/A
   134217728      16777216     float    none      -1    10581   12.68    6.34      0    10575   12.69    6.35    N/A
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.37424
#
# Collective test concluded: alltoallv_perf


======== hypercube_perf ========
# Collective test starting: hypercube_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288575 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  27172 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float    none      -1    10.96    0.00    0.00      0    10.64    0.00    0.00      0
           0             0     float    none      -1    10.34    0.00    0.00      0    10.19    0.00    0.00      0
          32             4     float    none      -1    16.43    0.00    0.00      0    13.15    0.00    0.00      0
          64             8     float    none      -1    16.07    0.00    0.00      0    13.45    0.00    0.00      0
         128            16     float    none      -1    15.83    0.00    0.00      0    12.81    0.00    0.00      0
         256            32     float    none      -1    15.82    0.01    0.01      0    13.18    0.01    0.01      0
         512            64     float    none      -1    15.83    0.02    0.02      0    13.26    0.02    0.02      0
        1024           128     float    none      -1    16.20    0.03    0.03      0    13.37    0.04    0.04      0
        2048           256     float    none      -1    16.37    0.06    0.06      0    13.66    0.07    0.07      0
        4096           512     float    none      -1    17.20    0.12    0.12      0    14.23    0.14    0.14      0
        8192          1024     float    none      -1    18.01    0.23    0.23      0    15.13    0.27    0.27      0
       16384          2048     float    none      -1    19.08    0.43    0.43      0    16.21    0.51    0.51      0
       32768          4096     float    none      -1    23.81    0.69    0.69      0    17.94    0.91    0.91      0
       65536          8192     float    none      -1    24.23    1.35    1.35      0    22.33    1.47    1.47      0
      131072         16384     float    none      -1    36.49    1.80    1.80      0    32.92    1.99    1.99      0
      262144         32768     float    none      -1    48.82    2.68    2.68      0    44.30    2.96    2.96      0
      524288         65536     float    none      -1    69.72    3.76    3.76      0    64.21    4.08    4.08      0
     1048576        131072     float    none      -1    110.5    4.75    4.75      0    104.2    5.03    5.03      0
     2097152        262144     float    none      -1    200.1    5.24    5.24      0    187.6    5.59    5.59      0
     4194304        524288     float    none      -1    365.2    5.74    5.74      0    351.7    5.96    5.96      0
     8388608       1048576     float    none      -1    704.3    5.96    5.96      0    693.4    6.05    6.05      0
    16777216       2097152     float    none      -1   1398.6    6.00    6.00      0   1344.2    6.24    6.24      0
    33554432       4194304     float    none      -1   2755.4    6.09    6.09      0   2670.3    6.28    6.28      0
    67108864       8388608     float    none      -1   5550.7    6.05    6.05      0   5307.5    6.32    6.32      0
   134217728      16777216     float    none      -1    11149    6.02    6.02      0    10581    6.34    6.34      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.34645
#
# Collective test concluded: hypercube_perf


======== all_reduce_bias_perf ========
# Collective test starting: all_reduce_bias_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288637 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  27244 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
SH-1: Test NCCL failure /build/rccl-tests/build/src/hipify/all_reduce_bias.cu.cpp:64 'invalid usage (run with NCCL_DEBUG=WARN for details) / '
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:639
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:870
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/all_reduce_bias.cu.cpp:114
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:1002
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:1737
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:1413
SH-2: Test NCCL failure /build/rccl-tests/build/hipify/all_reduce_bias.cu.cpp:64 'invalid usage (run with NCCL_DEBUG=WARN for details) / '
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:639
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:870
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/all_reduce_bias.cu.cpp:114
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:1002
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:1737
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:1413
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[31958,1],1]
  Exit code:    3
--------------------------------------------------------------------------

Approval Checklist

Do not approve until these items are satisfied.

Verify the CHANGELOG has been updated, if
- there are any NCCL API version changes,
- any changes impact library users, and/or
- any changes impact any other ROCm library.

corey-derochie-amd · 2025-11-28T20:17:10Z

src/enqueue.cc

+        WARN("%s: unsupported architecture (%s) for collective %s(%s, %s, %s, %s, Acc=%d, Pipeline=%d).",
+          __func__, comm->archName,
+          ncclFuncToString(agg.func), ncclAlgoToString(agg.algorithm), ncclProtoToString(agg.protocol),
+          ncclDevRedOpToString(agg.opDev.op), ncclDatatypeToString(agg.datatype), (agg.acc != nullptr), agg.pipeline);


I'm currently reviewing whether the use of task here was intentional. What result was this giving you, @ChihayaK?

ChihayaK · 2025-11-29

I'm not entirely sure. This path is not normally taken, so I can revert the change if the use of task was intentional. However, using task here causes the all_reduce_bias_perf test to crash. After changing the task-> references to agg, the test fails cleanly instead of segfaulting, like this:

======== all_reduce_bias_perf ========
# Collective test starting: all_reduce_bias_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
#  Rank  0 Group  0 Pid 288637 on SH-1 device  0 [0000:f6:00] AMD Radeon Graphics
#  Rank  1 Group  0 Pid  27244 on SH-2 device  0 [0000:f6:00] AMD Radeon Graphics
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
SH-1: Test NCCL failure /build/rccl-tests/build/src/hipify/all_reduce_bias.cu.cpp:64 'invalid usage (run with NCCL_DEBUG=WARN for details) / '
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:639
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:870
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/all_reduce_bias.cu.cpp:114
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:1002
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:1737
 .. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:1413
SH-2: Test NCCL failure /build/rccl-tests/build/hipify/all_reduce_bias.cu.cpp:64 'invalid usage (run with NCCL_DEBUG=WARN for details) / '
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:639
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:870
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/all_reduce_bias.cu.cpp:114
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:1002
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:1737
 .. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:1413
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[31958,1],1]
  Exit code:    3
--------------------------------------------------------------------------
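
The thread does not establish why the task-> version crashed. Purely as an illustration of the general hazard, and not as a description of RCCL's actual code, the sketch below builds a diagnostic from an aggregate that is in scope rather than through a pointer that might be null or stale on an error path. Every type and name in it (Task, Agg, describe_unsupported) is a hypothetical stand-in.

```cpp
#include <string>

// Hypothetical stand-ins; these are NOT RCCL's real structures.
struct Task { std::string func; };
struct Agg  { std::string func; };

// Builds the diagnostic from the aggregate passed by reference.
// Reading task->func instead would be undefined behavior whenever the
// pointer is null or dangling at this point (an assumption about the
// failure mode, not something confirmed in the review thread).
std::string describe_unsupported(const std::string& arch, const Agg& agg,
                                 const Task* task) {
  (void)task;  // deliberately unused: may not be valid on this path
  return "unsupported architecture (" + arch + ") for collective " + agg.func;
}
```

Called with a null task pointer, this still returns a complete message, which mirrors the behavior reported above: the test ends with an "invalid usage" failure rather than a segfault.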

corey-derochie-amd reviewed Nov 28, 2025

Support gfx1151 by enable it with similar path as gfx1100.

0a0f4e8

Fix the enqueue.cc variable refence error that causes crash when operation all_reduce_bias is called.

ChihayaK force-pushed the gfx1151-support branch from 6d7f3c4 to 0a0f4e8 on December 1, 2025, 01:09
