Support Strix Halo gfx1151 #2075
WARN("%s: unsupported architecture (%s) for collective %s(%s, %s, %s, %s, Acc=%d, Pipeline=%d).",
     __func__, comm->archName,
     ncclFuncToString(agg.func), ncclAlgoToString(agg.algorithm), ncclProtoToString(agg.protocol),
     ncclDevRedOpToString(agg.opDev.op), ncclDatatypeToString(agg.datatype), (agg.acc != nullptr), agg.pipeline);
I'm currently reviewing whether the use of task here was intentional or not. What result was this giving you, @ChihayaK ?
I'm not entirely sure. Normally this path isn't taken, so I can revert the change if the use of task was intentional. However, using task-> here causes the all_reduce_bias_perf test to crash. After changing task-> to agg., the test fails cleanly instead of segfaulting, like this:
======== all_reduce_bias_perf ========
# Collective test starting: all_reduce_bias_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:6405c76+
# Using devices
# Rank 0 Group 0 Pid 288637 on SH-1 device 0 [0000:f6:00] AMD Radeon Graphics
# Rank 1 Group 0 Pid 27244 on SH-2 device 0 [0000:f6:00] AMD Radeon Graphics
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
SH-1: Test NCCL failure /build/rccl-tests/build/src/hipify/all_reduce_bias.cu.cpp:64 'invalid usage (run with NCCL_DEBUG=WARN for details) / '
.. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:639
.. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:870
.. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/all_reduce_bias.cu.cpp:114
.. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:1002
.. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:1737
.. SH-1 pid 288637: Test failure /build/rccl-tests/build/src/hipify/common.cu.cpp:1413
SH-2: Test NCCL failure /build/rccl-tests/build/hipify/all_reduce_bias.cu.cpp:64 'invalid usage (run with NCCL_DEBUG=WARN for details) / '
.. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:639
.. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:870
.. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/all_reduce_bias.cu.cpp:114
.. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:1002
.. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:1737
.. SH-2 pid 27244: Test failure /build/rccl-tests/build/hipify/common.cu.cpp:1413
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[31958,1],1]
Exit code: 3
--------------------------------------------------------------------------
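The difference between a segfault and a clean failure discussed in this thread can be illustrated with a minimal sketch. It assumes, purely for illustration, that the per-task pointer is not safe to dereference on the rejection path; the thread does not confirm the exact cause. `TaskSketch`, `AggSketch`, `Result`, and `rejectUnsupported` are hypothetical stand-ins, not RCCL types or functions.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-ins for illustration only; these are NOT the RCCL types.
enum Result { kSuccess = 0, kInvalidUsage = 1 };

struct TaskSketch { const char* funcName; };  // per-task descriptor, assumed possibly absent here
struct AggSketch {                            // aggregated descriptor, always available here
  const char* funcName;
  const char* algoName;
  const char* protoName;
};

// Builds the warning from the aggregated descriptor only and reports an error
// code, so the caller sees an "invalid usage" failure instead of a crash.
Result rejectUnsupported(const char* archName, const AggSketch& agg,
                         const TaskSketch* task, std::string* warning) {
  (void)task;  // deliberately never dereferenced: it may be null on this path
  char buf[256];
  std::snprintf(buf, sizeof(buf),
                "unsupported architecture (%s) for collective %s(%s, %s)",
                archName, agg.funcName, agg.algoName, agg.protoName);
  *warning = buf;
  return kInvalidUsage;
}
```

Calling `rejectUnsupported("gfx1151", agg, nullptr, &msg)` returns `kInvalidUsage` and fills `msg`, which mirrors the "invalid usage" failure in the log above rather than a segmentation fault.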
Fix the enqueue.cc variable reference error that causes a crash when the all_reduce_bias operation is called.
6d7f3c4 to 0a0f4e8
Details
Work item: #2026
What were the changes?
Addresses issue #2026.
Added gfx1151 support by enabling it along a path similar to the one used for gfx1100.
Fixed the enqueue.cc variable reference error that caused a crash when the all_reduce_bias operation was called.
Why were the changes made?
To support vLLM inference across two Strix Halo devices.
How was the outcome achieved?
Enabling gfx1151 in the codebase was sufficient; no other changes were needed for it to work. To verify this, I ran rccl-tests built with gfx1151 support. With NCCL_DMABUF_ENABLE=1, 11/12 tests passed. The only test that did not pass was all_reduce_bias_perf, which caused a segmentation fault. Further investigation showed that the check meant to reject collectives on unsupported architectures was itself broken.

vLLM works with RCCL on gfx1151 when tested with Qwen3-4B across two nodes. Pipeline parallelism fails unless CUDA graphs are disabled, but tensor parallelism appears to work without issue. I also tested the full fp16 llama3.3-70b weights in vLLM with tp=2. It runs if CUDA graphs are disabled, but reaches only 1.8 tokens/s, which is in line with what roughly 250 GB/s of memory bandwidth and a slow connection between the two nodes would allow. This suggests that the current codebase can be used to support inference across multiple nodes.
Additional Documentation:
Rccl-test results
Approval Checklist
Do not approve until these items are satisfied.