[AMD][CI] Update DeepEP branch #38396
Code Review
This pull request reorganizes test execution in the Buildkite configuration and updates the ROCm Dockerfile by bumping the DeepEP branch and switching the ROCSHMEM build process to a script-based approach. Feedback was provided regarding the hardcoding of GPU targets in the Dockerfile, recommending the use of the newly introduced build argument for improved configurability.
The test results that you have attached are showing failures.
I'm not sure what you mean, the test command The dbo failure is due to a hidden size of 2048 not being supported. I asked the DeepEP team to update that -- looks like they did: https://github.com/ROCm/DeepEP/blob/5a8d55339794c3e01cc61f8c078bf6b8bfb2383e/csrc/kernels/launch.cuh#L132
ARG DEEPEP_REPO="https://github.com/ROCm/DeepEP.git"
ARG DEEPEP_NIC="cx7"
ARG DEEPEP_ROCM_ARCH="gfx942;gfx950"
I believe there were plans to return MI250s to the CI. Is the test not meant to run on those?
No, ROCm DeepEP currently supports only gfx942 and gfx950: https://github.com/ROCm/DeepEP/blob/f5c5cae91892640adb52cf27907abbf0780ba566/setup.py#L104-L106
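To make the reply above concrete, here is a minimal sketch of the check it implies: a GPU can run the DeepEP test only if its gfx target appears in the semicolon-separated `DEEPEP_ROCM_ARCH` value. The helper names and the product-to-arch table are illustrative assumptions, not code from vLLM or DeepEP.

```python
# Hypothetical helpers for illustration; not part of vLLM or ROCm DeepEP.

# Product-to-target mapping for a few AMD Instinct GPUs (assumed table).
GPU_ARCH = {
    "MI250": "gfx90a",
    "MI300X": "gfx942",
    "MI325X": "gfx942",
    "MI355X": "gfx950",
}


def parse_arch_list(value: str) -> set[str]:
    """Split a semicolon-separated arch string such as 'gfx942;gfx950'."""
    return {arch.strip() for arch in value.split(";") if arch.strip()}


def can_run_deepep(gpu: str, deepep_rocm_arch: str = "gfx942;gfx950") -> bool:
    """Return True if the GPU's gfx target is among the compiled targets."""
    return GPU_ARCH[gpu] in parse_arch_list(deepep_rocm_arch)


if __name__ == "__main__":
    for gpu in GPU_ARCH:
        status = "supported" if can_run_deepep(gpu) else "not supported"
        print(f"{gpu} ({GPU_ARCH[gpu]}): {status}")
```

Under these assumptions MI250 (gfx90a) falls outside the built targets, which is why the test would not run there even if those agents returned to CI.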
Yes, this PR partially addresses #37709, in particular the test
python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput
You can see this test failing in the nightly CI: https://buildkite.com/vllm/amd-ci/builds/7612/steps/canvas?sid=019d8a93-8a56-40f5-9d3b-d4a45284fa79&tab=output#L2327-L2338
Updating the DeepEP version to one with a setup.py that correctly builds the fat binary makes the test succeed and yields the output
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Force-pushed from 6f6f686 to 7ec6924.
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
Purpose
Update the DeepEP branch to a version that correctly ahead-of-time compiles for gfx942 and gfx950. This partially addresses #37709.
Also, move the test case to MI325 to verify the change, since there are currently no MI355 agents.
Test Plan
python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput
Test Result
Exit code of 0 with the below stdout.
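For anyone reproducing this locally, the pass criterion above (exit code 0) could be scripted roughly as follows. This is only a sketch: the wrapper functions are hypothetical, and it assumes the command is run from the root of a vLLM checkout on a machine with a supported GPU and DeepEP installed.

```python
import subprocess


def build_test_command(
    model: str = "Qwen/Qwen1.5-MoE-A2.7B",
    tp: int = 1,
    dp: int = 2,
    max_model_len: int = 2048,
    backend: str = "deepep_high_throughput",
) -> list[str]:
    """Assemble the test-plan command as an argv list (hypothetical helper)."""
    return [
        "python3",
        "examples/offline_inference/data_parallel.py",
        f"--model={model}",
        f"-tp={tp}",
        f"-dp={dp}",
        f"--max-model-len={max_model_len}",
        f"--all2all-backend={backend}",
    ]


def run_test() -> bool:
    """Run the command and report success when the exit code is 0."""
    result = subprocess.run(build_test_command(), check=False)
    return result.returncode == 0
```

Calling `run_test()` launches the example and returns `True` only on a zero exit code, matching the result reported here.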