[AMD][CI] Update DeepEP branch #38396

Merged
gshtras merged 3 commits into vllm-project:main from rjrock:amd/deepep_update on Apr 17, 2026

Conversation

Contributor

@rjrock rjrock commented Mar 27, 2026

Purpose

Update the DeepEP branch to a version that correctly compiles ahead of time for gfx942 and gfx950. This partially addresses #37709.

Also, move the test case to MI325 to verify the change, since there are currently no MI355 agents.

Test Plan

python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput

Test Result

Exit code 0 with the following stdout:

DP rank 0, Prompt: 'Hello, my name is', Generated text: ' Josh and I will be the teacher for the Braille and Computers class for this'
DP rank 0, Prompt: 'The president of the United States is', Generated text: ' the most powerful person in the world. The President is the head of the executive'
DP rank 0, Prompt: 'The capital of France is', Generated text: '______.\nA. London\nB. Paris\nC. New York\n'
DP rank 0, Prompt: 'The future of AI is', Generated text: ' being decided in Cambridge\nArtificial intelligence (AI) is one of the most'
DP rank 0, Prompt: 'Hello, my name is', Generated text: ' Belinda and I am a 43 year old woman who is passionate about'
DP rank 1, Prompt: 'Hello, my name is', Generated text: ' Josh and I will be the teacher for the Braille and Literacy Teaching in the 21'
DP rank 1, Prompt: 'The president of the United States is', Generated text: ' the most powerful person in the world. The President is the head of the executive branch of the U'
DP rank 1, Prompt: 'The capital of France is', Generated text: ' ______.\nA. Berlin\nB. London\nC. Madrid\nD. Paris\n答案:\n'
DP rank 1, Prompt: 'The future of AI is', Generated text: ' being determined right now – by you.\nFor all the excitement about the transformative power of artificial intelligence,'
DP rank 1, Prompt: 'Hello, my name is', Generated text: ' Alan Belcher and I am a 43 year old male. I am a 20'
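A check like the one described in the test result could be automated roughly as follows. This is a minimal sketch, not part of the PR: the function names are hypothetical, and it assumes the example script keeps printing lines in the "DP rank N, Prompt: ..., Generated text: ..." shape shown above.

```python
import re
from collections import Counter

# Matches the per-prompt lines shown in the test result above.
# Assumption: prompt and text are printed as quoted Python reprs.
LINE_RE = re.compile(
    r"DP rank (\d+), Prompt: (['\"]).*?\2, Generated text: (['\"])(.*)\3$"
)


def completions_per_rank(stdout: str) -> Counter:
    """Count non-empty generations reported by each data-parallel rank."""
    counts = Counter()
    for line in stdout.splitlines():
        match = LINE_RE.search(line.strip())
        if match and match.group(4):
            counts[int(match.group(1))] += 1
    return counts


def all_ranks_generated(stdout: str, dp_size: int) -> bool:
    """Return True if every rank in range(dp_size) produced some text."""
    counts = completions_per_rank(stdout)
    return all(counts[rank] > 0 for rank in range(dp_size))
```

With -dp=2 as in the test plan, all_ranks_generated(stdout, 2) would be expected to hold for the output above, in addition to the exit code of 0.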


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify Bot added ci/build rocm Related to AMD ROCm labels Mar 27, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Mar 27, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request reorganizes test execution in the Buildkite configuration and updates the ROCm Dockerfile by bumping the DeepEP branch and switching the ROCSHMEM build process to a script-based approach. Feedback was provided regarding the hardcoding of GPU targets in the Dockerfile, recommending the use of the newly introduced build argument for improved configurability.

Comment thread docker/Dockerfile.rocm Outdated
@rjrock rjrock marked this pull request as ready for review March 30, 2026 17:27
@rjrock rjrock requested review from gshtras and tjtanaa as code owners March 30, 2026 17:27

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Contributor Author

rjrock commented Mar 30, 2026

Collaborator

tjtanaa commented Apr 10, 2026

Contributor Author

rjrock commented Apr 10, 2026

passed CI, https://buildkite.com/vllm/amd-ci/builds/7017/steps/canvas?sid=019d3123-242c-4369-82fd-88803aad05da&tab=output#019d3123-258b-49d1-8c1c-de73727ce6f8/L2337-L2341

The test results that you have attached are showing failures.

I'm not sure what you mean; the test command python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput passes. This PR updates the DeepEP version so that, at link time, DeepEP links the appropriate kernels into its fat binary; see ROCm/DeepEP#15.

The dbo failure occurs because a hidden size of 2048 was not supported. I asked the DeepEP team to add it, and it looks like they did: https://github.com/ROCm/DeepEP/blob/5a8d55339794c3e01cc61f8c078bf6b8bfb2383e/csrc/kernels/launch.cuh#L132

Comment thread docker/Dockerfile.rocm
ARG DEEPEP_REPO="https://github.com/ROCm/DeepEP.git"
ARG DEEPEP_NIC="cx7"
ARG DEEPEP_ROCM_ARCH="gfx942;gfx950"
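To make the role of the architecture list concrete, here is a minimal sketch, not taken from this PR or from DeepEP's build, of how a semicolon-separated value such as DEEPEP_ROCM_ARCH above could be compared with the architecture a GPU reports. The helper names are hypothetical, and the assumption that device strings may carry feature suffixes (for example gfx942:sramecc+:xnack-) should be checked against the ROCm version in use.

```python
def parse_arch_list(value: str) -> list[str]:
    """Split a semicolon-separated arch list such as "gfx942;gfx950"."""
    return [arch.strip() for arch in value.split(";") if arch.strip()]


def is_arch_covered(device_arch: str, build_archs: list[str]) -> bool:
    """Check whether a device arch is among the ahead-of-time build targets.

    Assumption: anything after the first ":" is a feature suffix
    (e.g. "sramecc+", "xnack-") and is ignored for this comparison.
    """
    base_arch = device_arch.split(":", 1)[0].strip()
    return base_arch in build_archs


if __name__ == "__main__":
    targets = parse_arch_list("gfx942;gfx950")
    for device in ("gfx942:sramecc+:xnack-", "gfx950", "gfx90a"):
        print(device, is_arch_covered(device, targets))
```

Under these assumptions, a device reporting gfx90a (MI250-class hardware) would not be covered by the list above. That kind of mismatch can surface as an 'invalid kernel file' error when kernels are compiled ahead of time.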
Collaborator

I believe there were plans to return 250s to the CI. Is the test not meant to run on those?

Contributor Author

Collaborator

@tjtanaa tjtanaa left a comment

LGTM. @rjrock, can you attach a link to the AMD CI job that this is going to fix and say what status we should expect? I will add the ready label.

@tjtanaa tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 14, 2026
Contributor Author

rjrock commented Apr 14, 2026

LGTM. @rjrock, can you attach a link to the AMD CI job that this is going to fix and say what status we should expect? I will add the ready label.

Yes, this PR partially addresses #37709, in particular the test

python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput

The nightly CI shows this test failing: https://buildkite.com/vllm/amd-ci/builds/7612/steps/canvas?sid=019d8a93-8a56-40f5-9d3b-d4a45284fa79&tab=output#L2327-L2338

(EngineCore_DP0 pid=2615) RuntimeError: Worker failed with error 'Failed: CUDA error /app/DeepEP/csrc/kernels/launch_hip.cuh:71 'invalid kernel file'', please check the stack trace above for the root cause
--
(EngineCore_DP0 pid=2615) DEBUG 04-14 06:32:27 [distributed/device_communicators/shm_broadcast.py:168] Canceling waiting reads on SHM Buffer
DEBUG 04-14 06:32:28 [v1/engine/utils.py:1143] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 04-14 06:32:28 [v1/engine/utils.py:1143] Waiting for 1 local, 0 remote core engine proc(s) to start.
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/vllm-workspace/examples/offline_inference/data_parallel.py", line 137, in main
    llm = LLM(**engine_args)
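When triaging a log like the one above, the root-cause line can be pulled out mechanically. The following is a minimal sketch with a hypothetical helper name; it assumes the error keeps the "CUDA error <file>:<line> '<message>'" shape seen in this log, which is not a documented format.

```python
import re

# Assumption: the kernel-launch failure is reported in the same shape as in
# the log above, i.e. CUDA error <source file>:<line> '<message>'.
ERROR_RE = re.compile(
    r"CUDA error (?P<file>\S+):(?P<line>\d+) '(?P<message>[^']+)'"
)


def find_kernel_errors(log: str) -> list[dict[str, str]]:
    """Return file, line, and message for each matching error in the log."""
    return [match.groupdict() for match in ERROR_RE.finditer(log)]


if __name__ == "__main__":
    sample = (
        "(EngineCore_DP0 pid=2615) RuntimeError: Worker failed with error "
        "'Failed: CUDA error /app/DeepEP/csrc/kernels/launch_hip.cuh:71 "
        "'invalid kernel file'', please check the stack trace above"
    )
    for error in find_kernel_errors(sample):
        print(error["file"], error["line"], error["message"])
```

Filtering on the message field (here 'invalid kernel file') separates this failure from unrelated worker errors in the same CI output.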

Updating the DeepEP version to one whose setup.py correctly builds the fat binary makes the test succeed and yields the following output:

DP rank 0, Prompt: 'Hello, my name is', Generated text: ' Josh and I will be the teacher for the Braille and Computers class for this'
DP rank 0, Prompt: 'The president of the United States is', Generated text: ' the most powerful person in the world. The President is the head of the executive'
DP rank 0, Prompt: 'The capital of France is', Generated text: '______.\nA. London\nB. Paris\nC. New York\n'
DP rank 0, Prompt: 'The future of AI is', Generated text: ' being decided in Cambridge\nArtificial intelligence (AI) is one of the most'
DP rank 0, Prompt: 'Hello, my name is', Generated text: ' Belinda and I am a 43 year old woman who is passionate about'
DP rank 1, Prompt: 'Hello, my name is', Generated text: ' Josh and I will be the teacher for the Braille and Literacy Teaching in the 21'
DP rank 1, Prompt: 'The president of the United States is', Generated text: ' the most powerful person in the world. The President is the head of the executive branch of the U'
DP rank 1, Prompt: 'The capital of France is', Generated text: ' ______.\nA. Berlin\nB. London\nC. Madrid\nD. Paris\n答案:\n'
DP rank 1, Prompt: 'The future of AI is', Generated text: ' being determined right now – by you.\nFor all the excitement about the transformative power of artificial intelligence,'
DP rank 1, Prompt: 'Hello, my name is', Generated text: ' Alan Belcher and I am a 43 year old male. I am a 20'

rjrock added 3 commits April 16, 2026 16:39
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
@rjrock rjrock force-pushed the amd/deepep_update branch from 6f6f686 to 7ec6924 Compare April 16, 2026 21:39
@gshtras gshtras merged commit 58da4ee into vllm-project:main Apr 17, 2026
13 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD Apr 17, 2026
@rjrock rjrock deleted the amd/deepep_update branch April 17, 2026 19:52
bnellnm pushed a commit to neuralmagic/vllm that referenced this pull request Apr 20, 2026
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
baonudesifeizhai pushed a commit to baonudesifeizhai/vllm that referenced this pull request Apr 23, 2026
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Labels

ci/build ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm

Projects

Status: Done

3 participants