[AMD] Use vLLM framework for DeepSeek R1 on MI325 and MI355 hardware #104

Closed
omirosh wants to merge 6 commits into SemiAnalysisAI:main from omirosh:amd_dsr1

omirosh commented Oct 14, 2025

Please consider changing the framework for DeepSeek R1 to vLLM; it shows better performance than SGLang.
Here is also the documentation for running DeepSeek with vLLM.

omirosh requested a review from a team as a code owner October 14, 2025 13:15
@qcolombet
Contributor

@cquil11, @functionstackx, I don't have permission to assign a reviewer, so I'm just tagging you both :).
I know you guys are figuring out the load on the CI before moving forward with the review.
Let me know what we can do to help.

@functionstackx
Contributor

functionstackx commented Oct 17, 2025

@qcolombet yes, we are looking into it

@cquil11 is just trying to land a massive refactor PR first to reduce tech debt, and then we can look into this one

@functionstackx
Contributor

@merrymercy

@functionstackx
Contributor

functionstackx commented Jan 5, 2026

@omirosh closing as this is a stale PR. Happy to take a look at AMD DeepSeek vLLM in addition to AMD's SGLang DeepSeek configs if single node is a community vLLM image that supports FP4 and FP8

cquil11 added the AMD label Apr 8, 2026
cquil11 changed the title from "Use vLLM framework for DeepSeek R1 on MI325 and MI355 hardware" to "[AMD] Use vLLM framework for DeepSeek R1 on MI325 and MI355 hardware" Apr 8, 2026
cquil11 added a commit that referenced this pull request Apr 28, 2026
…ression

Upstream commit 52e697d (#108 "fix(nginx): raise file descriptor limit for
nginx workers") prepends `ulimit -n 1048576 &&` to the nginx srun command.
On clusters whose container inherits a sub-1M RLIMIT_NOFILE hard limit
from slurmd/PAM, the bash builtin's setrlimit fails with EPERM (raising
the hard rlimit needs CAP_SYS_RESOURCE in the init user namespace, which
pyxis --container-remap-root does not grant). The `&&` short-circuits and
nginx never starts — caught when re-running dsr1-fp4-gb200-dynamo-sglang.

Pin back to 698590e ("feat(config): cluster-wide default_bash_preamble for
ulimits and the like (#104)"), the immediately prior commit, where nginx
runs without the chained ulimit. Bump forward once upstream softens the
ulimit to `|| true` or makes it opt-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
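
The short-circuit described in the commit message above can be illustrated with a minimal sketch. It assumes a hypothetical `start_server` function standing in for the real nginx srun command; neither the function names nor the launch logic come from the upstream repository. The strict form chains the launch behind `ulimit` with `&&`, so a failed limit change prevents the launch; the tolerant form follows the `|| true` softening the message suggests.

```shell
#!/usr/bin/env bash
# Sketch only: start_server is a hypothetical stand-in for the nginx srun command.
start_server() {
  echo "server started"
}

# Strict form: if ulimit fails (for example with EPERM when the requested value
# exceeds the inherited hard limit), `&&` short-circuits and nothing is launched.
launch_strict() {
  ulimit -n "$1" 2>/dev/null && start_server
}

# Tolerant form: attempt to raise the limit, ignore a failure, and launch anyway.
launch_tolerant() {
  ulimit -n "$1" 2>/dev/null || true
  start_server
}
```

Calling either function inside a subshell, for example `(launch_tolerant 1048576)`, keeps any limit change from leaking into the calling shell. With the tolerant form the server runs with whatever file descriptor limit it inherited when the raise is refused, so an opt-in check of `ulimit -n` afterwards may still be worthwhile.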

4 participants