[NVIDIA] feat: adds more configurations for GB200 SGLang DSR1#335

Merged
cquil11 merged 27 commits into main from ishan/moreconfigs
Dec 19, 2025

Conversation

@yunzhoul-nv
Collaborator

No description provided.

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Comment thread on runners/launch_gb200-nv.sh (Outdated)
fi
export MODEL_PATH="/mnt/lustre01/models/deepseek-r1-0528"
export CONFIG_DIR="/mnt/lustre01/artifacts/sglang-configs/1k1k"
export IMAGE="/mnt/lustre01/artifacts/containers/lmsysorg+sglang+v0.5.5.post2.sqsh"
Contributor
Thanks for the PR! Is it possible to have the IMAGE inherit from nvidia-master.yaml instead of hard-setting it in the launcher script?

Kinda like what trtllm dynamo already does?
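The suggestion above could look roughly like this. A minimal sketch, assuming the master config exposes the image under a key like `sglang_image` — the key name and the sed-based extraction are illustrative placeholders, not the repo's actual schema:

```shell
#!/usr/bin/env bash
# Illustrative sketch only: have the launcher read the container image
# from nvidia-master.yaml instead of hardcoding it in the script.
# The key name "sglang_image" is a hypothetical placeholder.
set -euo pipefail

CONFIG="nvidia-master.yaml"
cat > "$CONFIG" <<'EOF'
sglang_image: lmsysorg/sglang:v0.5.5.post2
EOF

# Extract the value after "sglang_image:" and strip leading whitespace.
IMAGE_REF=$(sed -n 's/^sglang_image:[[:space:]]*//p' "$CONFIG")

# Reuse the launcher's existing sanitization convention to build the .sqsh path.
export IMAGE="/mnt/lustre01/artifacts/containers/$(echo "$IMAGE_REF" | sed 's/[\/:@#]/_/g').sqsh"
echo "$IMAGE"
```

This keeps a single source of truth for the container version, so bumping SGLang only touches the yaml.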

Collaborator Author

Hi @functionstackx, thanks for the comment! I have updated the code in InferenceMAX/InferenceMAX@c1024db so that Dynamo+SGLang also pulls the container from nvidia-master.yaml.


@yunzhoul-nv
Collaborator Author

Hi @cquil11 @functionstackx could we have another round of review for this branch before today's nightly?

Comment thread on runners/launch_gb200-nv.sh (Outdated)
if [ "$ISL" = "1024" ] && [ "$OSL" = "1024" ]; then
export SGL_SLURM_JOBS_PATH="dynamo/examples/backends/sglang/slurm_jobs"
if [[ $PRECISION == "fp4" ]]; then
export MODEL_PATH="/mnt/lustre01/models/deepseek-r1-0528-fp4-v2"
Collaborator

For posterity, it would be preferable if this were retrieved from the master config. I.e., in the master config, make the model field /mnt/lustre01/models/deepseek-r1-0528-fp4-v2 and add a brief comment explaining that on the GB200 cluster we use pre-downloaded models, as opposed to the standard InferenceMAX convention of downloading from HF into the HF cache.
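A minimal sketch of that layout, assuming a yaml key named `model` — the key name and the sed-based parsing here are illustrative, not the repo's actual config schema:

```shell
#!/usr/bin/env bash
# Illustrative sketch: keep the model path in the master config rather
# than hardcoded in the launcher. The key name "model" is a hypothetical
# placeholder for whatever the real schema uses.
set -euo pipefail

CONFIG="nvidia-master.yaml"
cat > "$CONFIG" <<'EOF'
# On the GB200 cluster we use pre-downloaded models, so this is a local
# path rather than a Hugging Face repo id downloaded to the HF cache.
model: /mnt/lustre01/models/deepseek-r1-0528-fp4-v2
EOF

# The launcher then reads the path instead of hardcoding it.
MODEL_PATH=$(sed -n 's/^model:[[:space:]]*//p' "$CONFIG")
export MODEL_PATH
echo "$MODEL_PATH"
```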

Collaborator Author

Thanks for the suggestion! I have addressed it through InferenceMAX/InferenceMAX@35d7555 and InferenceMAX/InferenceMAX@45cc883, which apply to the TRTLLM side of the code as well.

Comment thread on runners/launch_gb200-nv.sh (Outdated)
export MODEL_PATH="/mnt/lustre01/models/deepseek-r1-0528-fp4-v2"
else
export SGL_SLURM_JOBS_PATH="dynamo/components/backends/sglang/slurm_jobs"
export MODEL_PATH="/mnt/lustre01/models/deepseek-r1-0528"
Collaborator

same here

# For now we add conditionals to this script to use newer code for the 1k1k configs

### FRAMEWORK_DIFF_IF_STATEMENT #1 - difference in setting up envvars
SQUASH_FILE="/mnt/lustre01/users/sa-shared/images/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
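For reference, the sed call in the snippet above maps an arbitrary container reference to a filesystem-safe squash-file name by replacing `/`, `:`, `@`, and `#` with underscores. A standalone sketch of just that transformation:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Replace the characters /, :, @ and # so an arbitrary container
# reference becomes a safe filename for the cached squash image.
IMAGE="lmsysorg/sglang:v0.5.5.post2"
SQUASH_NAME="$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
echo "$SQUASH_NAME"   # lmsysorg_sglang_v0.5.5.post2.sqsh
```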
Collaborator

nice! thanks

@cquil11
Collaborator

cquil11 commented Dec 17, 2025

@yunzhoul-nv btw, after #267 , we no longer run nightly. instead, we run on an "as needed" basis when configs change
so when this is merged, it will be run!

@yunzhoul-nv
Collaborator Author

@yunzhoul-nv btw, after #267 , we no longer run nightly. instead, we run on an "as needed" basis when configs change so when this is merged, it will be run!

That will be super awesome. Thanks for the info!

@cquil11
Collaborator

cquil11 commented Dec 17, 2025

@codex you are worthless bro 😭

@chatgpt-codex-connector

To use Codex here, create an environment for this repo.

@cquil11
Collaborator

cquil11 commented Dec 17, 2025

@yunzhoul-nv here is the test run triggered by perf changelog https://github.com/InferenceMAX/InferenceMAX/actions/runs/20320284641
gonna be slow asf til we can get oversubscription on the GB200 cluster
we thought this would be done by now, ugh 😢

@cquil11
Collaborator

cquil11 commented Dec 18, 2025

@yunzhoul-nv seems to be something wrong with results processing -- can you please have a look at the run I linked above?

@yunzhoul-nv
Collaborator Author

@yunzhoul-nv seems to be something wrong with results processing -- can you please have a look at the run I linked above?

Thanks! I see that the TRTLLM job failed while SGLang is still running, so I'll wait a bit longer and see whether SGLang fails as well or whether this is just a TRTLLM issue. Once we confirm it's a TRTLLM issue, I'll chase down the cause by running TRTLLM workloads with the End to End Workflow helper and fix it as soon as possible.

@yunzhoul-nv
Collaborator Author

yunzhoul-nv commented Dec 18, 2025

@yunzhoul-nv seems to be something wrong with results processing -- can you please have a look at the run I linked above?

Just realized that I used the wrong model path for TRTLLM 😅 I have fixed it in the latest commit.

Is it possible for us to cancel the current run and retrigger another sweep? Thanks! Edit: just saw that a new sweep was triggered automatically here: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20321799373?pr=335.

@cquil11
Collaborator

cquil11 commented Dec 18, 2025

great, thanks! @yunzhoul-nv
as mentioned, this will take... a while to run haha
working with @kedarpotdar-nv to get more runner processes listening on the GB200 rack, which will make it run at normal speed again 💪

@yunzhoul-nv
Collaborator Author

@functionstackx @cquil11 All pipelines have passed, so this PR should be safe to merge. Could you approve it?

Collaborator

@cquil11 cquil11 left a comment


awesome!

@cquil11 cquil11 merged commit c040b5c into main Dec 19, 2025
40 checks passed
@cquil11 cquil11 deleted the ishan/moreconfigs branch December 19, 2025 01:29
@github-project-automation github-project-automation Bot moved this from In Progress to Done in InferenceMAX Board Dec 19, 2025
@cquil11 cquil11 changed the title feat: adds more configurations for GB200 SGLang DSR1 [NVIDIA] feat: adds more configurations for GB200 SGLang DSR1 Apr 8, 2026


4 participants