[NVIDIA] feat: adds more configurations for GB200 SGLang DSR1#335

Merged
cquil11 merged 27 commits into main from ishan/moreconfigs
Dec 19, 2025

Conversation

@yunzhoul-nv
Collaborator

No description provided.

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Comment thread on runners/launch_gb200-nv.sh (Outdated)
fi
export MODEL_PATH="/mnt/lustre01/models/deepseek-r1-0528"
export CONFIG_DIR="/mnt/lustre01/artifacts/sglang-configs/1k1k"
export IMAGE="/mnt/lustre01/artifacts/containers/lmsysorg+sglang+v0.5.5.post2.sqsh"
Contributor
Thanks for the PR! Is it possible to have the IMAGE inherit from nvidia-master.yaml instead of hard-setting it in the launcher script?

Kinda like what trtllm dynamo already does?
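The suggestion above could look roughly like this. A minimal sketch, assuming the master config exposes the image under a key like `sglang_image` — the key name and the sed-based extraction are illustrative placeholders, not the repo's actual schema:

```shell
#!/usr/bin/env bash
# Illustrative sketch only: have the launcher read the container image
# from nvidia-master.yaml instead of hardcoding it in the script.
# The key name "sglang_image" is a hypothetical placeholder.
set -euo pipefail

CONFIG="nvidia-master.yaml"
cat > "$CONFIG" <<'EOF'
sglang_image: lmsysorg/sglang:v0.5.5.post2
EOF

# Extract the value after "sglang_image:" and strip leading whitespace.
IMAGE_REF=$(sed -n 's/^sglang_image:[[:space:]]*//p' "$CONFIG")

# Reuse the launcher's existing sanitization convention to build the .sqsh path.
export IMAGE="/mnt/lustre01/artifacts/containers/$(echo "$IMAGE_REF" | sed 's/[\/:@#]/_/g').sqsh"
echo "$IMAGE"
```

This keeps a single source of truth for the container version, so bumping SGLang only touches the yaml.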

Collaborator Author

Hi @functionstackx, thanks for the comment! I have updated the code in InferenceMAX/InferenceMAX@c1024db so that Dynamo+SGLang also pulls the container from nvidia-master.yaml.


@yunzhoul-nv
Collaborator Author

Hi @cquil11 @functionstackx could we have another round of review for this branch before today's nightly?

Comment thread on runners/launch_gb200-nv.sh (Outdated)
if [ "$ISL" = "1024" ] && [ "$OSL" = "1024" ]; then
export SGL_SLURM_JOBS_PATH="dynamo/examples/backends/sglang/slurm_jobs"
if [[ $PRECISION == "fp4" ]]; then
export MODEL_PATH="/mnt/lustre01/models/deepseek-r1-0528-fp4-v2"
Collaborator

For posterity, it would be preferable if this were retrieved from the master config. I.e., in the master config, make the model field /mnt/lustre01/models/deepseek-r1-0528-fp4-v2 and add a brief comment explaining that on the GB200 cluster we use pre-downloaded models, as opposed to the standard InferenceMAX convention of downloading from HF into the HF cache.
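A minimal sketch of that layout, assuming a yaml key named `model` — the key name and the sed-based parsing here are illustrative, not the repo's actual config schema:

```shell
#!/usr/bin/env bash
# Illustrative sketch: keep the model path in the master config rather
# than hardcoded in the launcher. The key name "model" is a hypothetical
# placeholder for whatever the real schema uses.
set -euo pipefail

CONFIG="nvidia-master.yaml"
cat > "$CONFIG" <<'EOF'
# On the GB200 cluster we use pre-downloaded models, so this is a local
# path rather than a Hugging Face repo id downloaded to the HF cache.
model: /mnt/lustre01/models/deepseek-r1-0528-fp4-v2
EOF

# The launcher then reads the path instead of hardcoding it.
MODEL_PATH=$(sed -n 's/^model:[[:space:]]*//p' "$CONFIG")
export MODEL_PATH
echo "$MODEL_PATH"
```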

Collaborator Author

Thanks for the suggestion! I have addressed it through InferenceMAX/InferenceMAX@35d7555 and InferenceMAX/InferenceMAX@45cc883, which apply to the TRTLLM side of the code as well.

Comment thread on runners/launch_gb200-nv.sh (Outdated)
export MODEL_PATH="/mnt/lustre01/models/deepseek-r1-0528-fp4-v2"
else
export SGL_SLURM_JOBS_PATH="dynamo/components/backends/sglang/slurm_jobs"
export MODEL_PATH="/mnt/lustre01/models/deepseek-r1-0528"
Collaborator

same here

# For now we add conditionals to this script to use newer code for the 1k1k configs

### FRAMEWORK_DIFF_IF_STATEMENT #1 - difference in setting up envvars
SQUASH_FILE="/mnt/lustre01/users/sa-shared/images/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
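For reference, the sed call in the snippet above maps an arbitrary container reference to a filesystem-safe squash-file name by replacing `/`, `:`, `@`, and `#` with underscores. A standalone sketch of just that transformation:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Replace the characters /, :, @ and # so an arbitrary container
# reference becomes a safe filename for the cached squash image.
IMAGE="lmsysorg/sglang:v0.5.5.post2"
SQUASH_NAME="$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
echo "$SQUASH_NAME"   # lmsysorg_sglang_v0.5.5.post2.sqsh
```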
Collaborator

nice! thanks

@cquil11
Collaborator

cquil11 commented Dec 17, 2025

@yunzhoul-nv btw, after #267 , we no longer run nightly. instead, we run on an "as needed" basis when configs change
so when this is merged, it will be run!

@yunzhoul-nv
Collaborator Author

@yunzhoul-nv btw, after #267 , we no longer run nightly. instead, we run on an "as needed" basis when configs change so when this is merged, it will be run!

That will be super awesome. Thanks for the info!

@cquil11
Collaborator

cquil11 commented Dec 17, 2025

@codex you are worthless bro 😭

@chatgpt-codex-connector

To use Codex here, create an environment for this repo.

@cquil11
Collaborator

cquil11 commented Dec 17, 2025

@yunzhoul-nv here is the test run triggered by perf changelog https://github.com/InferenceMAX/InferenceMAX/actions/runs/20320284641
gonna be slow asf til we can get oversubscription on the GB200 cluster
we thought this would be done by now, ugh 😢

@cquil11
Collaborator

cquil11 commented Dec 18, 2025

@yunzhoul-nv seems to be something wrong with results processing -- can you please have a look at the run I linked above?

@yunzhoul-nv
Collaborator Author

@yunzhoul-nv seems to be something wrong with results processing -- can you please have a look at the run I linked above?

Thanks! I see that the TRTLLM job failed while SGLang is still running, so I'll wait a bit longer and see whether SGLang fails as well or whether this is just a TRTLLM issue. Once we confirm it's a TRTLLM issue, I'll chase down the cause by running TRTLLM workloads with the End to End Workflow helper and fix it as soon as possible.

@yunzhoul-nv
Collaborator Author

yunzhoul-nv commented Dec 18, 2025

@yunzhoul-nv seems to be something wrong with results processing -- can you please have a look at the run I linked above?

Just realized that I used the wrong model path for TRTLLM 😅 I have fixed it in the latest commit.

Is it possible for us to cancel the current run and retrigger another sweep? Thanks! Edit: just saw that a new sweep was triggered automatically here: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20321799373?pr=335.

@cquil11
Collaborator

cquil11 commented Dec 18, 2025

great, thanks! @yunzhoul-nv
as mentioned, this will take... a while to run haha
working with @kedarpotdar-nv to get more runner processes listening on the GB200 rack, which will make it run at normal speed again 💪

@yunzhoul-nv
Collaborator Author

@functionstackx @cquil11 All pipelines have passed, so this PR should be safe to merge. Could you approve it?

Collaborator

@cquil11 cquil11 left a comment


awesome!

@cquil11 cquil11 merged commit c040b5c into main Dec 19, 2025
40 checks passed
@cquil11 cquil11 deleted the ishan/moreconfigs branch December 19, 2025 01:29
@github-project-automation github-project-automation Bot moved this from In Progress to Done in InferenceMAX Board Dec 19, 2025
@cquil11 cquil11 changed the title feat: adds more configurations for GB200 SGLang DSR1 [NVIDIA] feat: adds more configurations for GB200 SGLang DSR1 Apr 8, 2026


4 participants