[NVIDIA] feat: adds more configurations for GB200 SGLang DSR1 #335
Conversation
This reverts commit ce40018.
    fi
    export MODEL_PATH="/mnt/lustre01/models/deepseek-r1-0528"
    export CONFIG_DIR="/mnt/lustre01/artifacts/sglang-configs/1k1k"
    export IMAGE="/mnt/lustre01/artifacts/containers/lmsysorg+sglang+v0.5.5.post2.sqsh"
Hi @functionstackx , thanks for the comment! I have updated the code in InferenceMAX/InferenceMAX@c1024db so that Dynamo+SGLang will also pull the container from nvidia-master.yaml.
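Centralizing the container reference in nvidia-master.yaml means the launch script has to read it back out. A minimal self-contained sketch of one way to do that (the file layout and the `image` key are assumptions; the real pipeline may structure the config differently):

```shell
# Write a stand-in master config so this sketch is self-contained
# (assumption: the real nvidia-master.yaml keeps the container reference
# under an "image:" key).
cat > nvidia-master.yaml <<'EOF'
sglang:
  image: lmsysorg/sglang:v0.5.5.post2
EOF

# Pull the value of the first "image:" key with sed; a YAML-aware tool
# such as yq would be more robust, but this avoids extra dependencies.
IMAGE="$(sed -n 's/^ *image: *//p' nvidia-master.yaml | head -n 1)"
echo "$IMAGE"
```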
Hi @cquil11 @functionstackx, could we have another round of review on this branch before today's nightly?
    if [ "$ISL" = "1024" ] && [ "$OSL" = "1024" ]; then
    export SGL_SLURM_JOBS_PATH="dynamo/examples/backends/sglang/slurm_jobs"
    if [[ $PRECISION == "fp4" ]]; then
    export MODEL_PATH="/mnt/lustre01/models/deepseek-r1-0528-fp4-v2"
For posterity, it would be preferable if this were retrieved from the master config.
I.e., in the master config make the model field /mnt/lustre01/models/deepseek-r1-0528-fp4-v2, then add a brief comment explaining that on the GB200 cluster we use pre-downloaded models, as opposed to the standard InferenceMAX convention of downloading with HF into the HF cache.
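For reference, the suggested master-config change might look something like this (a sketch only; the actual key names and structure of nvidia-master.yaml may differ):

```yaml
# GB200 cluster: models are pre-downloaded to Lustre, so the model field
# holds a local path rather than a HF repo id that would be fetched into
# the HF cache (the standard InferenceMAX convention).
model: /mnt/lustre01/models/deepseek-r1-0528-fp4-v2
```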
Thanks for the suggestion! I have addressed it through InferenceMAX/InferenceMAX@35d7555 and InferenceMAX/InferenceMAX@45cc883, which apply to the TRTLLM side of the code as well.
    export MODEL_PATH="/mnt/lustre01/models/deepseek-r1-0528-fp4-v2"
    else
    export SGL_SLURM_JOBS_PATH="dynamo/components/backends/sglang/slurm_jobs"
    export MODEL_PATH="/mnt/lustre01/models/deepseek-r1-0528"
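Putting the two hunks above together, the precision branch reads roughly as follows (the paths and the `$PRECISION` convention come from the diff; the surrounding script is assumed, so treat this as illustrative rather than the final version):

```shell
#!/usr/bin/env bash
# Sketch of the precision-dependent env setup reconstructed from the
# diff hunks in this PR: fp4 uses the pre-quantized model and the newer
# examples/ slurm_jobs tree, everything else uses the components/ tree.
if [[ $PRECISION == "fp4" ]]; then
    export SGL_SLURM_JOBS_PATH="dynamo/examples/backends/sglang/slurm_jobs"
    export MODEL_PATH="/mnt/lustre01/models/deepseek-r1-0528-fp4-v2"
else
    export SGL_SLURM_JOBS_PATH="dynamo/components/backends/sglang/slurm_jobs"
    export MODEL_PATH="/mnt/lustre01/models/deepseek-r1-0528"
fi
echo "$MODEL_PATH"
```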
    # For now we add conditionals to this script to use newer code for the 1k1k configs

    ### FRAMEWORK_DIFF_IF_STATEMENT #1 - difference in setting up envvars
    SQUASH_FILE="/mnt/lustre01/users/sa-shared/images/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
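The `SQUASH_FILE` line above derives a filesystem-safe cache name from the container reference. A small standalone demonstration of that sed substitution (the sample image name is hypothetical):

```shell
# The sed expression replaces each of / : @ # with "_", yielding a name
# that is safe to use as a single file name in the image cache.
IMAGE="lmsysorg/sglang:v0.5.5.post2"
SAFE_NAME="$(echo "$IMAGE" | sed 's/[\/:@#]/_/g')"
echo "${SAFE_NAME}.sqsh"   # lmsysorg_sglang_v0.5.5.post2.sqsh
```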
@yunzhoul-nv btw, after #267 we no longer run nightly; instead, we run on an "as needed" basis when configs change.
That will be super awesome. Thanks for the info!
@codex you are worthless bro 😭

To use Codex here, create an environment for this repo.
@yunzhoul-nv here is the test run triggered by the perf changelog: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20320284641
@yunzhoul-nv there seems to be something wrong with results processing -- can you please have a look at the run I linked above?
Thanks! I see that the TRTLLM job failed while SGLang is still running, so I'll wait a bit longer and observe whether SGLang fails as well or whether this is just a TRTLLM thing. Once we can confirm that it is a TRTLLM thing, I'll chase down the cause by running TRTLLM workloads with
Just realized that I have used the wrong model path for TRTLLM 😅 I have fixed it in the latest commit.
Great, thanks! @yunzhoul-nv
@functionstackx @cquil11 All pipelines have passed, so I believe this PR can be safely merged. Could you give it an approval?

No description provided.