[NVIDIA] Update Qwen3.5 STP recipe on B200 #758
Conversation
@claude please write a PR description based on the diffs here
Claude finished @kedarpotdar-nv's task in 40s: Writing PR Description
PR description has been updated with a summary of all changes in this PR.
functionstackx left a comment:
Mostly LGTM! Please update https://github.com/sgl-project/sgl-cookbook before we can merge this PR, since InferenceMAX is about measuring the performance that end users can get. If these flags aren't documented, then they aren't fully accessible to end users.
Hi @functionstackx, can we merge this? I will continue working on updating the SGLang recipe in the background with my team.
#760

PR WIP here: sgl-project/sgl-cookbook#169
Thanks @kedarpotdar-nv. Can you link the SGLang Qwen3.5 recipe to the NVIDIA recipes page too, since it displays pretty high in Google search: https://docs.nvidia.com/llm-inference-quick-start-recipes/index.html
Looks like this needs a complex rebase.
@kedarpotdar-nv can you try this prompt with Claude?

@claude can you move benchmarks/qwen3.5_bf16_b200.sh to benchmarks/single_node? Look at how it was done after the recent refactor, like the other B200 SGLang scripts in benchmarks/single_node.
|
Claude finished @kedarpotdar-nv's task in 4m 50s: Moving and Refactoring
Align with the multi-node reorganization (PR #666):
- Move the script to the benchmarks/single_node/ directory
- Update the source path to reference the parent benchmark_lib.sh
- Add EP_SIZE and data-parallel-size params for consistency
- Reformat the server launch command to match other B200 SGLang scripts

Co-authored-by: Kedar Potdar <kedarpotdar-nv@users.noreply.github.com>
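As a side note on the second bullet, the one mechanical change the move forces is the source path: once the script sits in benchmarks/single_node/, benchmark_lib.sh is one directory up. A minimal sketch of that path change follows; the helper name lib_path_for is illustrative, not the repo's actual code.

```shell
#!/usr/bin/env bash
# Illustrative sketch (not the repo's code): after moving the script into
# benchmarks/single_node/, benchmark_lib.sh lives one directory up, so it
# must be sourced from the parent directory rather than alongside the script.
lib_path_for() {
  # $1 = directory containing the benchmark script
  printf '%s/../benchmark_lib.sh\n' "$1"
}

lib_path_for "benchmarks/single_node"
# → benchmarks/single_node/../benchmark_lib.sh
# In the moved script this becomes roughly:
#   source "$(dirname "$0")/../benchmark_lib.sh"
```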
This is good to merge @functionstackx. Perf is lining up after the refactor.
LGTM @kedarpotdar-nv. Ideally you can add more runners to the cluster before merging, since merging before you do will cause a traffic jam.
@functionstackx multiple runners are being tested right now. @csahithi will open a PR shortly with improved cleanup logic to enable multiple runners. I don't want to keep this held up for that.
FWIW, this full sweep only takes 2 hours.
@functionstackx can you merge whenever? I have stopped the MiniMax run for this to complete; it shouldn't take long.

Summary
SGLang cookbook WIP PR here: sgl-project/sgl-cookbook#169
Updates the Qwen3.5-397B-A17B BF16 SGLang benchmark launch configuration on B200 with optimized server parameters and environment tuning for improved performance.
Changes
Benchmark script (benchmarks/qwen3.5_bf16_b200.sh):
- Environment: NCCL_NVLS_ENABLE=1, SGL_ENABLE_JIT_DEEPGEMM=false, SGLANG_ENABLE_FLASHINFER_GEMM=true, PYTHONUNBUFFERED=1
- trtllm_mha attention backend and flashinfer_trtllm MOE runner
- mem-fraction-static raised from 0.80 to 0.82
- chunked-prefill-size and max-prefill-tokens (both 32768)
- cuda-graph-max-bs set to match the concurrency level
- ISL + OSL + 20
- scheduler-recv-interval (10 for low-latency conc ≤ 8, 30 for max-throughput conc ≥ 16)
- Added flags: --disable-radix-cache, --served-model-name, --trust-remote-code, --tokenizer-worker-num 6, --stream-interval 30, --max-running-requests 128, --enable-flashinfer-allreduce-fusion

Perf changelog (perf-changelog.yaml): qwen3.5-bf16-b200-sglang

Image: lmsysorg/sglang:nightly-dev-20260216-d3bae71e (unchanged from initial config)
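For readers skimming the change list, the tuning above could be assembled into a launch roughly like the sketch below. This is a hedged illustration built only from the bullets in this PR, not a copy of the actual script: the model path, served name, and the exact spellings of the attention-backend and MoE-runner flags for this sglang build are assumptions.

```shell
#!/usr/bin/env bash
# Hedged sketch of the tuned launch. Model path, served name, and the
# attention/MoE backend flag spellings are assumptions, not the real script.
export NCCL_NVLS_ENABLE=1
export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
export PYTHONUNBUFFERED=1

CONCURRENCY="${CONCURRENCY:-8}"
# scheduler-recv-interval: 10 for low-latency (conc <= 8),
# 30 for max-throughput (conc >= 16), per the change list above.
if [ "${CONCURRENCY}" -le 8 ]; then
  RECV_INTERVAL=10
else
  RECV_INTERVAL=30
fi

python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --served-model-name qwen3.5 \
  --trust-remote-code \
  --attention-backend trtllm_mha \
  --moe-runner-backend flashinfer_trtllm \
  --mem-fraction-static 0.82 \
  --chunked-prefill-size 32768 \
  --max-prefill-tokens 32768 \
  --cuda-graph-max-bs "${CONCURRENCY}" \
  --scheduler-recv-interval "${RECV_INTERVAL}" \
  --disable-radix-cache \
  --tokenizer-worker-num 6 \
  --stream-interval 30 \
  --max-running-requests 128 \
  --enable-flashinfer-allreduce-fusion
```

The one piece of logic worth calling out is the concurrency-conditional scheduler-recv-interval, which is why the sketch computes RECV_INTERVAL before launching rather than hard-coding it.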