[NVIDIA] Update Qwen3.5 STP recipe on B200 #758

Merged
functionstackx merged 11 commits into main from nv/qwen35-b200-update
Feb 20, 2026

@kedarpotdar-nv
Collaborator

@kedarpotdar-nv kedarpotdar-nv commented Feb 19, 2026

Summary

sglang cookbook WIP PR here - sgl-project/sgl-cookbook#169

Updates the Qwen3.5-397B-A17B BF16 SGLang benchmark launch configuration on B200 with optimized server parameters and environment tuning for improved performance.

Changes

Benchmark script (benchmarks/qwen3.5_bf16_b200.sh):

  • Environment variables: Add NCCL_NVLS_ENABLE=1, SGL_ENABLE_JIT_DEEPGEMM=false, SGLANG_ENABLE_FLASHINFER_GEMM=true, PYTHONUNBUFFERED=1
  • Attention & MOE backends: Switch to trtllm_mha attention backend and flashinfer_trtllm MOE runner
  • Memory tuning: Increase mem-fraction-static from 0.80 to 0.82
  • Prefill config: Add chunked-prefill-size and max-prefill-tokens (both 32768)
  • CUDA graphs: Set cuda-graph-max-bs to match concurrency level
  • Context length: Dynamically compute from ISL + OSL + 20
  • Scheduling: Add adaptive scheduler-recv-interval (10 for low-latency conc ≤8, 30 for max-throughput conc ≥16)
  • Other flags: Add --disable-radix-cache, --served-model-name, --trust-remote-code, --tokenizer-worker-num 6, --stream-interval 30, --max-running-requests 128, --enable-flashinfer-allreduce-fusion
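As a rough illustration of how the derived values above could fit together, here is a minimal bash sketch. It is not the script from this PR: the variable names (ISL, OSL, CONC), the helper recv_interval_for_conc, the fallback to 10 for concurrencies between 9 and 15, and the --attention-backend / --moe-runner-backend flag spellings are assumptions. The model path and --served-model-name are left out because their values are not given here. The command is printed rather than launched so the sketch runs without a GPU.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sweep inputs; the real script receives these from the runner.
ISL="${ISL:-1024}"
OSL="${OSL:-1024}"
CONC="${CONC:-16}"

# Context length: input + output sequence length plus a 20-token margin.
CONTEXT_LENGTH=$((ISL + OSL + 20))

# Scheduler recv interval: 10 for low-latency runs (conc <= 8),
# 30 for max-throughput runs (conc >= 16).
# Assumption: concurrencies 9..15, which the description does not
# cover, fall back to 10.
recv_interval_for_conc() {
  local conc="$1"
  if [ "$conc" -ge 16 ]; then
    echo 30
  else
    echo 10
  fi
}
SCHEDULER_RECV_INTERVAL="$(recv_interval_for_conc "$CONC")"

# Environment variables listed in the description.
export NCCL_NVLS_ENABLE=1
export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
export PYTHONUNBUFFERED=1

# Server flags from the description; cuda-graph-max-bs tracks concurrency.
SERVER_ARGS=(
  --attention-backend trtllm_mha          # flag spelling assumed
  --moe-runner-backend flashinfer_trtllm  # flag spelling assumed
  --mem-fraction-static 0.82
  --chunked-prefill-size 32768
  --max-prefill-tokens 32768
  --cuda-graph-max-bs "$CONC"
  --context-length "$CONTEXT_LENGTH"
  --scheduler-recv-interval "$SCHEDULER_RECV_INTERVAL"
  --disable-radix-cache
  --trust-remote-code
  --tokenizer-worker-num 6
  --stream-interval 30
  --max-running-requests 128
  --enable-flashinfer-allreduce-fusion
)

# Print instead of launching, so the sketch runs without a GPU.
echo "python3 -m sglang.launch_server ${SERVER_ARGS[*]}"
```

Keeping the flags in an array makes it easy to diff the list against the description when a flag is added or dropped.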

Perf changelog (perf-changelog.yaml):

  • Add entry documenting the config changes for qwen3.5-bf16-b200-sglang

Image

lmsysorg/sglang:nightly-dev-20260216-d3bae71e (unchanged from initial config)

@kedarpotdar-nv kedarpotdar-nv changed the title from "[DO NOT MERGE] Update Qwen3.5 STP recipe on B200" to "Update Qwen3.5 STP recipe on B200" Feb 19, 2026
@kedarpotdar-nv
Collaborator Author

@claude please write a PR description based on the diffs here

@Klaud-Cold
Collaborator

Klaud-Cold commented Feb 19, 2026

Claude finished @kedarpotdar-nv's task in 40s


Writing PR Description

  • Gather context and read diffs
  • Analyze changes
  • Write PR description

PR description has been updated with a summary of all changes in this PR.

Contributor

@functionstackx functionstackx left a comment

Mostly lgtm! Plz update https://github.com/sgl-project/sgl-cookbook before we can merge this PR, since inferenceMax is about measuring the performance that end users can get. If these flags aren't documented, then they aren't fully accessible to end users.

@kedarpotdar-nv
Collaborator Author

Hi @functionstackx can we merge this? i will continue working on updating sgl recipe in the background with my team.

@cquil11
Collaborator

cquil11 commented Feb 19, 2026

#760
will do this ASAP @kedarpotdar-nv

@kedarpotdar-nv
Collaborator Author

PR WIP here - sgl-project/sgl-cookbook#169

@functionstackx
Contributor

thanks @kedarpotdar-nv

can u link the sglang qwen3.5 recipe to the nvidia recipes page too since this displays pretty high on google search https://docs.nvidia.com/llm-inference-quick-start-recipes/index.html

@kedarpotdar-nv
Collaborator Author

looks like this needs a complex rebase

@functionstackx
Contributor

@kedarpotdar-nv can u try this prompt with claude?

image

@kedarpotdar-nv
Collaborator Author

@claude can you move benchmarks/qwen3.5_bf16_b200.sh to benchmarks/single_node? Look at how this is done after the recent refactor, like the other B200 SGLang scripts in benchmarks/single_node.

@Klaud-Cold
Collaborator

Klaud-Cold commented Feb 19, 2026

Claude finished @kedarpotdar-nv's task in 4m 50s


Moving and Refactoring qwen3.5_bf16_b200.sh

  • Gather context: read existing B200 SGLang scripts in benchmarks/single_node/ on main
  • Read current benchmarks/qwen3.5_bf16_b200.sh
  • Check config and runner script for path mapping
  • Refactor script to match the single_node pattern
  • Move file: delete old location, create at benchmarks/single_node/
  • Commit and push changes

Moved benchmarks/qwen3.5_bf16_b200.sh → benchmarks/single_node/qwen3.5_bf16_b200.sh and refactored to match the pattern of other B200 SGLang scripts (e.g., dsr1_fp8_b200.sh). Changes:

  • Source path: Updated to ../../benchmark_lib.sh (parent directory reference)
  • EP_SIZE: Added to check_env_vars and passed as --ep-size $EP_SIZE + --data-parallel-size=1 — consistent with how other B200 SGLang scripts handle it
  • Server launch format: Reformatted from multi-line \ continuation to inline style matching dsr1_fp8_b200.sh and dsr1_fp4_b200.sh
  • Comment style: Updated scheduler recv interval comment to match convention

All existing server parameters and optimizations from this PR are preserved (env vars, attention/MOE backends, memory tuning, prefill config, etc.).
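The single_node pattern described above might look roughly like the bash sketch below. It is only a sketch under stated assumptions: the check_env_vars stand-in is hypothetical (the real helper lives in benchmark_lib.sh and may behave differently), the MODEL and TP variable names and their example values are assumptions, and most server flags are omitted for brevity. The command is echoed rather than executed.

```bash
#!/usr/bin/env bash
set -euo pipefail

# In the real script this helper comes from benchmark_lib.sh, sourced
# via the ../../benchmark_lib.sh path mentioned above. The stand-in
# below is hypothetical: it fails fast when a required variable is
# unset or empty.
check_env_vars() {
  local name
  for name in "$@"; do
    if [ -z "${!name:-}" ]; then
      echo "missing required env var: $name" >&2
      return 1
    fi
  done
}

# Example values (assumptions); the runner would normally export these.
export MODEL="${MODEL:-Qwen/Qwen3.5-397B-A17B}"
export TP="${TP:-8}"
export EP_SIZE="${EP_SIZE:-1}"
export CONC="${CONC:-16}"

# EP_SIZE is now part of the required set.
check_env_vars MODEL TP EP_SIZE CONC

# Inline launch style: EP_SIZE is passed through and data parallelism
# is fixed at 1. Remaining server flags are left out of this sketch.
LAUNCH_CMD="python3 -m sglang.launch_server --model-path $MODEL --tensor-parallel-size $TP --ep-size $EP_SIZE --data-parallel-size=1"
echo "$LAUNCH_CMD"
```

Validating EP_SIZE up front means a misconfigured runner fails before the server starts, rather than launching with an empty --ep-size argument.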

Note: This branch still has other scripts in the flat benchmarks/ structure (pre-#666 reorganization). Those will need to be resolved during the rebase — this commit only addresses the qwen3.5 script.
· afb1d7d

github-actions Bot and others added 2 commits February 19, 2026 22:40
Align with the multinode reorganization (PR #666):
- Move script to benchmarks/single_node/ directory
- Update source path to reference parent benchmark_lib.sh
- Add EP_SIZE and data-parallel-size params for consistency
- Reformat server launch command to match other B200 SGLang scripts

Co-authored-by: Kedar Potdar  <kedarpotdar-nv@users.noreply.github.com>
@kedarpotdar-nv kedarpotdar-nv self-assigned this Feb 19, 2026
@kedarpotdar-nv
Collaborator Author

this is good to merge @functionstackx . perf lining up after refactor.

@functionstackx
Contributor

Lgtm @kedarpotdar-nv

Ideally u can add more runners to the cluster before merging, as merging before u do will cause a traffic jam

@kedarpotdar-nv
Collaborator Author

@functionstackx multiple runners are being tested right now. @csahithi will open a PR shortly with improved cleanup logic to enable multiple runners. don't want to keep this held up for that.

@kedarpotdar-nv
Collaborator Author

fwiw this full sweep only takes 2 hours.

@kedarpotdar-nv
Collaborator Author

@functionstackx can you merge whenever? i have stopped the minimax run for this to complete. shouldn't take long

@functionstackx functionstackx merged commit 9dc87a6 into main Feb 20, 2026
9 of 25 checks passed
@functionstackx functionstackx deleted the nv/qwen35-b200-update branch February 20, 2026 05:32
@cquil11 cquil11 changed the title from "Update Qwen3.5 STP recipe on B200" to "[NVIDIA] Update Qwen3.5 STP recipe on B200" Apr 8, 2026
4 participants