[NVIDIA] Qwen3.5 B200 SGLang FP4 configs #820

Merged: cquil11 merged 7 commits into main from nv/qwen35-fp4 on Apr 8, 2026

Conversation

@kedarpotdar-nv (Collaborator) commented Feb 27, 2026

Summary

Add FP4 benchmark configuration and launch script for Qwen3.5-397B-A17B on NVIDIA B200 GPUs using SGLang.

Changes

New Benchmark Config (nvidia-master.yaml)

  • Config key: qwen3.5-fp4-b200-sglang
  • Model: nvidia/Qwen3.5-397B-A17B-NVFP4
  • Image: lmsysorg/sglang:v0.5.9-cu129-amd64
  • Precision: FP4 (ModelOpt NVFP4)
  • Sequence length configurations:
    • 1k1k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–64), TP8/EP8 (conc 128)
    • 1k8k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–128)
    • 8k1k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–128)

New Launch Script (benchmarks/single_node/qwen3.5_fp4_b200.sh)

SGLang server configuration with the following flags (assembled command sketched after this list):

  • --quantization modelopt_fp4 with --fp4-gemm-backend flashinfer_cutlass
  • --kv-cache-dtype fp8_e4m3
  • --attention-backend trtllm_mha / --moe-runner-backend flashinfer_trtllm
  • --enable-flashinfer-allreduce-fusion
  • --chunked-prefill-size 32768 / --max-prefill-tokens 32768
  • --disable-radix-cache
  • --mem-fraction-static 0.85
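
For orientation, a minimal sketch of how these flags assemble into a launch command, assuming the script follows the sibling scripts' structure. TP, EP, and PORT are illustrative placeholders, not the script's actual variables:

# Hedged sketch only; the real script may differ.
TP=${TP:-8}
EP=${EP:-1}
PORT=${PORT:-30000}

python3 -m sglang.launch_server \
  --model-path nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --tp-size "$TP" --ep-size "$EP" --port "$PORT" \
  --quantization modelopt_fp4 --fp4-gemm-backend flashinfer_cutlass \
  --kv-cache-dtype fp8_e4m3 \
  --attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
  --enable-flashinfer-allreduce-fusion \
  --chunked-prefill-size 32768 --max-prefill-tokens 32768 \
  --disable-radix-cache \
  --mem-fraction-static 0.85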

Perf Changelog

  • Added entry for the new qwen3.5-fp4-b200-sglang config.

@kedarpotdar-nv (Collaborator, Author):

@claude please add a PR description

@Klaud-Cold (Collaborator) commented Feb 27, 2026

Claude finished @kedarpotdar-nv's task in 40s.


Adding PR Description

  • Gather context from changed files
  • Draft and update PR description

PR description has been added with a summary of all changes: new qwen3.5-fp4-b200-sglang benchmark config, launch script, and perf-changelog entry.

@kedarpotdar-nv kedarpotdar-nv self-assigned this Feb 27, 2026
@kedarpotdar-nv (Collaborator, Author):

@claude please review the PR

@Klaud-Cold (Collaborator) commented Feb 27, 2026

Claude finished @kedarpotdar-nv's task in 2m 41s.


Reviewing PR

  • Gather context from changed files and existing patterns
  • Review benchmark config (nvidia-master.yaml)
  • Review launch script (qwen3.5_fp4_b200.sh)
  • Review perf-changelog entry
  • Post review feedback

Review: qwen3.5-fp4-b200-sglang

Overall this is a clean addition. Config, script, and changelog all follow existing patterns. A few observations below — mostly minor, one worth discussing.


.github/configs/nvidia-master.yaml — Config

Looks good. Image (v0.5.9), model (nvidia/Qwen3.5-397B-A17B-NVFP4), and metadata fields are consistent with repo conventions. The search space is reasonable: TP4/EP1 for lower concurrencies, TP8/EP1 for wider sweeps, and a single TP8/EP8 point at conc=128 for 1k1k. No issues here.


benchmarks/single_node/qwen3.5_fp4_b200.sh — Launch Script

1. Missing --tokenizer-worker-num
The BF16 sibling script (qwen3.5_bf16_b200.sh) sets --tokenizer-worker-num 6. The new FP4 script omits this. At high concurrency (64–128), tokenization can become a bottleneck on a single worker. Was this intentionally dropped, or should it be carried over?

2. Hardcoded --scheduler-recv-interval 30
The BF16 script dynamically sets this (10 for CONC < 16, 30 for CONC ≥ 16). The FP4 script hardcodes 30. For low-concurrency runs (conc=4), a recv interval of 30 adds unnecessary scheduling latency — the scheduler polls less frequently, which can hurt latency at low load. Consider matching the BF16 pattern:

# Match the BF16 pattern: poll more often at low concurrency.
if [[ $CONC -ge 16 ]]; then
  SCHEDULER_RECV_INTERVAL=30
else
  SCHEDULER_RECV_INTERVAL=10
fi


3. No --served-model-name
The BF16 script passes --served-model-name "Qwen/Qwen3.5-397B-A17B". This script omits it. Not a blocker for benchmarking (the benchmark client uses the actual model path), but worth noting for consistency.

4. FP4-specific flags look correct

  • --quantization modelopt_fp4 + --fp4-gemm-backend flashinfer_cutlass — correct for NVFP4
  • --kv-cache-dtype fp8_e4m3 — standard for FP4 configs
  • --attention-backend trtllm_mha — correct for Qwen (MHA, not MLA which is DeepSeek-specific)
  • --moe-runner-backend flashinfer_trtllm — correct for MoE models
  • --mem-fraction-static 0.85 — reasonable bump from BF16's 0.82 given smaller FP4 weights

5. Environment exports inherited from BF16 script
NCCL_NVLS_ENABLE=1, SGL_ENABLE_JIT_DEEPGEMM=false, SGLANG_ENABLE_FLASHINFER_GEMM=true — all look appropriate.
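
For reference, a sketch of how these exports would sit at the top of the script; the comments are my reading of the variable names, not documented semantics:

export NCCL_NVLS_ENABLE=1                  # enable NCCL NVLink SHARP (NVLS) collectives
export SGL_ENABLE_JIT_DEEPGEMM=false       # skip JIT DeepGEMM compilation
export SGLANG_ENABLE_FLASHINFER_GEMM=true  # route GEMMs through FlashInfer kernels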


perf-changelog.yaml

Clean entry, well-formatted, correct PR link. No issues.


Summary

Area            Verdict
Config YAML     ✅ Good
Launch script   🟡 Minor (missing --tokenizer-worker-num, hardcoded scheduler interval)
Perf changelog  ✅ Good

The script is functional as-is. The two items flagged (tokenizer workers and dynamic scheduler interval) are nice-to-haves that would align with the BF16 sibling script and could improve low-concurrency latency. Up to you whether to address now or in a follow-up.


@kedarpotdar-nv (Collaborator, Author):

@functionstackx collaborating on this existing PR to get our FP8 and FP4 configs merged upstream in sgl-cookbook. sgl-project/sgl-cookbook#168

Thanks!

@functionstackx (Contributor) replied:

Thanks! Excited to see continued focus on ensuring that end users in the ML community can benefit from these optimized flag recipes for selecting the correct MoE and attention kernel backends.

@functionstackx (Contributor) left a review comment:


lgtm

@kedarpotdar-nv (Collaborator, Author):

/sweep run only evals

@github-actions (Contributor):

@kedarpotdar-nv Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/22507799181
Command: run only evals
Pinned ref: c738532
Approval: not required (trusted collaborator).

@cquil11 (Collaborator) commented Mar 9, 2026

@kedarpotdar-nv some merge conflicts, then looks ready to go?

@kedarpotdar-nv (Collaborator, Author):

@ankursingh-nv will take over

@ankursingh-nv force-pushed the nv/qwen35-fp4 branch 2 times, most recently from 86e3eec to 8d9f7c9 (March 9, 2026 23:17)
@functionstackx (Contributor):

@claude resolve the merge conflicts & update the PR to follow the updated conventions. Remember that qwen3.5-fp4-b200-sglang should be at the bottom of perf-changelog.yaml, remember to delete 1k/8k from nvidia-master, and remember to do the EVAL_CONTEXT_ARGS thing.

@Klaud-Cold (Collaborator) commented Mar 29, 2026

Claude finished @functionstackx's task in 3m 16s.


Resolving Merge Conflicts & Updating Conventions

  • Gather context (check merge conflicts, current conventions)
  • Reset branch to origin/main and apply changes cleanly
  • Add qwen3.5-fp4-b200-sglang config to nvidia-master.yaml (1k1k and 8k1k only, no 1k8k)
  • Create launch script with updated conventions (EVAL_CONTEXT_ARGS, dynamic scheduler interval, tokenizer-worker-num)
  • Add perf-changelog entry at the bottom
  • Commit and push changes

Summary

Resolved merge conflicts by rebasing cleanly on origin/main (single clean commit). Changes applied:

nvidia-master.yaml — Added qwen3.5-fp4-b200-sglang config with only 1k1k and 8k1k seq-len configs (removed 1k8k). Placed after qwen3.5-fp8-b200-sglang.

benchmarks/single_node/qwen3.5_fp4_b200.sh — Rewritten to follow the updated conventions from the FP8 sibling script (sketched after this list):

  • Dynamic SCHEDULER_RECV_INTERVAL (10 for CONC < 16, 30 for CONC >= 16)
  • EVAL_CONTEXT_ARGS pattern (setup_eval_context + EVAL_MAX_MODEL_LEN)
  • --tokenizer-worker-num 6
  • Conditional --enable-flashinfer-allreduce-fusion (only for TP=8)
  • CONTEXT_LENGTH=$((ISL + OSL + 20)) instead of hardcoded MAX_MODEL_LEN
  • MAX_RUNNING_REQUESTS=128, CUDA_GRAPH_MAX_BATCH_SIZE=$CONC
  • --data-parallel-size=1
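
A minimal sketch of the convention changes listed above; ISL, OSL, CONC, and TP mirror the description and are assumptions about the script's variable names, not verbatim excerpts:

CONTEXT_LENGTH=$((ISL + OSL + 20))   # derive context length from seq lens plus small headroom

if [[ $CONC -ge 16 ]]; then          # dynamic scheduler recv interval
  SCHEDULER_RECV_INTERVAL=30
else
  SCHEDULER_RECV_INTERVAL=10
fi

FUSION_ARGS=""
if [[ $TP -eq 8 ]]; then             # allreduce fusion enabled only at TP=8
  FUSION_ARGS="--enable-flashinfer-allreduce-fusion"
fi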

perf-changelog.yaml — Entry added at the bottom.

@functionstackx (Contributor):

@kedarpotdar-nv @ankursingh-nv the eval score is not good; it doesn't even pass a relaxed floor model-eval threshold: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23698363526/job/69037542640?pr=820

[image: eval results]

+viz @Oseltamivir

@functionstackx functionstackx requested a review from cquil11 March 29, 2026 03:00
@github-actions (Contributor):

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

@functionstackx functionstackx requested review from cquil11 and functionstackx and removed request for ankursingh-nv and cquil11 March 29, 2026 03:01
@kedarpotdar-nv changed the title from "[NV] Qwen3.5 B200 SGLang FP4 configs" to "[NV - WIP] Qwen3.5 B200 SGLang FP4 configs" on Mar 29, 2026
@hshrivastava-droid changed the title from "[NV - WIP] Qwen3.5 B200 SGLang FP4 configs" to "Qwen3.5 B200 SGLang FP4 configs" on Apr 6, 2026
@jgangani (Collaborator) left a review comment:


LGTM

@functionstackx (Contributor) left a review comment:


recipes?

faradawn added a commit to faradawn/sgl-cookbook that referenced this pull request Apr 8, 2026
Based on SemiAnalysisAI/InferenceX#820.

- Set mem-fraction-static to 0.85 for B200 FP4 (benchmark uses 0.85)
- Add --quantization modelopt_fp4 (required flag, was missing)
- Add --chunked-prefill-size 32768, --max-prefill-tokens 32768
- Add --max-running-requests 128, --stream-interval 30
- Add --disable-radix-cache (always required for FP4)
- Skip --enable-flashinfer-allreduce-fusion for FP4 (TP=4, not used per benchmark)

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
@hshrivastava-droid (Collaborator):

@functionstackx - could you please help review this?

@cquil11 changed the title from "Qwen3.5 B200 SGLang FP4 configs" to "[NVIDIA] Qwen3.5 B200 SGLang FP4 configs" on Apr 8, 2026
Comment thread: .github/configs/nvidia-master.yaml
@cquil11 (Collaborator) left a comment:


nightly image fine for new arch

@cquil11 (Collaborator) commented Apr 8, 2026

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24107321633

Evals look good, throughput looks good.
Merging.

@cquil11 cquil11 merged commit 48d35c4 into main Apr 8, 2026
9 of 23 checks passed
@cquil11 cquil11 deleted the nv/qwen35-fp4 branch April 8, 2026 19:51
zijiexia added a commit to sgl-project/sgl-cookbook that referenced this pull request Apr 10, 2026
* MiniMax-M2.5 B200: add EP, FP8 KV cache, disable radix cache

Based on validated benchmark configs in SemiAnalysisAI/InferenceX#1010,
tp:4/ep:4 and tp:2/ep:2 are now confirmed for B200. Also enables 2-GPU
selection for B200, adds --kv-cache-dtype fp8_e4m3 and --disable-radix-cache
as B200-specific flags per the benchmark script.

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

* Update Qwen35ConfigGenerator for B200 FP4 (NVFP4)

Based on SemiAnalysisAI/InferenceX#820.

- Set mem-fraction-static to 0.85 for B200 FP4 (benchmark uses 0.85)
- Add --quantization modelopt_fp4 (required flag, was missing)
- Add --chunked-prefill-size 32768, --max-prefill-tokens 32768
- Add --max-running-requests 128, --stream-interval 30
- Add --disable-radix-cache (always required for FP4)
- Skip --enable-flashinfer-allreduce-fusion for FP4 (TP=4, not used per benchmark)

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

* Remove --disable-radix-cache flag for B200 in MiniMaxM25ConfigGenerator

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

* revert: remove accidental MiniMax B200 changes from Qwen3.5 PR

PR #230 should only touch Qwen35ConfigGenerator. Revert all changes to
MiniMaxM25ConfigGenerator (B200 2-GPU support, B200 EP, B200 kv-cache-dtype)
that were accidentally included on this branch.

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

* revert: restore MiniMax comment order to match main

Undo accidental comment/variable reorder in MiniMaxM25ConfigGenerator
that was not part of the intended Qwen3.5 B200 FP4 changes.

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

* Update Qwen3.5 config to conditionally enable allreduce fusion based on quantization

---------

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>
Co-authored-by: Zijie Xia <zijie_xia@icloud.com>