[NVIDIA] Qwen3.5 B200 SGLang FP4 configs #820

Merged: cquil11 merged 7 commits into main from nv/qwen35-fp4 on Apr 8, 2026

Conversation

@kedarpotdar-nv (Collaborator) commented Feb 27, 2026

Summary

Add FP4 benchmark configuration and launch script for Qwen3.5-397B-A17B on NVIDIA B200 GPUs using SGLang.

Changes

New Benchmark Config (nvidia-master.yaml)

  • Config key: qwen3.5-fp4-b200-sglang
  • Model: nvidia/Qwen3.5-397B-A17B-NVFP4
  • Image: lmsysorg/sglang:v0.5.9-cu129-amd64
  • Precision: FP4 (ModelOpt NVFP4)
  • Sequence length configurations:
    • 1k1k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–64), TP8/EP8 (conc 128)
    • 1k8k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–128)
    • 8k1k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–128)

New Launch Script (benchmarks/single_node/qwen3.5_fp4_b200.sh)

SGLang server configuration with the following flags (assembled command sketched after this list):

  • --quantization modelopt_fp4 with --fp4-gemm-backend flashinfer_cutlass
  • --kv-cache-dtype fp8_e4m3
  • --attention-backend trtllm_mha / --moe-runner-backend flashinfer_trtllm
  • --enable-flashinfer-allreduce-fusion
  • --chunked-prefill-size 32768 / --max-prefill-tokens 32768
  • --disable-radix-cache
  • --mem-fraction-static 0.85
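
For orientation, a minimal sketch of how these flags assemble into a launch command, assuming the script follows the sibling scripts' structure. TP, EP, and PORT are illustrative placeholders, not the script's actual variables:

# Hedged sketch only; the real script may differ.
TP=${TP:-8}
EP=${EP:-1}
PORT=${PORT:-30000}

python3 -m sglang.launch_server \
  --model-path nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --tp-size "$TP" --ep-size "$EP" --port "$PORT" \
  --quantization modelopt_fp4 --fp4-gemm-backend flashinfer_cutlass \
  --kv-cache-dtype fp8_e4m3 \
  --attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
  --enable-flashinfer-allreduce-fusion \
  --chunked-prefill-size 32768 --max-prefill-tokens 32768 \
  --disable-radix-cache \
  --mem-fraction-static 0.85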

Perf Changelog

  • Added entry for the new qwen3.5-fp4-b200-sglang config.

@kedarpotdar-nv (Collaborator, Author):

@claude please add a PR description

@Klaud-Cold (Collaborator) commented Feb 27, 2026

Claude finished @kedarpotdar-nv's task in 40s.


Adding PR Description

  • Gather context from changed files
  • Draft and update PR description

PR description has been added with a summary of all changes: new qwen3.5-fp4-b200-sglang benchmark config, launch script, and perf-changelog entry.

@kedarpotdar-nv kedarpotdar-nv self-assigned this Feb 27, 2026
@kedarpotdar-nv (Collaborator, Author):

@claude please review the PR

@Klaud-Cold (Collaborator) commented Feb 27, 2026

Claude finished @kedarpotdar-nv's task in 2m 41s.


Reviewing PR

  • Gather context from changed files and existing patterns
  • Review benchmark config (nvidia-master.yaml)
  • Review launch script (qwen3.5_fp4_b200.sh)
  • Review perf-changelog entry
  • Post review feedback

Review: qwen3.5-fp4-b200-sglang

Overall this is a clean addition. Config, script, and changelog all follow existing patterns. A few observations below — mostly minor, one worth discussing.


.github/configs/nvidia-master.yaml — Config

Looks good. Image (v0.5.9), model (nvidia/Qwen3.5-397B-A17B-NVFP4), and metadata fields are consistent with repo conventions. The search space is reasonable: TP4/EP1 for lower concurrencies, TP8/EP1 for wider sweeps, and a single TP8/EP8 point at conc=128 for 1k1k. No issues here.


benchmarks/single_node/qwen3.5_fp4_b200.sh — Launch Script

1. Missing --tokenizer-worker-num
The BF16 sibling script (qwen3.5_bf16_b200.sh) sets --tokenizer-worker-num 6. The new FP4 script omits this. At high concurrency (64–128), tokenization can become a bottleneck on a single worker. Was this intentionally dropped, or should it be carried over?

2. Hardcoded --scheduler-recv-interval 30
The BF16 script dynamically sets this (10 for CONC < 16, 30 for CONC ≥ 16). The FP4 script hardcodes 30. For low-concurrency runs (conc=4), a recv interval of 30 adds unnecessary scheduling latency — the scheduler polls less frequently, which can hurt latency at low load. Consider matching the BF16 pattern:

# Match the BF16 pattern: poll more often at low concurrency.
if [[ $CONC -ge 16 ]]; then
  SCHEDULER_RECV_INTERVAL=30
else
  SCHEDULER_RECV_INTERVAL=10
fi


3. No --served-model-name
The BF16 script passes --served-model-name "Qwen/Qwen3.5-397B-A17B". This script omits it. Not a blocker for benchmarking (the benchmark client uses the actual model path), but worth noting for consistency.

4. FP4-specific flags look correct

  • --quantization modelopt_fp4 + --fp4-gemm-backend flashinfer_cutlass — correct for NVFP4
  • --kv-cache-dtype fp8_e4m3 — standard for FP4 configs
  • --attention-backend trtllm_mha — correct for Qwen (MHA, not MLA which is DeepSeek-specific)
  • --moe-runner-backend flashinfer_trtllm — correct for MoE models
  • --mem-fraction-static 0.85 — reasonable bump from BF16's 0.82 given smaller FP4 weights

5. Environment exports inherited from BF16 script
NCCL_NVLS_ENABLE=1, SGL_ENABLE_JIT_DEEPGEMM=false, SGLANG_ENABLE_FLASHINFER_GEMM=true — all look appropriate.
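
For reference, a sketch of how these exports would sit at the top of the script; the comments are my reading of the variable names, not documented semantics:

export NCCL_NVLS_ENABLE=1                  # enable NCCL NVLink SHARP (NVLS) collectives
export SGL_ENABLE_JIT_DEEPGEMM=false       # skip JIT DeepGEMM compilation
export SGLANG_ENABLE_FLASHINFER_GEMM=true  # route GEMMs through FlashInfer kernels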


perf-changelog.yaml

Clean entry, well-formatted, correct PR link. No issues.


Summary

Area            Verdict
Config YAML     ✅ Good
Launch script   🟡 Minor (missing --tokenizer-worker-num, hardcoded scheduler interval)
Perf changelog  ✅ Good

The script is functional as-is. The two items flagged (tokenizer workers and dynamic scheduler interval) are nice-to-haves that would align with the BF16 sibling script and could improve low-concurrency latency. Up to you whether to address now or in a follow-up.


@kedarpotdar-nv (Collaborator, Author):

@functionstackx collaborating on this existing PR to get our FP8 and FP4 configs merged upstream in sgl-cookbook. sgl-project/sgl-cookbook#168

Thanks!

@functionstackx (Contributor) replied:

Thanks! Excited to see continued focus on ensuring that end users in the ML community can benefit from these optimized flag recipes for selecting the correct MoE and attention kernel backends.

@functionstackx (Contributor) left a review comment:


lgtm

@kedarpotdar-nv (Collaborator, Author):

/sweep run only evals

@github-actions (Contributor):

@kedarpotdar-nv Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/22507799181
Command: run only evals
Pinned ref: c738532
Approval: not required (trusted collaborator).

@cquil11 (Collaborator) commented Mar 9, 2026

@kedarpotdar-nv some merge conflicts, then looks ready to go?

@kedarpotdar-nv (Collaborator, Author):

@ankursingh-nv will take over

@ankursingh-nv force-pushed the nv/qwen35-fp4 branch 2 times, most recently from 86e3eec to 8d9f7c9 (March 9, 2026 23:17)
@functionstackx (Contributor):

@claude resolve the merge conflicts & update the PR to follow the updated conventions. Remember that qwen3.5-fp4-b200-sglang should be at the bottom of perf-changelog.yaml, remember to delete 1k/8k from nvidia-master, and remember to do the EVAL_CONTEXT_ARGS thing.

@Klaud-Cold (Collaborator) commented Mar 29, 2026

Claude finished @functionstackx's task in 3m 16s.


Resolving Merge Conflicts & Updating Conventions

  • Gather context (check merge conflicts, current conventions)
  • Reset branch to origin/main and apply changes cleanly
  • Add qwen3.5-fp4-b200-sglang config to nvidia-master.yaml (1k1k and 8k1k only, no 1k8k)
  • Create launch script with updated conventions (EVAL_CONTEXT_ARGS, dynamic scheduler interval, tokenizer-worker-num)
  • Add perf-changelog entry at the bottom
  • Commit and push changes

Summary

Resolved merge conflicts by rebasing cleanly on origin/main (single clean commit). Changes applied:

nvidia-master.yaml — Added qwen3.5-fp4-b200-sglang config with only 1k1k and 8k1k seq-len configs (removed 1k8k). Placed after qwen3.5-fp8-b200-sglang.

benchmarks/single_node/qwen3.5_fp4_b200.sh — Rewritten to follow the updated conventions from the FP8 sibling script (sketched after this list):

  • Dynamic SCHEDULER_RECV_INTERVAL (10 for CONC < 16, 30 for CONC >= 16)
  • EVAL_CONTEXT_ARGS pattern (setup_eval_context + EVAL_MAX_MODEL_LEN)
  • --tokenizer-worker-num 6
  • Conditional --enable-flashinfer-allreduce-fusion (only for TP=8)
  • CONTEXT_LENGTH=$((ISL + OSL + 20)) instead of hardcoded MAX_MODEL_LEN
  • MAX_RUNNING_REQUESTS=128, CUDA_GRAPH_MAX_BATCH_SIZE=$CONC
  • --data-parallel-size=1
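
A minimal sketch of the convention changes listed above; ISL, OSL, CONC, and TP mirror the description and are assumptions about the script's variable names, not verbatim excerpts:

CONTEXT_LENGTH=$((ISL + OSL + 20))   # derive context length from seq lens plus small headroom

if [[ $CONC -ge 16 ]]; then          # dynamic scheduler recv interval
  SCHEDULER_RECV_INTERVAL=30
else
  SCHEDULER_RECV_INTERVAL=10
fi

FUSION_ARGS=""
if [[ $TP -eq 8 ]]; then             # allreduce fusion enabled only at TP=8
  FUSION_ARGS="--enable-flashinfer-allreduce-fusion"
fi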

perf-changelog.yaml — Entry added at the bottom.

@functionstackx (Contributor):

@kedarpotdar-nv @ankursingh-nv the eval score is not good; it doesn't even pass a relaxed floor model-eval threshold: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23698363526/job/69037542640?pr=820

[image: eval results]

+viz @Oseltamivir

@functionstackx functionstackx requested a review from cquil11 March 29, 2026 03:00
@github-actions (Contributor):

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

@functionstackx functionstackx requested review from cquil11 and functionstackx and removed request for ankursingh-nv and cquil11 March 29, 2026 03:01
@kedarpotdar-nv changed the title from "[NV] Qwen3.5 B200 SGLang FP4 configs" to "[NV - WIP] Qwen3.5 B200 SGLang FP4 configs" on Mar 29, 2026
@hshrivastava-droid changed the title from "[NV - WIP] Qwen3.5 B200 SGLang FP4 configs" to "Qwen3.5 B200 SGLang FP4 configs" on Apr 6, 2026
@jgangani (Collaborator) left a review comment:


LGTM

@functionstackx (Contributor) left a review comment:


recipes?

faradawn added a commit to faradawn/sgl-cookbook that referenced this pull request Apr 8, 2026
Based on SemiAnalysisAI/InferenceX#820.

- Set mem-fraction-static to 0.85 for B200 FP4 (benchmark uses 0.85)
- Add --quantization modelopt_fp4 (required flag, was missing)
- Add --chunked-prefill-size 32768, --max-prefill-tokens 32768
- Add --max-running-requests 128, --stream-interval 30
- Add --disable-radix-cache (always required for FP4)
- Skip --enable-flashinfer-allreduce-fusion for FP4 (TP=4, not used per benchmark)

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
@hshrivastava-droid (Collaborator):

@functionstackx - could you please help review this?

@cquil11 changed the title from "Qwen3.5 B200 SGLang FP4 configs" to "[NVIDIA] Qwen3.5 B200 SGLang FP4 configs" on Apr 8, 2026
Comment thread: .github/configs/nvidia-master.yaml
@cquil11 (Collaborator) left a comment:


nightly image fine for new arch

@cquil11 (Collaborator) commented Apr 8, 2026

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24107321633

Evals look good, throughput looks good.
Merging.

@cquil11 cquil11 merged commit 48d35c4 into main Apr 8, 2026
9 of 23 checks passed
@cquil11 cquil11 deleted the nv/qwen35-fp4 branch April 8, 2026 19:51
zijiexia added a commit to sgl-project/sgl-cookbook that referenced this pull request Apr 10, 2026
* MiniMax-M2.5 B200: add EP, FP8 KV cache, disable radix cache

Based on validated benchmark configs in SemiAnalysisAI/InferenceX#1010,
tp:4/ep:4 and tp:2/ep:2 are now confirmed for B200. Also enables 2-GPU
selection for B200, adds --kv-cache-dtype fp8_e4m3 and --disable-radix-cache
as B200-specific flags per the benchmark script.

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

* Update Qwen35ConfigGenerator for B200 FP4 (NVFP4)

Based on SemiAnalysisAI/InferenceX#820.

- Set mem-fraction-static to 0.85 for B200 FP4 (benchmark uses 0.85)
- Add --quantization modelopt_fp4 (required flag, was missing)
- Add --chunked-prefill-size 32768, --max-prefill-tokens 32768
- Add --max-running-requests 128, --stream-interval 30
- Add --disable-radix-cache (always required for FP4)
- Skip --enable-flashinfer-allreduce-fusion for FP4 (TP=4, not used per benchmark)

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

* Remove --disable-radix-cache flag for B200 in MiniMaxM25ConfigGenerator

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

* revert: remove accidental MiniMax B200 changes from Qwen3.5 PR

PR #230 should only touch Qwen35ConfigGenerator. Revert all changes to
MiniMaxM25ConfigGenerator (B200 2-GPU support, B200 EP, B200 kv-cache-dtype)
that were accidentally included on this branch.

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

* revert: restore MiniMax comment order to match main

Undo accidental comment/variable reorder in MiniMaxM25ConfigGenerator
that was not part of the intended Qwen3.5 B200 FP4 changes.

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>

* Update Qwen3.5 config to conditionally enable allreduce fusion based on quantization

---------

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>
Co-authored-by: Zijie Xia <zijie_xia@icloud.com>