
Add Qwen3.5-397B-A17B BF16 B200 SGLang benchmark (STP only)#704

Merged
functionstackx merged 2 commits into main from claude/issue-702-20260216-1859
Feb 17, 2026

Conversation

@functionstackx
Contributor

Add Qwen3.5-397B-A17B FP8 B200 SGLang benchmark script and nvidia-master.yaml config (STP only)

  • New benchmark script benchmarks/qwen3.5_fp8_b200.sh with basic SGLang launch flags
  • Image: lmsysorg/sglang:nightly-dev-20260216-d3bae71e
  • TP=8, concurrency 4-64 for 1k1k, 1k8k, 8k1k
  • Updated perf-changelog.yaml

Closes #702

Generated with Claude Code

Simple benchmark script without speculative decoding or MTP flags:
- No SGLANG_USE_CUDA_IPC_TRANSPORT
- No --reasoning-parser qwen3
- No --speculative-algo NEXTN or related flags
- TP=8, --mem-fraction-static 0.8
- Image: lmsysorg/sglang:nightly-dev-20260216-d3bae71e

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
@functionstackx
Contributor Author

@pr-claude review please

@functionstackx
Contributor Author

@claude review please

@Klaud-Cold
Collaborator

Klaud-Cold commented Feb 16, 2026

Claude finished @functionstackx's task in 3m 21s


Review of PR #704 — Qwen3.5-397B-A17B FP8 B200 SGLang Benchmark

  • Read changed files and gather context
  • Review nvidia-master.yaml changes
  • Review benchmarks/qwen3.5_fp8_b200.sh
  • Review perf-changelog.yaml changes
  • Compare with existing benchmarks for consistency
  • Post review feedback

Summary

The PR adds a new benchmark config for Qwen3.5-397B-A17B FP8 on B200 with SGLang. The config YAML and perf-changelog entries look correct structurally. The main concern is that the benchmark script is very bare compared to existing B200 SGLang scripts and is missing the --quantization fp8 flag despite being labeled as an FP8 benchmark.


Issues

1. Missing --quantization fp8 flag (Bug)

benchmarks/qwen3.5_fp8_b200.sh:30-35 — The server launch command has no FP8 quantization flag. The config key is qwen3.5-fp8-b200-sglang and precision is fp8, but the script doesn't pass --quantization fp8 to SGLang. Compare with dsr1_fp8_b200.sh:74 which passes --quantization fp8. Without this, the model will load in BF16, not FP8, making the benchmark results misleading.

# Current (no quantization flag):
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server \
    --model-path=$MODEL \
    --host=0.0.0.0 \
    --port=$PORT \
    --tensor-parallel-size=$TP \
    --mem-fraction-static $MEM_FRAC_STATIC \
    > $SERVER_LOG 2>&1 &

# Should include:
    --quantization fp8 \

Fix this →

2. Missing --kv-cache-dtype fp8_e4m3 flag

All existing FP8 SGLang B200 benchmarks (dsr1_fp8_b200.sh:72, dsr1_fp8_b200_mtp.sh:69) include --kv-cache-dtype fp8_e4m3 to use FP8 KV cache. This is absent from the new script, which will default to BF16/FP16 KV cache and use significantly more memory.
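Taken together with issue 1, a sketch of what the FP8 launch command would look like, following the dsr1_fp8_b200.sh pattern (the variable names are the ones the script already defines; flag values are copied from the existing FP8 scripts, not newly tuned):

```shell
# Sketch only: the script's existing launch command with the two FP8 flags
# added per the dsr1_fp8_b200.sh pattern. Requires a B200 host with SGLang.
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server \
    --model-path=$MODEL \
    --host=0.0.0.0 \
    --port=$PORT \
    --tensor-parallel-size=$TP \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --mem-fraction-static $MEM_FRAC_STATIC \
    > $SERVER_LOG 2>&1 &
```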

3. PR title says "STP only" but there's no speculative decoding configured

The PR title says "(STP only)" but the script has no speculative decoding flags. The SGLang cookbook for Qwen3.5 recommends NEXTN speculative decoding:

--speculative-algo NEXTN --speculative-num-steps 3 \
--speculative-eagle-topk 1 --speculative-num-draft-tokens 4

If "STP" means "Standard Token Prediction" (i.e., no speculative decoding), this is fine and intentional — just noting the cookbook recommendation for a potential future MTP/STP variant.

4. Missing EP_SIZE env var and --ep-size flag

The nvidia-master.yaml config includes ep: 1 in the search space, but the benchmark script doesn't check for EP_SIZE in check_env_vars and doesn't pass --ep-size to the server. Compare with dsr1_fp8_b200.sh:13 which requires EP_SIZE and passes it at line 74. Qwen3.5-397B-A17B is a MoE model (A17B active out of 397B), so expert parallelism configuration is relevant.
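To illustrate the guard this issue asks for, here is a minimal stand-in for the check_env_vars helper (the function body below is written for illustration only; the real implementation lives in benchmark_lib.sh and may differ):

```shell
#!/usr/bin/env bash
# Illustrative stand-in for benchmark_lib.sh's check_env_vars:
# fail if any named environment variable is unset or empty.
check_env_vars() {
    local var
    for var in "$@"; do
        if [ -z "${!var}" ]; then
            echo "missing required env var: $var" >&2
            return 1
        fi
    done
}

# With EP_SIZE set (the CI maps the config's ep: 1 entry to EP_SIZE),
# the guard passes and the value can be forwarded via --ep-size "$EP_SIZE".
export TP=8 EP_SIZE=1
check_env_vars TP EP_SIZE && echo "env ok"
```

Running the snippet prints `env ok`; dropping the `export EP_SIZE=1` line makes the guard fail, which is exactly the early exit the benchmark script currently lacks.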

Fix this →

Minor Issues

5. Missing performance tuning flags

Compared to the existing dsr1_fp8_b200.sh, the new script is missing several SGLang performance flags that other B200 benchmarks use:

  • --data-parallel-size=1
  • --cuda-graph-max-bs / --max-running-requests (memory optimization)
  • --chunked-prefill-size / --max-prefill-tokens
  • --disable-radix-cache
  • --stream-interval

The issue author explicitly asked to skip the complex TP-conditional tuning logic from dsr1_fp8_b200.sh, which is reasonable. However, some of these flags (especially --disable-radix-cache for throughput benchmarks) are likely still beneficial. This may be intentional for a baseline, but worth noting.
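If any of these are picked up later, they slot straight into the existing launch command; a sketch of the two flags that need no tuning (the batch/prefill-size flags are deliberately omitted here because their values should come from actual tuning, not guesses):

```shell
# Flags other B200 SGLang scripts pass, appended before the log redirect.
# cuda-graph-max-bs / chunked-prefill-size values omitted: they need tuning.
    --data-parallel-size=1 \
    --disable-radix-cache \
```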

6. perf-changelog.yaml — PR link placeholder

perf-changelog.yaml:646 has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX. Should be updated to pull/704.

Fix this →

7. SGLang cookbook recommends SGLANG_USE_CUDA_IPC_TRANSPORT=1

The SGLang Qwen3.5 cookbook recommends SGLANG_USE_CUDA_IPC_TRANSPORT=1 as an environment variable. This isn't used in other B200 scripts either, so it may not apply, but worth investigating.
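If it does turn out to apply, the cookbook usage is just an environment variable set before launch; a sketch (whether it helps on B200 is unverified, per the note above):

```shell
# From the SGLang Qwen3.5 cookbook recommendation; B200 effect unverified.
export SGLANG_USE_CUDA_IPC_TRANSPORT=1
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...
```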


What looks good

  • Config key naming follows the {model}-{precision}-{runner}-{framework} convention
  • model-prefix: qwen3.5 correctly matches the script name mapping (qwen3.5_fp8_b200.sh)
  • Seq-len configs (1k1k, 1k8k, 8k1k) with TP=8 and conc 4-64 are reasonable for initial benchmarking
  • perf-changelog.yaml entry is well-structured
  • Script correctly uses benchmark_lib.sh helpers

Recommendation

The --quantization fp8 and --kv-cache-dtype fp8_e4m3 flags are the most critical missing pieces — without them this isn't actually an FP8 benchmark. The EP_SIZE issue should also be addressed since Qwen3.5 is a MoE model. The missing performance tuning flags are less critical for an initial baseline but should be revisited once initial numbers are in.


@functionstackx functionstackx changed the title Add Qwen3.5-397B-A17B FP8 B200 SGLang benchmark (STP only) Add Qwen3.5-397B-A17B FP8 B200 SGLang benchmark (STP only) (might actually be BF16) Feb 16, 2026
@functionstackx
Contributor Author

prob gonna need a small refactor since it is bf16

@functionstackx
Contributor Author

@claude oops, the precision of the command i pasted is already bf16, can u change the file name to be bf16 and update nvidia-master.yaml and perf-changelog.yaml

@Klaud-Cold
Collaborator

Klaud-Cold commented Feb 17, 2026

Claude finished @functionstackx's task in 1m 18s


Renaming FP8 → BF16 across benchmark files

  • Read current files for context
  • Rename benchmarks/qwen3.5_fp8_b200.sh → benchmarks/qwen3.5_bf16_b200.sh
  • Update nvidia-master.yaml: config key qwen3.5-fp8-b200-sglang → qwen3.5-bf16-b200-sglang, precision fp8 → bf16
  • Update perf-changelog.yaml: config key and description updated to BF16, PR link set to Add Qwen3.5-397B-A17B BF16 B200 SGLang benchmark (STP only) #704
  • Commit and push changes

All three files updated and pushed in commit ce477cf. The benchmark now correctly identifies as BF16 since the script doesn't pass --quantization fp8 to SGLang.

The benchmark script runs without --quantization fp8, so the model
loads in BF16. Rename the script, config key, and changelog entry
to accurately reflect the precision.

- Rename benchmarks/qwen3.5_fp8_b200.sh → qwen3.5_bf16_b200.sh
- Update nvidia-master.yaml config key and precision field
- Update perf-changelog.yaml references and PR link

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
@functionstackx functionstackx changed the title Add Qwen3.5-397B-A17B FP8 B200 SGLang benchmark (STP only) (might actually be BF16) Add Qwen3.5-397B-A17B BF16 B200 SGLang benchmark (STP only) Feb 17, 2026
@functionstackx
Contributor Author

@claude review plz

@Klaud-Cold
Collaborator

Klaud-Cold commented Feb 17, 2026

Claude finished @functionstackx's task in 1m 40s


Review of PR #704 — Qwen3.5-397B-A17B BF16 B200 SGLang Benchmark (STP only)

  • Read changed files and gather context
  • Read existing benchmarks for comparison
  • Review benchmarks/qwen3.5_bf16_b200.sh
  • Review nvidia-master.yaml changes
  • Review perf-changelog.yaml changes
  • Post review feedback

After the FP8→BF16 rename, the script now correctly represents what it does — launching SGLang without --quantization, which defaults to BF16 weights. There's one functional issue and a few consistency items to address.


Issues

1. Missing EP_SIZE env var check and --ep-size flag (Medium)

benchmarks/qwen3.5_bf16_b200.sh:5-12 — The script doesn't check for EP_SIZE in check_env_vars and doesn't pass --ep-size to SGLang. The nvidia-master.yaml config specifies ep: 1 in the search-space, which the CI workflow maps to EP_SIZE. Every other B200 SGLang benchmark (dsr1_fp8_b200.sh:13,74, dsr1_fp8_b200_mtp.sh:13,78, dsr1_fp4_b200.sh:13,41) checks for and passes this variable.

Qwen3.5-397B-A17B is a MoE model (17B active out of 397B parameters), so expert parallelism configuration matters. While EP_SIZE=1 with TP=8 means all experts are on every rank (no expert parallelism), explicitly passing it is important for:

  • Consistency with other benchmarks
  • benchmark_lib.sh:469 uses ${EP_SIZE:-1} in result metadata — it won't break, but explicitly setting it is cleaner
  • Future-proofing if someone adds ep: 8 search-space entries
# Add EP_SIZE to check_env_vars:
check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE

# Add to launch command:
    --ep-size $EP_SIZE \

Fix this →

2. Missing --disable-radix-cache flag (Minor)

benchmarks/qwen3.5_bf16_b200.sh:30-36 — All existing B200 SGLang benchmarks pass --disable-radix-cache for throughput benchmarking. The radix cache is for prefix-sharing optimization across requests, which adds overhead in throughput benchmarks with random prompts. Not having this flag means slightly different benchmarking conditions vs. all other B200 SGLang scripts.

    --disable-radix-cache \

3. Missing --data-parallel-size=1 flag (Minor)

All other B200 SGLang benchmarks explicitly set --data-parallel-size=1. While this is likely the default, explicitly setting it maintains consistency and makes the configuration explicit.


Observations (Non-blocking)

Script is intentionally minimal

The script skips the TP-conditional tuning logic from dsr1_fp8_b200.sh (scheduler recv intervals, cuda-graph-max-bs, max-running-requests, chunked-prefill-size, etc.). This is reasonable for a first BF16 baseline — you'll want to tune these once initial numbers are in.

Other flags to consider adding later for performance tuning:

  • --cuda-graph-max-bs / --max-running-requests — memory optimization, prevents over-allocation
  • --chunked-prefill-size / --max-prefill-tokens — controls prefill chunking
  • --stream-interval — reduces streaming overhead
  • --enable-flashinfer-allreduce-fusion — may help with TP=8

conc-end: 64 for 8k1k may be aggressive

The DSR1 FP8 config uses conc-end: 4 for 8k1k at TP=8 (.github/configs/nvidia-master.yaml:1782). This Qwen3.5 config uses conc-end: 64 for all seq-lens including 8k1k. With BF16 weights (which use more memory than FP8), high concurrency at 8k input length may cause OOM. Consider reducing 8k1k conc-end to something more conservative initially.

perf-changelog.yaml and nvidia-master.yaml look correct

  • Config key qwen3.5-bf16-b200-sglang follows the {model}-{precision}-{runner}-{framework} convention
  • model-prefix: qwen3.5 correctly maps to benchmarks/qwen3.5_bf16_b200.sh via the {model-prefix}_{precision}_{runner}.sh naming pattern
  • PR link correctly set to #704
  • perf-changelog entry is well-structured

Summary

The EP_SIZE issue (#1) should be fixed before merge — it's a consistency requirement that every other B200 SGLang benchmark follows, and it matters for this MoE model. The --disable-radix-cache and --data-parallel-size=1 flags (#2, #3) are minor but would bring the script in line with other benchmarks. The rest is solid for an initial BF16 baseline.



@functionstackx
Contributor Author

  1. PR title says "STP only" but there's no speculative decoding configured
    The PR title says "(STP only)" but the script has no speculative decoding flags. The SGLang cookbook for Qwen3.5 recommends NEXTN speculative decoding:

--speculative-algo NEXTN --speculative-num-steps 3
--speculative-eagle-topk 1 --speculative-num-draft-tokens 4
If "STP" means "Standard Token Prediction" (i.e., no speculative decoding), this is fine and intentional — just noting the cookbook recommendation for a potential future MTP/STP variant.

@claude create/submit/open a new PR from main that updates pr-claude-review.yml, claude.yml, and AGENT.md saying that STP means single token prediction, i.e. no spec decode or MTP

@Klaud-Cold
Collaborator

Klaud-Cold commented Feb 17, 2026

Claude finished @functionstackx's task in 1m 58s


Adding STP definition to project docs

  • Read current files for context
  • Create new branch from main
  • Update AGENTS.md with STP/MTP terminology section
  • Update .github/workflows/claude-pr-review.yml with STP/MTP definitions
  • Update .github/workflows/claude.yml with STP/MTP definitions
  • Commit, push, and open PR

Done

Created PR #706

Added STP/MTP terminology definitions to three files:

  • AGENTS.md — New "Terminology" section before "Key Technologies"
  • .github/workflows/claude-pr-review.yml — New "Terminology" section in the review prompt
  • .github/workflows/claude.yml — Added to "Additional Knowledge" section

STP = Single Token Prediction (standard autoregressive decoding, no speculative decoding or MTP)
MTP = Multi-Token Prediction (multiple tokens per forward pass via speculative decoding like EAGLE/NEXTN)
Branch: docs/add-stp-definition

@functionstackx
Contributor Author

@functionstackx functionstackx merged commit 521d199 into main Feb 17, 2026
10 of 25 checks passed
@functionstackx functionstackx deleted the claude/issue-702-20260216-1859 branch February 17, 2026 04:11

Development

Successfully merging this pull request may close these issues.

add B200 Qwen SGLang BF16
