
Add Qwen3.5-397B-A17B BF16 B200 SGLang benchmark (STP only)#704

Merged
functionstackx merged 2 commits into main from claude/issue-702-20260216-1859
Feb 17, 2026

Conversation

@functionstackx
Contributor

Add Qwen3.5-397B-A17B FP8 B200 SGLang benchmark script and nvidia-master.yaml config (STP only)

  • New benchmark script benchmarks/qwen3.5_fp8_b200.sh with basic SGLang launch flags
  • Image: lmsysorg/sglang:nightly-dev-20260216-d3bae71e
  • TP=8, concurrency 4-64 for 1k1k, 1k8k, 8k1k
  • Updated perf-changelog.yaml

Closes #702

Generated with Claude Code

Simple benchmark script without speculative decoding or MTP flags:
- No SGLANG_USE_CUDA_IPC_TRANSPORT
- No --reasoning-parser qwen3
- No --speculative-algo NEXTN or related flags
- TP=8, --mem-fraction-static 0.8
- Image: lmsysorg/sglang:nightly-dev-20260216-d3bae71e

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
@functionstackx
Contributor Author

@pr-claude review please

@functionstackx
Contributor Author

@claude review please

@Klaud-Cold
Collaborator

Klaud-Cold commented Feb 16, 2026

Claude finished @functionstackx's task in 3m 21s


Review of PR #704 — Qwen3.5-397B-A17B FP8 B200 SGLang Benchmark

  • Read changed files and gather context
  • Review nvidia-master.yaml changes
  • Review benchmarks/qwen3.5_fp8_b200.sh
  • Review perf-changelog.yaml changes
  • Compare with existing benchmarks for consistency
  • Post review feedback

Summary

The PR adds a new benchmark config for Qwen3.5-397B-A17B FP8 on B200 with SGLang. The config YAML and perf-changelog entries look correct structurally. The main concern is that the benchmark script is very bare compared to existing B200 SGLang scripts and is missing the --quantization fp8 flag despite being labeled as an FP8 benchmark.


Issues

1. Missing --quantization fp8 flag (Bug)

benchmarks/qwen3.5_fp8_b200.sh:30-35 — The server launch command has no FP8 quantization flag. The config key is qwen3.5-fp8-b200-sglang and precision is fp8, but the script doesn't pass --quantization fp8 to SGLang. Compare with dsr1_fp8_b200.sh:74 which passes --quantization fp8. Without this, the model will load in BF16, not FP8, making the benchmark results misleading.

# Current (no quantization flag):
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server \
    --model-path=$MODEL \
    --host=0.0.0.0 \
    --port=$PORT \
    --tensor-parallel-size=$TP \
    --mem-fraction-static $MEM_FRAC_STATIC \
    > $SERVER_LOG 2>&1 &

# Should include:
    --quantization fp8 \

Fix this →

2. Missing --kv-cache-dtype fp8_e4m3 flag

All existing FP8 SGLang B200 benchmarks (dsr1_fp8_b200.sh:72, dsr1_fp8_b200_mtp.sh:69) include --kv-cache-dtype fp8_e4m3 to use FP8 KV cache. This is absent from the new script, which will default to BF16/FP16 KV cache and use significantly more memory.
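Taken together with issue 1, a sketch of what the FP8 launch command would look like, following the dsr1_fp8_b200.sh pattern (the variable names are the ones the script already defines; flag values are copied from the existing FP8 scripts, not newly tuned):

```shell
# Sketch only: the script's existing launch command with the two FP8 flags
# added per the dsr1_fp8_b200.sh pattern. Requires a B200 host with SGLang.
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server \
    --model-path=$MODEL \
    --host=0.0.0.0 \
    --port=$PORT \
    --tensor-parallel-size=$TP \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --mem-fraction-static $MEM_FRAC_STATIC \
    > $SERVER_LOG 2>&1 &
```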

3. PR title says "STP only" but there's no speculative decoding configured

The PR title says "(STP only)" but the script has no speculative decoding flags. The SGLang cookbook for Qwen3.5 recommends NEXTN speculative decoding:

--speculative-algo NEXTN --speculative-num-steps 3 \
--speculative-eagle-topk 1 --speculative-num-draft-tokens 4

If "STP" means "Standard Token Prediction" (i.e., no speculative decoding), this is fine and intentional — just noting the cookbook recommendation for a potential future MTP/STP variant.

4. Missing EP_SIZE env var and --ep-size flag

The nvidia-master.yaml config includes ep: 1 in the search space, but the benchmark script doesn't check for EP_SIZE in check_env_vars and doesn't pass --ep-size to the server. Compare with dsr1_fp8_b200.sh:13 which requires EP_SIZE and passes it at line 74. Qwen3.5-397B-A17B is a MoE model (A17B active out of 397B), so expert parallelism configuration is relevant.
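To illustrate the guard this issue asks for, here is a minimal stand-in for the check_env_vars helper (the function body below is written for illustration only; the real implementation lives in benchmark_lib.sh and may differ):

```shell
#!/usr/bin/env bash
# Illustrative stand-in for benchmark_lib.sh's check_env_vars:
# fail if any named environment variable is unset or empty.
check_env_vars() {
    local var
    for var in "$@"; do
        if [ -z "${!var}" ]; then
            echo "missing required env var: $var" >&2
            return 1
        fi
    done
}

# With EP_SIZE set (the CI maps the config's ep: 1 entry to EP_SIZE),
# the guard passes and the value can be forwarded via --ep-size "$EP_SIZE".
export TP=8 EP_SIZE=1
check_env_vars TP EP_SIZE && echo "env ok"
```

Running the snippet prints `env ok`; dropping the `export EP_SIZE=1` line makes the guard fail, which is exactly the early exit the benchmark script currently lacks.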

Fix this →

Minor Issues

5. Missing performance tuning flags

Compared to the existing dsr1_fp8_b200.sh, the new script is missing several SGLang performance flags that other B200 benchmarks use:

  • --data-parallel-size=1
  • --cuda-graph-max-bs / --max-running-requests (memory optimization)
  • --chunked-prefill-size / --max-prefill-tokens
  • --disable-radix-cache
  • --stream-interval

The issue author explicitly asked to skip the complex TP-conditional tuning logic from dsr1_fp8_b200.sh, which is reasonable. However, some of these flags (especially --disable-radix-cache for throughput benchmarks) are likely still beneficial. This may be intentional for a baseline, but worth noting.
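If any of these are picked up later, they slot straight into the existing launch command; a sketch of the two flags that need no tuning (the batch/prefill-size flags are deliberately omitted here because their values should come from actual tuning, not guesses):

```shell
# Flags other B200 SGLang scripts pass, appended before the log redirect.
# cuda-graph-max-bs / chunked-prefill-size values omitted: they need tuning.
    --data-parallel-size=1 \
    --disable-radix-cache \
```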

6. perf-changelog.yaml — PR link placeholder

perf-changelog.yaml:646 has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX. Should be updated to pull/704.

Fix this →

7. SGLang cookbook recommends SGLANG_USE_CUDA_IPC_TRANSPORT=1

The SGLang Qwen3.5 cookbook recommends SGLANG_USE_CUDA_IPC_TRANSPORT=1 as an environment variable. This isn't used in other B200 scripts either, so it may not apply, but worth investigating.
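If it does turn out to apply, the cookbook usage is just an environment variable set before launch; a sketch (whether it helps on B200 is unverified, per the note above):

```shell
# From the SGLang Qwen3.5 cookbook recommendation; B200 effect unverified.
export SGLANG_USE_CUDA_IPC_TRANSPORT=1
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...
```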


What looks good

  • Config key naming follows the {model}-{precision}-{runner}-{framework} convention
  • model-prefix: qwen3.5 correctly matches the script name mapping (qwen3.5_fp8_b200.sh)
  • Seq-len configs (1k1k, 1k8k, 8k1k) with TP=8 and conc 4-64 are reasonable for initial benchmarking
  • perf-changelog.yaml entry is well-structured
  • Script correctly uses benchmark_lib.sh helpers

Recommendation

The --quantization fp8 and --kv-cache-dtype fp8_e4m3 flags are the most critical missing pieces — without them this isn't actually an FP8 benchmark. The EP_SIZE issue should also be addressed since Qwen3.5 is a MoE model. The missing performance tuning flags are less critical for an initial baseline but should be revisited once initial numbers are in.


@functionstackx functionstackx changed the title Add Qwen3.5-397B-A17B FP8 B200 SGLang benchmark (STP only) Add Qwen3.5-397B-A17B FP8 B200 SGLang benchmark (STP only) (might actually be BF16) Feb 16, 2026
@functionstackx
Contributor Author

prob gonna need a small refactor since it is bf16

@functionstackx
Contributor Author

@claude oops, the precision of the command i pasted is already bf16, can u change the file name to be bf16 and update nvidia-master.yaml and perf-changelog.yaml

@Klaud-Cold
Collaborator

Klaud-Cold commented Feb 17, 2026

Claude finished @functionstackx's task in 1m 18s


Renaming FP8 → BF16 across benchmark files

  • Read current files for context
  • Rename benchmarks/qwen3.5_fp8_b200.sh → benchmarks/qwen3.5_bf16_b200.sh
  • Update nvidia-master.yaml: config key qwen3.5-fp8-b200-sglang → qwen3.5-bf16-b200-sglang, precision fp8 → bf16
  • Update perf-changelog.yaml: config key and description updated to BF16, PR link set to Add Qwen3.5-397B-A17B BF16 B200 SGLang benchmark (STP only) #704
  • Commit and push changes

All three files updated and pushed in commit ce477cf. The benchmark now correctly identifies as BF16 since the script doesn't pass --quantization fp8 to SGLang.

The benchmark script runs without --quantization fp8, so the model
loads in BF16. Rename the script, config key, and changelog entry
to accurately reflect the precision.

- Rename benchmarks/qwen3.5_fp8_b200.sh → qwen3.5_bf16_b200.sh
- Update nvidia-master.yaml config key and precision field
- Update perf-changelog.yaml references and PR link

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
@functionstackx functionstackx changed the title Add Qwen3.5-397B-A17B FP8 B200 SGLang benchmark (STP only) (might actually be BF16) Add Qwen3.5-397B-A17B BF16 B200 SGLang benchmark (STP only) Feb 17, 2026
@functionstackx
Contributor Author

@claude review plz

@Klaud-Cold
Collaborator

Klaud-Cold commented Feb 17, 2026

Claude finished @functionstackx's task in 1m 40s


Review of PR #704 — Qwen3.5-397B-A17B BF16 B200 SGLang Benchmark (STP only)

  • Read changed files and gather context
  • Read existing benchmarks for comparison
  • Review benchmarks/qwen3.5_bf16_b200.sh
  • Review nvidia-master.yaml changes
  • Review perf-changelog.yaml changes
  • Post review feedback

After the FP8→BF16 rename, the script now correctly represents what it does — launching SGLang without --quantization, which defaults to BF16 weights. There's one functional issue and a few consistency items to address.


Issues

1. Missing EP_SIZE env var check and --ep-size flag (Medium)

benchmarks/qwen3.5_bf16_b200.sh:5-12 — The script doesn't check for EP_SIZE in check_env_vars and doesn't pass --ep-size to SGLang. The nvidia-master.yaml config specifies ep: 1 in the search-space, which the CI workflow maps to EP_SIZE. Every other B200 SGLang benchmark (dsr1_fp8_b200.sh:13,74, dsr1_fp8_b200_mtp.sh:13,78, dsr1_fp4_b200.sh:13,41) checks for and passes this variable.

Qwen3.5-397B-A17B is a MoE model (17B active out of 397B parameters), so expert parallelism configuration matters. While EP_SIZE=1 with TP=8 means all experts are on every rank (no expert parallelism), explicitly passing it is important for:

  • Consistency with other benchmarks
  • benchmark_lib.sh:469 uses ${EP_SIZE:-1} in result metadata — it won't break, but explicitly setting it is cleaner
  • Future-proofing if someone adds ep: 8 search-space entries
# Add EP_SIZE to check_env_vars:
check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE

# Add to launch command:
    --ep-size $EP_SIZE \

Fix this →

2. Missing --disable-radix-cache flag (Minor)

benchmarks/qwen3.5_bf16_b200.sh:30-36 — All existing B200 SGLang benchmarks pass --disable-radix-cache for throughput benchmarking. The radix cache is for prefix-sharing optimization across requests, which adds overhead in throughput benchmarks with random prompts. Not having this flag means slightly different benchmarking conditions vs. all other B200 SGLang scripts.

    --disable-radix-cache \

3. Missing --data-parallel-size=1 flag (Minor)

All other B200 SGLang benchmarks explicitly set --data-parallel-size=1. While this is likely the default, explicitly setting it maintains consistency and makes the configuration explicit.


Observations (Non-blocking)

Script is intentionally minimal

The script skips the TP-conditional tuning logic from dsr1_fp8_b200.sh (scheduler recv intervals, cuda-graph-max-bs, max-running-requests, chunked-prefill-size, etc.). This is reasonable for a first BF16 baseline — you'll want to tune these once initial numbers are in.

Other flags to consider adding later for performance tuning:

  • --cuda-graph-max-bs / --max-running-requests — memory optimization, prevents over-allocation
  • --chunked-prefill-size / --max-prefill-tokens — controls prefill chunking
  • --stream-interval — reduces streaming overhead
  • --enable-flashinfer-allreduce-fusion — may help with TP=8

conc-end: 64 for 8k1k may be aggressive

The DSR1 FP8 config uses conc-end: 4 for 8k1k at TP=8 (.github/configs/nvidia-master.yaml:1782). This Qwen3.5 config uses conc-end: 64 for all seq-lens including 8k1k. With BF16 weights (which use more memory than FP8), high concurrency at 8k input length may cause OOM. Consider reducing 8k1k conc-end to something more conservative initially.

perf-changelog.yaml and nvidia-master.yaml look correct

  • Config key qwen3.5-bf16-b200-sglang follows the {model}-{precision}-{runner}-{framework} convention
  • model-prefix: qwen3.5 correctly maps to benchmarks/qwen3.5_bf16_b200.sh via the {model-prefix}_{precision}_{runner}.sh naming pattern
  • PR link correctly set to #704
  • perf-changelog entry is well-structured

Summary

The EP_SIZE issue (#1) should be fixed before merge — it's a consistency requirement that every other B200 SGLang benchmark follows, and it matters for this MoE model. The --disable-radix-cache and --data-parallel-size=1 flags (#2, #3) are minor but would bring the script in line with other benchmarks. The rest is solid for an initial BF16 baseline.



@functionstackx
Contributor Author

  1. PR title says "STP only" but there's no speculative decoding configured
    The PR title says "(STP only)" but the script has no speculative decoding flags. The SGLang cookbook for Qwen3.5 recommends NEXTN speculative decoding:

--speculative-algo NEXTN --speculative-num-steps 3
--speculative-eagle-topk 1 --speculative-num-draft-tokens 4
If "STP" means "Standard Token Prediction" (i.e., no speculative decoding), this is fine and intentional — just noting the cookbook recommendation for a potential future MTP/STP variant.

@claude create/submit/open a new PR from main that updates pr-claude-review.yml, claude.yml, and AGENT.md saying that STP means single token prediction, i.e. no spec decode or MTP

@Klaud-Cold
Collaborator

Klaud-Cold commented Feb 17, 2026

Claude finished @functionstackx's task in 1m 58s


Adding STP definition to project docs

  • Read current files for context
  • Create new branch from main
  • Update AGENTS.md with STP/MTP terminology section
  • Update .github/workflows/claude-pr-review.yml with STP/MTP definitions
  • Update .github/workflows/claude.yml with STP/MTP definitions
  • Commit, push, and open PR

Done

Created PR #706

Added STP/MTP terminology definitions to three files:

  • AGENTS.md — New "Terminology" section before "Key Technologies"
  • .github/workflows/claude-pr-review.yml — New "Terminology" section in the review prompt
  • .github/workflows/claude.yml — Added to "Additional Knowledge" section

STP = Single Token Prediction (standard autoregressive decoding, no speculative decoding or MTP)
MTP = Multi-Token Prediction (multiple tokens per forward pass via speculative decoding like EAGLE/NEXTN)
Branch: docs/add-stp-definition

@functionstackx
Contributor Author

@functionstackx functionstackx merged commit 521d199 into main Feb 17, 2026
10 of 25 checks passed
@functionstackx functionstackx deleted the claude/issue-702-20260216-1859 branch February 17, 2026 04:11

Development

Successfully merging this pull request may close these issues.

add B200 Qwen SGLang BF16
