dsv4-b300-sglang: add conc=2048 recipe & MTP benchmark #1176
`dsv4_fp4_b300_sglang_mtp.sh` (new file, +149 lines):

```bash
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

# Tuning inputs from the matrix (all required):
#   TP           -- tensor parallel size -> --tp
#   EP_SIZE      -- expert parallel size -> --ep-size
#   DP_ATTENTION -- "true" enables --enable-dp-attention --dp-size $TP
#                   Also selects MoE backend / chunked-prefill-size:
#                     true  -> deepep + mega_moe + chunked-prefill 32768
#                     false -> flashinfer_mxfp4 + chunked-prefill 8192
#
# EAGLE/MTP speculative-decoding flags are hardcoded to (3, 1, 4): num-steps=3,
# eagle-topk=1, num-draft-tokens=4. Same chain across all CONC bands.
check_env_vars \
    MODEL \
    TP \
    EP_SIZE \
    DP_ATTENTION \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

# The B300 runner overrides MODEL to a pre-staged /data/models path, so skip
# `hf download`. Only fetch when MODEL looks like a HF repo ID.
if [[ "$MODEL" != /* ]]; then
    hf download "$MODEL"
fi

nvidia-smi

# Common SGLANG env vars (apply to every config).
export SGLANG_JIT_DEEPGEMM_PRECOMPILE=0
export SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT=1
export SGLANG_OPT_USE_JIT_NORM=1
export SGLANG_OPT_USE_JIT_INDEXER_METADATA=1
export SGLANG_OPT_USE_TOPK_V2=1
export SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1

# TODO(Cam): the deepseek-v4 sglang images install sglang editable at
# /workspace/sglang/python; prior sglang tags used /sgl-workspace/sglang.
# The runner mounts our repo at a non-/workspace path for these images so the
# editable install stays visible. Paths in this script are $PWD-relative for
# that reason. Drop the runner conditional once lmsys moves sglang back out of
# /workspace.

SERVER_LOG="$PWD/server.log"
PORT=${PORT:-8888}

echo "TP: $TP, EP_SIZE: $EP_SIZE, DP_ATTENTION: $DP_ATTENTION, CONC: $CONC, ISL: $ISL, OSL: $OSL"

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi

start_gpu_monitor --output "$PWD/gpu_metrics.csv"

# Recipe path is selected by DP_ATTENTION; MoE backend and chunked-prefill-size follow.
DEEPEP_CONFIG='{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}'

# MTP (EAGLE) speculative-decoding flags applied unconditionally on every recipe.
SPEC_FLAGS=(
    --speculative-algorithm EAGLE
    --speculative-num-steps 3
    --speculative-eagle-topk 1
    --speculative-num-draft-tokens 4
)

if [ "${DP_ATTENTION}" = "true" ]; then
    # Large-batch EP path: deepep + mega_moe.
    export SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1
    export SGLANG_OPT_FIX_HASH_MEGA_MOE=1
    export SGLANG_OPT_USE_FAST_MASK_EP=1
    export SGLANG_OPT_FIX_MEGA_MOE_MEMORY=1
    export SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=4096
    export SGLANG_OPT_FIX_NEXTN_MEGA_MOE=1
    export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=0
    PARALLEL_ARGS=(
        --dp-size "$TP"
        --enable-dp-attention
        --moe-a2a-backend deepep
        --deepep-config "$DEEPEP_CONFIG"
    )
    CHUNKED_PREFILL_SIZE=32768
else
    # Small-batch TP-only path: flashinfer_mxfp4.
    PARALLEL_ARGS=(
        --moe-runner-backend flashinfer_mxfp4
        --disable-flashinfer-autotune
    )
    CHUNKED_PREFILL_SIZE=8192
fi

# Print all SGLANG_* env vars to both the CI step log and server.log so the
# launch config is auditable from the result artifact alone.
{
    echo "=== SGLANG_* env vars at launch ==="
    env | grep -E '^SGLANG_' | sort
    echo "==================================="
} | tee "$SERVER_LOG"

set -x
PYTHONNOUSERSITE=1 sglang serve \
    --model-path $MODEL \
    --host 0.0.0.0 \
    --port $PORT \
    --trust-remote-code \
    --tp $TP \
    --ep-size $EP_SIZE \
    --chunked-prefill-size "$CHUNKED_PREFILL_SIZE" \
    --max-running-requests "$(( CONC * 3 / 2 > 8 ? CONC * 3 / 2 : 8 ))" \
    --mem-fraction-static 0.90 \
    --swa-full-tokens-ratio 0.1 \
    "${SPEC_FLAGS[@]}" \
    "${PARALLEL_ARGS[@]}" $EVAL_CONTEXT_ARGS >> $SERVER_LOG 2>&1 &

SERVER_PID=$!

wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts $((CONC * 10)) \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir "$PWD/"

if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

stop_gpu_monitor
set +x
```
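For illustration, a band-A 1k1k run of the new script could be driven manually like this; the `RANDOM_RANGE_RATIO` and `RESULT_FILENAME` values are placeholders, not taken from the actual CI matrix:

```bash
# Hypothetical manual invocation of the MTP script (the CI matrix normally
# exports these variables). MODEL is an HF repo ID here, so the script will
# run `hf download` instead of using a pre-staged /data/models path.
MODEL=deepseek-ai/DeepSeek-V4-Pro \
TP=8 EP_SIZE=1 DP_ATTENTION=false \
CONC=8 ISL=1024 OSL=1024 \
RANDOM_RANGE_RATIO=0.8 \
RESULT_FILENAME=dsv4_mtp_tp8_conc8_1k1k.json \
bash dsv4_fp4_b300_sglang_mtp.sh
```

With these values the serve command receives `--max-running-requests 12` (the larger of 8*3/2 and 8) and the benchmark sends 80 prompts at concurrency 8; if CONC were 2048, the same arithmetic would yield 3072 and 20480.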
`.github/configs/nvidia-master.yaml`:

```diff
@@ -1875,3 +1875,15 @@
     - "better performance for dp-attention"
     - "Recipes from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1174
+
+- config-keys:
+    - dsv4-fp4-b300-sglang-mtp
+  description:
+    - "Add DeepSeek-V4-Pro FP4 B300 SGLang benchmark with EAGLE/MTP speculative decoding"
+    - "Image: lmsysorg/sglang:deepseek-v4-b300@sha256:26e116bd211e300dbb76924d56c5cbe6cc3ee5ee2fe314859cb8774f5bc070f3 (pinned for deep_gemm transform_weights_for_mega_moe support; same digest as PR #1158)"
+    - "Model: deepseek-ai/DeepSeek-V4-Pro"
+    - "EAGLE/MTP flags hardcoded in script: num-steps=3, eagle-topk=1, num-draft-tokens=4"
+    - "Recipe (MoE backend, chunked-prefill) selected in script by dp-attn: TP-only + flashinfer_mxfp4 (small batch) vs DP-attn + deepep mega_moe (large batch)"
+    - "Three CONC bands: A=TP8 (1-8), B=TP4 (16-128), C=DP4 dp-attn (64-512); B/C overlap at conc 64,128"
+    - "Configs: 1k1k and 8k1k, no validation.py / launcher / yaml-field changes (knob-free)"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166
```
|
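For orientation only, the three CONC bands named in the entry above might look roughly like the snippet below. Band A's values are quoted verbatim in the review discussion further down; bands B and C, and the `ep`/`dp-attention` fields for them, are reconstructions from the one-line description rather than the actual config:

```yaml
# Rough sketch of the CONC bands (illustrative; only band A is attested):
- { tp: 8, ep: 1, conc-start: 1,  conc-end: 8 }                        # band A: TP8, conc 1-8
- { tp: 4, ep: 1, conc-start: 16, conc-end: 128 }                      # band B: TP4, conc 16-128
- { tp: 4, ep: 4, dp-attention: true, conc-start: 64, conc-end: 512 }  # band C: DP4 dp-attn, conc 64-512
```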
Comment on lines +1879 to +1889

Contributor

🟡 The newly added entry's `pr-link` points to the wrong PR.

Extended reasoning

What the bug is

The new changelog entry ends with `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166`. But this PR is #1176 ("dsv4-b300-sglang: add conc=2048 recipe & MTP benchmark"). PR #1166 is a different, unrelated PR. The link should be `https://github.com/SemiAnalysisAI/InferenceX/pull/1176`.

Why this is wrong

The clear convention in this changelog is that each entry's `pr-link` references the PR that introduces it; a grep of the existing entries bears that out.

Impact

Documentation/metadata only — no runtime behavior is affected. However, once merged, the changelog will permanently misattribute the introduction of the `dsv4-fp4-b300-sglang-mtp` config to PR #1166.

How to fix

Change line 1889 from:

`pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166`

to:

`pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1176`
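For concreteness, the corrected tail of the new entry would differ only in the `pr-link`; everything else stays exactly as added in this PR (description lines elided here for brevity):

```yaml
- config-keys:
    - dsv4-fp4-b300-sglang-mtp
  description:
    # ... description lines unchanged ...
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1176
```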
🟡 The new MTP script hardcodes `--swa-full-tokens-ratio 0.1` at line 121, while the parent `dsv4_fp4_b300_sglang.sh` uses an ISL-conditional that picks 0.5 for ISL=1024, with the explicit comment that "0.5 was tuned empirically for the 1k1k recipe, while 0.1 is the cookbook default". Since the MTP YAML exercises both 1024/1024 and 8192/1024, the 1k1k MTP run silently drops the empirical tuning — please either mirror the conditional or add a comment explaining why MTP intentionally diverges.

Extended reasoning
The divergence

The parent `benchmarks/single_node/dsv4_fp4_b300_sglang.sh` (lines ~98-104 after this PR) sets `SWA_FULL_TOKENS_RATIO` based on ISL: 0.5 when ISL is 1024, 0.1 otherwise. The new `dsv4_fp4_b300_sglang_mtp.sh` at line 121 instead hardcodes `--swa-full-tokens-ratio 0.1`, with no ISL branching.
Why this matters

The new MTP YAML config at `.github/configs/nvidia-master.yaml` (lines ~1885-1893) exercises both `isl: 1024` and `isl: 8192` sequence-length configs. So the 1k1k MTP run will use the cookbook default 0.1 instead of the empirically tuned 0.5 that the parent script's author specifically called out as needed for B300 cache headroom on 1k inputs.

Step-by-step proof

1. The matrix schedules `dsv4-fp4-b300-sglang-mtp` with the band-A entry `{ tp: 8, ep: 1, conc-start: 1, conc-end: 8 }` against `isl: 1024, osl: 1024`.
2. That run invokes `dsv4_fp4_b300_sglang_mtp.sh` with `ISL=1024`.
3. The script passes `--swa-full-tokens-ratio 0.1` to `sglang serve`.
4. The parent script, given `ISL=1024`, would have passed `0.5` per its empirical tuning comment.

Addressing the refutation

The refutation argues the MTP script is a deliberately distinct recipe, that EAGLE adds memory overhead favoring smaller SWA reservation, and that low concurrencies (band A only, per bug_002) make SWA pressure minimal. These are reasonable hypotheses, but they are hypotheses — the parent script's comment is an empirical claim the author already calibrated, and the MTP recipe inherits the rest of the parent's tuning surface (same model, same B300 hardware, same ISL=1024). If the divergence is intentional (e.g., EAGLE memory overhead changes the optimal SWA tradeoff), a one-line comment would document that and prevent future readers from assuming the omission was an oversight. The fact that the recipe author left no such comment, while the parent author did leave one specifically calling out the 1k vs 8k distinction, is exactly the signal that this deserves to be flagged.

Suggested fix

Either mirror the parent's ISL-conditional (a sketch follows below) and substitute `--swa-full-tokens-ratio "$SWA_FULL_TOKENS_RATIO"` at line 121, or add a brief comment at line 121 explaining why MTP intentionally uses the cookbook default (e.g., "EAGLE draft-model + verification overhead favors smaller SWA reservation, so we use the cookbook default instead of the parent's 1k1k empirical 0.5").
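A minimal sketch of that mirroring, reconstructed from the reviewer's description rather than from the actual parent source (the variable name and the 0.5/0.1 split come from the comment above; the surrounding lines are illustrative):

```bash
# Hypothetical mirror of the parent script's ISL-conditional SWA ratio:
# 0.5 was reportedly tuned empirically for the 1k1k recipe; 0.1 is the
# cookbook default used for longer inputs.
if [ "$ISL" -eq 1024 ]; then
    SWA_FULL_TOKENS_RATIO=0.5
else
    SWA_FULL_TOKENS_RATIO=0.1
fi
```

The serve invocation would then pass `--swa-full-tokens-ratio "$SWA_FULL_TOKENS_RATIO"` in place of the hardcoded 0.1.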