20 commits
148223d - sglang dsv4 mtp (hnyls2002, Apr 26, 2026)
c883e8d - knob-driven recipe selection (hnyls2002, Apr 26, 2026)
3a49ed1 - self-contained mtp config; recipe via dp-attn (hnyls2002, Apr 26, 2026)
6f1b80a - add mtp_1 (1/1/2) variant (hnyls2002, Apr 26, 2026)
1b34a8d - knob-driven recipe selection (hnyls2002, Apr 26, 2026)
481482a - pin sglang image to mega_moe-capable digest (hnyls2002, Apr 26, 2026)
47fefec - drop mtp_1 knob; align with PR #1158 image digest (hnyls2002, Apr 26, 2026)
bfa254d - Merge branch 'main' into sglang-dsv4-MTP (Oseltamivir, Apr 26, 2026)
287ef26 - update nvidia-master.yaml (yhyang201, Apr 26, 2026)
e4ddf8f - Merge branch 'main' into sglang-dsv4-MTP (yhyang201, Apr 26, 2026)
f64505b - fix: restore trailing newline in perf-changelog.yaml (yhyang201, Apr 26, 2026)
4f468d6 - fix: remove --use-chat-template and floor --max-running-requests at 8 (yhyang201, Apr 26, 2026)
fc93e84 - perf-changelog: add dsv4-fp4-b300-sglang-mtp entry (yhyang201, Apr 26, 2026)
4155a49 - merge main and resolve perf-changelog.yaml conflict (yhyang201, Apr 26, 2026)
cea70e5 - dsv4-b300-sglang: add conc=2048 8k1k recipe with finite request-rate (yhyang201, Apr 26, 2026)
97a7e7d - dsv4-b300-sglang: temporarily keep only conc=2048 8k1k for experiment (yhyang201, Apr 26, 2026)
628e47b - Revert "dsv4-b300-sglang: temporarily keep only conc=2048 8k1k for experiment" (yhyang201, Apr 26, 2026)
1526e9d - Revert "dsv4-b300-sglang: add conc=2048 8k1k recipe with finite request-rate" (yhyang201, Apr 26, 2026)
14369b1 - dsv4-b300-sglang-mtp: tune EAGLE spec params from (3,1,4) to (4,1,5) (yhyang201, Apr 26, 2026)
42b294d - Revert "dsv4-b300-sglang-mtp: tune EAGLE spec params from (3,1,4) to (4,1,5)" (yhyang201, Apr 26, 2026)
30 changes: 30 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -1867,6 +1867,36 @@ dsv4-fp4-b300-sglang:
        - { tp: 4, ep: 1, conc-start: 32, conc-end: 32 }
        - { tp: 4, ep: 4, dp-attn: true, conc-start: 512, conc-end: 512 }

# DeepSeek-V4-Pro on B300 with EAGLE/MTP speculative decoding. Recipe is
# selected inside benchmarks/single_node/dsv4_fp4_b300_sglang_mtp.sh by
# DP_ATTENTION:
# dp-attn: false -> TP-only + flashinfer_mxfp4 + chunked-prefill 8192
# dp-attn: true -> DP-attn + deepep mega_moe + chunked-prefill 32768
# `ep` is implicit in sglang: --moe-a2a-backend deepep forces ep_size=tp_size,
# while the TP-only path leaves ep_size at the default of 1.
dsv4-fp4-b300-sglang-mtp:
  image: lmsysorg/sglang:deepseek-v4-b300@sha256:26e116bd211e300dbb76924d56c5cbe6cc3ee5ee2fe314859cb8774f5bc070f3
  model: deepseek-ai/DeepSeek-V4-Pro
  model-prefix: dsv4
  runner: b300
  precision: fp4
  framework: sglang
  multinode: false
  # Three CONC bands sweep with EAGLE/MTP (3/1/4) on top:
  #   A: TP=8 ep=1         -- conc 1-8    (latency-bound, full TP)
  #   B: TP=4 ep=1         -- conc 16-128 (TP-only, mid batch)
  #   C: TP=4 ep=4 dp-attn -- conc 64-512 (DP-attn + EP, large batch)
  # Overlap: B/C at conc 64,128 (TP-only vs DP-attn EP head-to-head).
  seq-len-configs:
    - isl: 1024
      osl: 1024
      search-space:
        - { tp: 8, ep: 1, conc-start: 1, conc-end: 8, spec-decoding: mtp }
    - isl: 8192
      osl: 1024
      search-space:
        - { tp: 8, ep: 1, conc-start: 1, conc-end: 8, spec-decoding: mtp }

qwen3.5-bf16-b200-sglang:
  image: lmsysorg/sglang:nightly-dev-20260216-d3bae71e
  model: Qwen/Qwen3.5-397B-A17B
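For reference, a single matrix point from this new config can be exercised locally by exporting the variables the script's check_env_vars guard requires. This is a sketch: the CI launcher's real invocation is not shown in this PR, and the RANDOM_RANGE_RATIO and RESULT_FILENAME values below are illustrative, not taken from the launcher.

  # Hypothetical local run of one band-A point (1k1k, conc=8). Variable names
  # come from check_env_vars in dsv4_fp4_b300_sglang_mtp.sh; values are
  # illustrative only.
  MODEL=deepseek-ai/DeepSeek-V4-Pro \
  TP=8 EP_SIZE=1 DP_ATTENTION=false \
  CONC=8 ISL=1024 OSL=1024 \
  RANDOM_RANGE_RATIO=1.0 \
  RESULT_FILENAME=dsv4_mtp_1k1k_conc8 \
  bash benchmarks/single_node/dsv4_fp4_b300_sglang_mtp.sh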
149 changes: 149 additions & 0 deletions benchmarks/single_node/dsv4_fp4_b300_sglang_mtp.sh
@@ -0,0 +1,149 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

# Tuning inputs from the matrix (all required):
# TP -- tensor parallel size -> --tp
# EP_SIZE -- expert parallel size -> --ep-size
# DP_ATTENTION -- "true" enables --enable-dp-attention --dp-size $TP
# Also selects MoE backend / chunked-prefill-size:
# true -> deepep + mega_moe + chunked-prefill 32768
# false -> flashinfer_mxfp4 + chunked-prefill 8192
#
# EAGLE/MTP speculative-decoding flags are hardcoded to (3, 1, 4): num-steps=3,
# eagle-topk=1, num-draft-tokens=4. Same chain across all CONC bands.
check_env_vars \
  MODEL \
  TP \
  EP_SIZE \
  DP_ATTENTION \
  CONC \
  ISL \
  OSL \
  RANDOM_RANGE_RATIO \
  RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

# The B300 runner overrides MODEL to a pre-staged /data/models path, so skip
# `hf download`. Only fetch when MODEL looks like a HF repo ID.
if [[ "$MODEL" != /* ]]; then
  hf download "$MODEL"
fi

nvidia-smi

# Common SGLANG env vars (apply to every config).
export SGLANG_JIT_DEEPGEMM_PRECOMPILE=0
export SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT=1
export SGLANG_OPT_USE_JIT_NORM=1
export SGLANG_OPT_USE_JIT_INDEXER_METADATA=1
export SGLANG_OPT_USE_TOPK_V2=1
export SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1

# TODO(Cam): the deepseek-v4 sglang images install sglang editable at
# /workspace/sglang/python; prior sglang tags used /sgl-workspace/sglang.
# The runner mounts our repo at a non-/workspace path for these images so the
# editable install stays visible. Paths in this script are $PWD-relative for
# that reason. Drop the runner conditional once lmsys moves sglang back out of
# /workspace.

SERVER_LOG="$PWD/server.log"
PORT=${PORT:-8888}

echo "TP: $TP, EP_SIZE: $EP_SIZE, DP_ATTENTION: $DP_ATTENTION, CONC: $CONC, ISL: $ISL, OSL: $OSL"

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
  setup_eval_context
  EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi

start_gpu_monitor --output "$PWD/gpu_metrics.csv"

# Recipe path is selected by DP_ATTENTION; MoE backend and chunked-prefill-size follow.
DEEPEP_CONFIG='{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}'

# MTP (EAGLE) speculative-decoding flags applied unconditionally on every recipe.
SPEC_FLAGS=(
  --speculative-algorithm EAGLE
  --speculative-num-steps 3
  --speculative-eagle-topk 1
  --speculative-num-draft-tokens 4
)

if [ "${DP_ATTENTION}" = "true" ]; then
  # Large-batch EP path: deepep + mega_moe.
  export SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1
  export SGLANG_OPT_FIX_HASH_MEGA_MOE=1
  export SGLANG_OPT_USE_FAST_MASK_EP=1
  export SGLANG_OPT_FIX_MEGA_MOE_MEMORY=1
  export SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=4096
  export SGLANG_OPT_FIX_NEXTN_MEGA_MOE=1
  export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=0
  PARALLEL_ARGS=(
    --dp-size "$TP"
    --enable-dp-attention
    --moe-a2a-backend deepep
    --deepep-config "$DEEPEP_CONFIG"
  )
  CHUNKED_PREFILL_SIZE=32768
else
  # Small-batch TP-only path: flashinfer_mxfp4.
  PARALLEL_ARGS=(
    --moe-runner-backend flashinfer_mxfp4
    --disable-flashinfer-autotune
  )
  CHUNKED_PREFILL_SIZE=8192
fi

# Print all SGLANG_* env vars to both the CI step log and server.log so the
# launch config is auditable from the result artifact alone.
{
echo "=== SGLANG_* env vars at launch ==="
env | grep -E '^SGLANG_' | sort
echo "==================================="
} | tee "$SERVER_LOG"

set -x
PYTHONNOUSERSITE=1 sglang serve \
  --model-path $MODEL \
  --host 0.0.0.0 \
  --port $PORT \
  --trust-remote-code \
  --tp $TP \
  --ep-size $EP_SIZE \
  --chunked-prefill-size "$CHUNKED_PREFILL_SIZE" \
  --max-running-requests "$(( CONC * 3 / 2 > 8 ? CONC * 3 / 2 : 8 ))" \
  --mem-fraction-static 0.90 \
  --swa-full-tokens-ratio 0.1 \
  "${SPEC_FLAGS[@]}" \
Comment on lines +119 to +122
🟡 The new MTP script hardcodes --swa-full-tokens-ratio 0.1 at line 121, while the parent dsv4_fp4_b300_sglang.sh uses an ISL-conditional that picks 0.5 for ISL=1024 with the explicit comment that '0.5 was tuned empirically for the 1k1k recipe, while 0.1 is the cookbook default'. Since the MTP YAML exercises both 1024/1024 and 8192/1024, the 1k1k MTP run silently drops the empirical tuning — please either mirror the conditional or add a comment explaining why MTP intentionally diverges.

Extended reasoning...

The divergence

The parent benchmarks/single_node/dsv4_fp4_b300_sglang.sh (lines ~98-104 after this PR) sets SWA_FULL_TOKENS_RATIO based on ISL:

# 1k inputs need more SWA cache headroom on B300 than 8k inputs do; 0.5 was
# tuned empirically for the 1k1k recipe, while 0.1 is the cookbook default.
if [[ "$ISL" == "1024" ]]; then
    SWA_FULL_TOKENS_RATIO=0.5
else
    SWA_FULL_TOKENS_RATIO=0.1
fi

The new dsv4_fp4_b300_sglang_mtp.sh at line 121 instead hardcodes:

--swa-full-tokens-ratio 0.1 \

with no ISL branching.

Why this matters

The new MTP YAML config at .github/configs/nvidia-master.yaml (lines ~1885-1893) exercises both isl: 1024 and isl: 8192 sequence-length configs. So the 1k1k MTP run will use the cookbook default 0.1 instead of the empirically tuned 0.5 that the parent script's author specifically called out as needed for B300 cache headroom on 1k inputs.

Step-by-step proof

  1. CI launches dsv4-fp4-b300-sglang-mtp with the band-A entry { tp: 8, ep: 1, conc-start: 1, conc-end: 8 } against isl: 1024, osl: 1024.
  2. The launcher invokes dsv4_fp4_b300_sglang_mtp.sh with ISL=1024.
  3. At line 121, the script unconditionally passes --swa-full-tokens-ratio 0.1 to sglang serve.
  4. The parent script, given the same ISL=1024, would have passed 0.5 per its empirical tuning comment.
  5. Result: the 1k1k MTP sweep runs with the cookbook default the parent script's author explicitly flagged as suboptimal for B300 1k inputs.

Addressing the refutation

The refutation argues the MTP script is a deliberately distinct recipe, that EAGLE adds memory overhead favoring smaller SWA reservation, and that low concurrencies (band A only, per bug_002) make SWA pressure minimal. These are reasonable hypotheses, but they are hypotheses — the parent script's comment is an empirical claim the author already calibrated, and the MTP recipe inherits the rest of the parent's tuning surface (same model, same B300 hardware, same ISL=1024). If the divergence is intentional (e.g., EAGLE memory overhead changes the optimal SWA tradeoff), a one-line comment would document that and prevent future readers from assuming the omission was an oversight. The fact that the recipe author left no such comment, while the parent author did leave one specifically calling out the 1k vs 8k distinction, is exactly the signal that this deserves to be flagged.

Suggested fix

Either mirror the parent's ISL-conditional:

if [[ "$ISL" == "1024" ]]; then
    SWA_FULL_TOKENS_RATIO=0.5
else
    SWA_FULL_TOKENS_RATIO=0.1
fi

and substitute --swa-full-tokens-ratio "$SWA_FULL_TOKENS_RATIO" at line 121, or add a brief comment at line 121 explaining why MTP intentionally uses the cookbook default (e.g., 'EAGLE draft-model + verification overhead favors smaller SWA reservation, so we use the cookbook default instead of the parent's 1k1k empirical 0.5').

"${PARALLEL_ARGS[@]}" $EVAL_CONTEXT_ARGS >> $SERVER_LOG 2>&1 &

SERVER_PID=$!

wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts $((CONC * 10)) \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir "$PWD/"

if [ "${RUN_EVAL}" = "true" ]; then
  run_eval --framework lm-eval --port "$PORT"
  append_lm_eval_summary
fi

stop_gpu_monitor
set +x
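
As a quick check of the --max-running-requests expression above (commit 4f468d6 introduced the floor of 8), bash integer arithmetic at a few sweep points gives:

  # Same ternary as the serve flag: max(CONC * 3 / 2, 8) with integer division.
  for CONC in 1 2 8 32 512; do
    echo "CONC=$CONC -> $(( CONC * 3 / 2 > 8 ? CONC * 3 / 2 : 8 ))"
  done
  # CONC=1 -> 8, CONC=2 -> 8, CONC=8 -> 12, CONC=32 -> 48, CONC=512 -> 768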
12 changes: 12 additions & 0 deletions perf-changelog.yaml
@@ -1875,3 +1875,15 @@
- "better performance for dp-attention"
- "Recipes from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1174

- config-keys:
    - dsv4-fp4-b300-sglang-mtp
  description:
    - "Add DeepSeek-V4-Pro FP4 B300 SGLang benchmark with EAGLE/MTP speculative decoding"
    - "Image: lmsysorg/sglang:deepseek-v4-b300@sha256:26e116bd211e300dbb76924d56c5cbe6cc3ee5ee2fe314859cb8774f5bc070f3 (pinned for deep_gemm transform_weights_for_mega_moe support; same digest as PR #1158)"
    - "Model: deepseek-ai/DeepSeek-V4-Pro"
    - "EAGLE/MTP flags hardcoded in script: num-steps=3, eagle-topk=1, num-draft-tokens=4"
    - "Recipe (MoE backend, chunked-prefill) selected in script by dp-attn: TP-only + flashinfer_mxfp4 (small batch) vs DP-attn + deepep mega_moe (large batch)"
    - "Three CONC bands: A=TP8 (1-8), B=TP4 (16-128), C=DP4 dp-attn (64-512); B/C overlap at conc 64,128"
    - "Configs: 1k1k and 8k1k, no validation.py / launcher / yaml-field changes (knob-free)"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166
Comment on lines +1879 to +1889
🟡 The newly added dsv4-fp4-b300-sglang-mtp changelog entry sets pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166, but this is PR #1176; PR #1166 is an unrelated open PR. Likely a digit transposition — please change the link to /pull/1176 so the changelog correctly attributes the entry once merged.

Extended reasoning...

What the bug is

perf-changelog.yaml lines 1879-1889 add a new entry for the dsv4-fp4-b300-sglang-mtp config (the new MTP benchmark introduced by this PR). Its pr-link is set to:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166

But this PR is #1176 ("dsv4-b300-sglang: add conc=2048 recipe & MTP benchmark"). PR #1166 is a different, unrelated PR. The link should be /pull/1176.

Why this is wrong

The clear convention in perf-changelog.yaml is that pr-link points to the PR that introduces the entry — every prior entry in the file follows this pattern (e.g., the entry above this one points to PR #1174, which was the merged PR introducing it; the entry being added here is the one this PR adds and so should point to #1176).

A grep of perf-changelog.yaml shows pull/1166 only appears at this new entry on line 1889, and pull/1176 does not appear anywhere in the file — so the typo is not duplicated and there is no other entry that should already be carrying #1176.

Step-by-step proof

  1. Open the PR diff for perf-changelog.yaml. The entire 11-line block at lines 1879-1889 is being newly added (no - lines, only +), and the new config-keys is dsv4-fp4-b300-sglang-mtp.
  2. Look up the config: dsv4-fp4-b300-sglang-mtp is also added by this same PR in .github/configs/nvidia-master.yaml (lines 1875+) and by the new script benchmarks/single_node/dsv4_fp4_b300_sglang_mtp.sh. The description in the changelog ("Three CONC bands: A=TP8 (1-8), B=TP4 (16-128), C=DP4 dp-attn (64-512)", EAGLE/MTP flags 3/1/4, image digest sha256:26e116bd…f5bc070f3) matches exactly what this PR adds in those files. So the entry is unambiguously the one being introduced by this PR.
  3. The PR number is 1176 (per the PR metadata). PR #1166 is a different PR: an open PR titled 'sglang dsv4 MTP', not yet merged. They share the digits "116", consistent with a transposition typo.
  4. Convention check: every other recent entry in this file uses pr-link matching the PR that adds it (e.g., the entry directly above this one at line 1877 is /pull/1174, matching the prior commit fc93e84; the commit message of fc93e84 itself even says "append the MTP config entry for PR #1166" — confirming 1166 was a placeholder/wrong number rather than an intentional cross-reference).

Impact

Documentation/metadata only — no runtime behavior is affected. However, once merged, the changelog will permanently misattribute the introduction of dsv4-fp4-b300-sglang-mtp to PR #1166 (an unrelated PR by a different author), so anyone clicking through to find the introducing PR will land on the wrong page. Easy to fix before merge.

How to fix

Change line 1889 from:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1166

to:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1176
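
The grep check described above is reproducible from the repo root; before the fix it should print a single pull/1166 hit at the new entry and nothing for pull/1176 (expected output per the comment, not independently verified):

  grep -n 'pull/1166\|pull/1176' perf-changelog.yaml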
