26 commits
- 26e540d feat: add DeepSeek-V4-Flash FP4 B300 SGLang benchmark (cquil11, Apr 24, 2026)
- efdc8ba fix: switch dsv4-fp4-b300-sglang to Pro + Max-Throughput recipe (cquil11, Apr 24, 2026)
- cc35a12 chore: sync launch_b200-dgxc-slurm.sh cache mount from claude/add-dsv… (cquil11, Apr 24, 2026)
- 404a097 fix: restore trailing whitespace stripped from glm5.1 changelog entry (cquil11, Apr 24, 2026)
- 97a488e chore: add flock-guarded squash import to B300 runner (cquil11, Apr 24, 2026)
- 106deea fix: drop ENROOT_CACHE_PATH override from B300 runner (cquil11, Apr 24, 2026)
- 4bb1f1a chore: point B300 runner at shared gharunners/{squash,hf-hub-cache} (cquil11, Apr 24, 2026)
- 744c5a0 fix: move enroot import out of srun to avoid pyxis namespace collision (cquil11, Apr 24, 2026)
- d003c59 fix: wipe stale pyxis scratch dirs for this JOB_ID before benchmark srun (cquil11, Apr 24, 2026)
- f00629f Revert: drop all B300 runner changes, mirror #1128's approach (cquil11, Apr 24, 2026)
- 570b0eb runner: add head-node flock-guarded squash import on B300 (cquil11, Apr 24, 2026)
- 864419d fix: mount at /ix and clear baked-in CUDA_VISIBLE_DEVICES (cquil11, Apr 24, 2026)
- 5d93913 Merge branch 'main' into chore/dsv4-sgl-b300 (cquil11, Apr 24, 2026)
- 9453676 runner: use /data/models pre-staged path for dsv4 on B300 (cquil11, Apr 24, 2026)
- 5db43b8 fix: switch B300 dsv4 sglang to bw-ultra-compiled image (cquil11, Apr 24, 2026)
- c060c58 fix: switch B300 dsv4 sglang image to yhyang201/sglang-b300:v3 (cquil11, Apr 24, 2026)
- 08edf26 update b300 (cquil11, Apr 24, 2026)
- a699ca0 feat(dsv4-fp4-b300-sglang): pick recipe by CONC; split search-space (cquil11, Apr 24, 2026)
- d35696c update b300 (cquil11, Apr 24, 2026)
- bc43672 feat(dsv4-fp4-b300-sglang): hardcode low-latency recipe at every CONC (cquil11, Apr 24, 2026)
- 87c8376 trigger test check (cquil11, Apr 25, 2026)
- aa423f0 Merge branch 'main' into chore/dsv4-sgl-b300 (cquil11, Apr 25, 2026)
- 90e8f3d Revert "feat(dsv4-fp4-b300-sglang): hardcode low-latency recipe at ev… (cquil11, Apr 25, 2026)
- 8e3158d trigger test check (cquil11, Apr 25, 2026)
- 78c2dae Add B300 config: dsv4-fp4-sglang-mtp (cquil11, Apr 25, 2026)
- eb35ba1 Tighten dsv4 b300 sglang yaml: drop MTP max-throughput rows, fix stal… (cquil11, Apr 25, 2026)
66 changes: 54 additions & 12 deletions .github/configs/nvidia-master.yaml
@@ -1832,9 +1832,10 @@ dsr1-fp8-b300-sglang:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 4 }
- { tp: 4, ep: 1, conc-start: 4, conc-end: 32 }

# NOTE: Low-latency fallback (TP=8, EP=1, no DP-attn, no DeepEP) while
# the DeepEP FP8 weight-postprocess path is broken for DeepSeek-V4-Pro
# on B300. Re-introduce balanced/max-throughput rows once fixed upstream.
# NOTE: https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4
# lists B200 (not B300) as the Blackwell target; we reuse the B200 Pro FP4
# recipes on B300 until a B300-specific recipe ships. Prefix caching is
# disabled. Parallelisms mirror dsv4-fp4-b200-sglang.
dsv4-fp4-b300-sglang:
image: lmsysorg/sglang:deepseek-v4-b300
model: deepseek-ai/DeepSeek-V4-Pro
@@ -1843,22 +1844,63 @@ dsv4-fp4-b300-sglang:
precision: fp4
framework: sglang
multinode: false
# TODO(Cam): low-latency recipe only (TP-only, no DP-attn, no DeepEP)
# while the DeepEP FP8 weight-postprocess path is broken for this
# checkpoint on B300 (RuntimeError: Recipe must be a list/tuple of 3
# integers. raised from sglang.srt.layers.quantization.fp8
# .process_weights_after_loading_block_quant). Full concurrency sweep
# retained; revert to the recipe-per-CONC split on chore/dsv4-sgl-b300
# once sglang can load the checkpoint under --moe-a2a-backend deepep.
# Three recipes from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4
# are selected inside benchmarks/single_node/dsv4_fp4_b300.sh by CONC:
# low-latency (CONC <= 32): TP-only
# balanced (32 < CONC <= 128): + DP-attn
# max-throughput (CONC > 128): + DP-attn
# Split so result filenames (ep=, dpa=) accurately reflect the recipe.
# ep is implicit in sglang: --moe-a2a-backend deepep forces ep_size=tp_size,
# while low-latency leaves ep_size at the default of 1.
seq-len-configs:
- isl: 1024
osl: 1024
search-space:
# low-latency
- { tp: 8, ep: 1, conc-start: 4, conc-end: 32 }
# balanced
- { tp: 8, ep: 8, dp-attn: true, conc-start: 64, conc-end: 128 }
# max-throughput
- { tp: 8, ep: 8, dp-attn: true, conc-start: 256, conc-end: 1024 }
- isl: 8192
osl: 1024
search-space:
# low-latency
- { tp: 8, ep: 1, conc-start: 4, conc-end: 32 }
# balanced
- { tp: 8, ep: 8, dp-attn: true, conc-start: 64, conc-end: 128 }
# max-throughput
- { tp: 8, ep: 8, dp-attn: true, conc-start: 256, conc-end: 512 }
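The CONC thresholds described in the config comment can be sanity-checked against the search-space rows with a small standalone sketch (a hypothetical helper, not repo code; thresholds and row bounds copied from the config above):

```python
# Recipe selection by concurrency, mirroring the config comment:
# low-latency (CONC <= 32), balanced (32 < CONC <= 128),
# max-throughput (CONC > 128).
def pick_recipe(conc: int) -> str:
    if conc <= 32:
        return "low-latency"
    if conc <= 128:
        return "balanced"
    return "max-throughput"

# Each row's conc-start..conc-end must fall inside exactly one recipe band,
# so result filenames (ep=, dpa=) stay truthful for every point in the sweep.
rows_1k1k = [
    ("low-latency", 4, 32),
    ("balanced", 64, 128),
    ("max-throughput", 256, 1024),
]
for expected, lo, hi in rows_1k1k:
    assert pick_recipe(lo) == pick_recipe(hi) == expected
```

The 8k1k rows differ only in the max-throughput upper bound (512), which sits in the same band, so the same check passes for both seq-len configs.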

dsv4-fp4-b300-sglang-mtp:
image: lmsysorg/sglang:deepseek-v4-b300
model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: b300
precision: fp4
framework: sglang
multinode: false
# Mirrors dsv4-fp4-b300-sglang's low-latency and balanced rows with EAGLE
# MTP enabled per https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4:
# low-latency (CONC <= 32): EAGLE 3 steps / 4 draft tokens
# balanced (32 < CONC <= 128): EAGLE 1 step / 2 draft tokens
# Max-throughput is intentionally omitted -- the cookbook says MTP off
# at saturation because the verify step costs more than it saves.
seq-len-configs:
- isl: 1024
osl: 1024
search-space:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 1024 }
# low-latency
- { tp: 8, ep: 1, conc-start: 4, conc-end: 32, spec-decoding: mtp }
# balanced
- { tp: 8, ep: 8, dp-attn: true, conc-start: 64, conc-end: 128, spec-decoding: mtp }
- isl: 8192
osl: 1024
search-space:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 512 }
# low-latency
- { tp: 8, ep: 1, conc-start: 4, conc-end: 32, spec-decoding: mtp }
# balanced
- { tp: 8, ep: 8, dp-attn: true, conc-start: 64, conc-end: 128, spec-decoding: mtp }
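The MTP config's EAGLE schedule can be sketched the same way (a hypothetical mapping, not repo code; step/draft-token counts taken from the comment above, with max-throughput deliberately absent because MTP is off at saturation):

```python
# EAGLE/MTP settings per recipe, per the dsv4-fp4-b300-sglang-mtp comment.
MTP_RECIPES = {
    "low-latency": {"speculative_num_steps": 3, "speculative_num_draft_tokens": 4},
    "balanced": {"speculative_num_steps": 1, "speculative_num_draft_tokens": 2},
}

def mtp_params(conc: int) -> dict:
    if conc <= 32:
        return MTP_RECIPES["low-latency"]
    if conc <= 128:
        return MTP_RECIPES["balanced"]
    # Search-space caps CONC at 128, so this branch should be unreachable.
    raise ValueError("MTP is off at saturation; no recipe for CONC > 128")
```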

qwen3.5-bf16-b200-sglang:
image: lmsysorg/sglang:nightly-dev-20260216-d3bae71e
129 changes: 129 additions & 0 deletions benchmarks/single_node/dsv4_fp4_b300.sh
@@ -0,0 +1,129 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
CONC \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

# The B300 runner overrides MODEL to a pre-staged /data/models path, so skip
# `hf download`. Only fetch when MODEL looks like a HF repo ID.
if [[ "$MODEL" != /* ]]; then
hf download "$MODEL"
fi

nvidia-smi

export SGLANG_JIT_DEEPGEMM_PRECOMPILE=0

# The deepseek-v4 sglang images (lmsysorg/sglang:deepseek-v4-blackwell and its
# B300 forks) bake CUDA_VISIBLE_DEVICES=4,5,6,7 into their ENV, which masks half
# of the 8 GPUs Slurm allocates us. Clear it so TP=8 can bind to all ranks.
unset CUDA_VISIBLE_DEVICES

# TODO(Cam): the deepseek-v4 sglang images install sglang editable at
# /workspace/sglang/python; prior sglang tags used /sgl-workspace/sglang.
# The runner mounts our repo at a non-/workspace path for these images so the
# editable install stays visible. Paths in this script are $PWD-relative for
# that reason. Drop the runner conditional once lmsys moves sglang back out of
# /workspace.

SERVER_LOG="$PWD/server.log"
PORT=${PORT:-8888}

echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL"

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi

start_gpu_monitor --output "$PWD/gpu_metrics.csv"

# Three recipes from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4
# (spec-decoding / MTP and prefix-caching flags dropped for the baseline):
# - low-latency (CONC <= 32): TP-only, chunked-prefill, disable autotune
# - balanced (32 < CONC <= 128): + DP-attn, max-running-requests=128
# - max-throughput (CONC > 128): + DP-attn, max-running-requests=256
DEEPEP_CONFIG='{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}'

if [[ $CONC -le 32 ]]; then
RECIPE=low-latency
RECIPE_FLAGS=(
--moe-runner-backend flashinfer_mxfp4
--chunked-prefill-size 4096
--disable-flashinfer-autotune
--mem-fraction-static 0.82
)
elif [[ $CONC -le 128 ]]; then
RECIPE=balanced
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256
RECIPE_FLAGS=(
--dp-size "$TP"
--enable-dp-attention
--moe-a2a-backend deepep
--deepep-config "$DEEPEP_CONFIG"
--mem-fraction-static 0.82
--cuda-graph-max-bs 64
--max-running-requests 128
)
else
RECIPE=max-throughput
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256
RECIPE_FLAGS=(
--dp-size "$TP"
--enable-dp-attention
--moe-a2a-backend deepep
--deepep-config "$DEEPEP_CONFIG"
--mem-fraction-static 0.82
--cuda-graph-max-bs 64
--max-running-requests 256
)
fi
echo "Recipe: $RECIPE (CONC=$CONC)"

set -x
PYTHONNOUSERSITE=1 sglang serve \
--model-path $MODEL \
--host 0.0.0.0 \
--port $PORT \
--trust-remote-code \
--tp $TP \
--disable-radix-cache \
"${RECIPE_FLAGS[@]}" $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts $((CONC * 10)) \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir "$PWD/"

if [ "${RUN_EVAL}" = "true" ]; then
run_eval --framework lm-eval --port "$PORT"
append_lm_eval_summary
fi

stop_gpu_monitor
set +x
130 changes: 130 additions & 0 deletions benchmarks/single_node/dsv4_fp4_b300_mtp.sh
@@ -0,0 +1,130 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

Comment on lines +1 to +4
🔴 The new benchmarks/single_node/dsv4_fp4_b300.sh is dead code for the dsv4-fp4-b300-sglang config: benchmarks/single_node/dsv4_fp4_b300_sglang.sh (added by PR #1146) still exists with the hardcoded low-latency recipe, and runners/launch_b300-nv.sh:267-272 picks the framework-tagged file first and only falls back to the bare name when missing. The newly added balanced and max-throughput rows (tp=8 ep=8 dp-attn=true, conc 64-1024 / 64-512) will execute the stale low-latency recipe but produce result filenames tagged ep=8/dpa=true, mislabelling the data. Fix: rename dsv4_fp4_b300.sh → dsv4_fp4_b300_sglang.sh (overwriting), or delete the stale dsv4_fp4_b300_sglang.sh. The MTP variant is unaffected because dsv4_fp4_b300_sglang_mtp.sh does not exist.

Extended reasoning...

What goes wrong

This PR adds two new benchmark scripts:

  • benchmarks/single_node/dsv4_fp4_b300.sh — recipe-per-CONC dispatch (low-latency / balanced / max-throughput)
  • benchmarks/single_node/dsv4_fp4_b300_mtp.sh — MTP variant of the same dispatch

It also expands the dsv4-fp4-b300-sglang YAML from a single TP-only (tp:8, ep:1, conc 4-1024/4-512) row into three recipes per seq-len, with the inline comment "are selected inside benchmarks/single_node/dsv4_fp4_b300.sh by CONC".

The problem: benchmarks/single_node/dsv4_fp4_b300_sglang.sh already exists on main. It was created by PR #1146 (rename of the old dsv4_fp4_b300.sh to add the framework suffix), and was later edited to hardcode the low-latency recipe at every CONC (with a TODO(Cam) comment explicitly pointing at this branch chore/dsv4-sgl-b300 as the place to restore CONC dispatch). The branch's revert commit 90e8f3d recreated dsv4_fp4_b300.sh (the old, pre-rename path) but did not touch the framework-suffixed file — so after this PR merges, both files coexist on disk with different contents.

Why the new script never runs

runners/launch_b300-nv.sh:267-272:

BENCH_BASE="benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_b300"
BENCH_SCRIPT="${BENCH_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh"
if [[ ! -f "$BENCH_SCRIPT" ]]; then
    LEGACY_FW_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
    BENCH_SCRIPT="${BENCH_BASE}${LEGACY_FW_SUFFIX}${SPEC_SUFFIX}.sh"
fi

The framework-tagged path is preferred; the bare-name fallback only fires when the tagged file is missing.

Step-by-step proof for one row

Take the new balanced row { tp: 8, ep: 8, dp-attn: true, conc-start: 64, conc-end: 128 } from the dsv4-fp4-b300-sglang 1k1k search-space:

  1. The runner is invoked with EXP_NAME=dsv4_1k1k (model-prefix dsv4, seq tag 1k1k), PRECISION=fp4, FRAMEWORK=sglang, no spec → SPEC_SUFFIX="".
  2. BENCH_BASE = "benchmarks/single_node/dsv4_fp4_b300".
  3. BENCH_SCRIPT = "benchmarks/single_node/dsv4_fp4_b300_sglang.sh" — and that file exists (verified on disk: 3008 bytes, blob c9fb238).
  4. The fallback to dsv4_fp4_b300.sh is not taken.
  5. The hardcoded RECIPE=low-latency block runs with --tp 8 (no --dp-size, no --enable-dp-attention, no --moe-a2a-backend deepep), regardless of CONC.
  6. But the result filename is composed from the YAML row's ep=8/dpa=true tags, so the output is mislabelled: low-latency numbers presented as balanced numbers.

The same logic applies to the max-throughput rows (CONC 256-1024 / 256-512). Effectively the entire YAML expansion in this PR is a no-op for the non-MTP config — the low-latency-only recipe runs at all concurrency points, and three out of four search-space rows produce misleading data.

Why MTP is unaffected

For the new dsv4-fp4-b300-sglang-mtp config, SPEC_SUFFIX=_mtp, so the runner first checks for dsv4_fp4_b300_sglang_mtp.sh — which does not exist — then falls back to dsv4_fp4_b300_mtp.sh, which is the new MTP script added in this PR. So MTP works correctly; only the non-MTP config is broken.

Fix

Either:

  • Rename benchmarks/single_node/dsv4_fp4_b300.sh → benchmarks/single_node/dsv4_fp4_b300_sglang.sh (overwriting the stale file), or
  • Delete the existing benchmarks/single_node/dsv4_fp4_b300_sglang.sh so the new bare-name file is found via fallback.

The first option is the more direct fix since it leaves the framework-tagged convention intact (matching what PR #1146 standardised).
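The lookup the comment walks through can be reproduced in isolation. This sketch only exercises the string expansions, with input values assumed from the step-by-step proof above (no repo files involved):

```shell
# Inputs as the runner would derive them for the non-MTP 1k1k config.
EXP_NAME=dsv4_1k1k
PRECISION=fp4
FRAMEWORK=sglang
SPEC_SUFFIX=""

# ${EXP_NAME%%_*} removes the longest suffix matching "_*", i.e. everything
# from the first underscore onward, leaving the model prefix "dsv4".
BENCH_BASE="benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_b300"

# The framework-tagged path is built first; the bare-name fallback in
# launch_b300-nv.sh only fires when this file is missing on disk.
BENCH_SCRIPT="${BENCH_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh"
echo "$BENCH_SCRIPT"
```

Since that framework-tagged file exists on main, the fallback to dsv4_fp4_b300.sh is never taken, which is exactly the dead-code path the comment describes.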

check_env_vars \
MODEL \
TP \
CONC \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

# The B300 runner overrides MODEL to a pre-staged /data/models path, so skip
# `hf download`. Only fetch when MODEL looks like a HF repo ID.
if [[ "$MODEL" != /* ]]; then
hf download "$MODEL"
fi

nvidia-smi

export SGLANG_JIT_DEEPGEMM_PRECOMPILE=0
# Cookbook note: "MTP currently requires SGLANG_ENABLE_SPEC_V2=1."
export SGLANG_ENABLE_SPEC_V2=1

# The deepseek-v4 sglang images (lmsysorg/sglang:deepseek-v4-blackwell and its
# B300 forks) bake CUDA_VISIBLE_DEVICES=4,5,6,7 into their ENV, which masks half
# of the 8 GPUs Slurm allocates us. Clear it so TP=8 can bind to all ranks.
unset CUDA_VISIBLE_DEVICES

# TODO(Cam): the deepseek-v4 sglang images install sglang editable at
# /workspace/sglang/python; prior sglang tags used /sgl-workspace/sglang.
# The runner mounts our repo at a non-/workspace path for these images so the
# editable install stays visible. Paths in this script are $PWD-relative for
# that reason. Drop the runner conditional once lmsys moves sglang back out of
# /workspace.

SERVER_LOG="$PWD/server.log"
PORT=${PORT:-8888}

echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL"

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi

start_gpu_monitor --output "$PWD/gpu_metrics.csv"

# Two recipes from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4
# with EAGLE / MTP enabled per the cookbook (prefix-caching dropped):
# - low-latency (CONC <= 32): TP-only + EAGLE 3 steps / 4 draft tokens
# - balanced (32 < CONC <= 128): + DP-attn + EAGLE 1 step / 2 draft tokens
# Max-throughput is intentionally not handled here -- the cookbook says
# MTP off at saturation because the verify step costs more than it saves.
# dsv4-fp4-b300-sglang-mtp's search-space caps CONC at 128 to match.
DEEPEP_CONFIG='{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}'

if [[ $CONC -le 32 ]]; then
RECIPE=low-latency
RECIPE_FLAGS=(
--moe-runner-backend flashinfer_mxfp4
--chunked-prefill-size 4096
--disable-flashinfer-autotune
--mem-fraction-static 0.82
--speculative-algo EAGLE
--speculative-num-steps 3
--speculative-eagle-topk 1
--speculative-num-draft-tokens 4
)
else
RECIPE=balanced
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256
RECIPE_FLAGS=(
--dp-size "$TP"
--enable-dp-attention
--moe-a2a-backend deepep
--deepep-config "$DEEPEP_CONFIG"
--mem-fraction-static 0.82
--cuda-graph-max-bs 64
--max-running-requests 128
--speculative-algo EAGLE
--speculative-num-steps 1
--speculative-eagle-topk 1
--speculative-num-draft-tokens 2
)
fi
echo "Recipe: $RECIPE (CONC=$CONC)"

set -x
PYTHONNOUSERSITE=1 sglang serve \
--model-path $MODEL \
--host 0.0.0.0 \
--port $PORT \
--trust-remote-code \
--tp $TP \
--disable-radix-cache \
"${RECIPE_FLAGS[@]}" $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts $((CONC * 10)) \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir "$PWD/" \
--use-chat-template

if [ "${RUN_EVAL}" = "true" ]; then
run_eval --framework lm-eval --port "$PORT"
append_lm_eval_summary
fi

stop_gpu_monitor
set +x
23 changes: 23 additions & 0 deletions perf-changelog.yaml
@@ -1812,3 +1812,26 @@
- "Topologies: low-conc 1p1d-dep8-tep8 (4 nodes, mirrored from NVIDIA srt-slurm PR #71 with offload kept and numa-bind dropped); mid 1p1d-dep8-dep16 (6 nodes) and high 3p1d-dep8-dep16 (10 nodes) hand-rolled, structurally derived from the kimi-k2.5 1k/1k pattern"
- "Recipes stored under benchmarks/multi_node/srt-slurm-recipes/ and overlaid onto the upstream srt-slurm checkout at runtime"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1129

- config-keys:
- dsv4-fp4-b300-sglang
description:
- "Add DeepSeek-V4-Pro FP4 B300 SGLang benchmark"
- "Image: lmsysorg/sglang:deepseek-v4-b300"
- "Model: deepseek-ai/DeepSeek-V4-Pro (FP4 MoE experts + FP8 attention/dense)"
- "Reuses the B200 Pro recipes from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4 on B300 until a B300-specific recipe ships; low-latency, balanced, and max-throughput are selected by CONC in benchmarks/single_node/dsv4_fp4_b300.sh"
- "Balanced and max-throughput rows use DP-attn + DeepEP; prefix caching disabled, no speculative decoding"
- "Parallelism (TP=8, with EP=8/dp-attn=true on the DeepEP rows) and the per-recipe concurrency split (4-32 / 64-128 / 256-1024 for 1k1k, 256-512 for 8k1k) mirror dsv4-fp4-b200-sglang"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1132

- config-keys:
- dsv4-fp4-b300-sglang-mtp
description:
- "Add DeepSeek-V4-Pro FP4 B300 SGLang MTP benchmark"
- "Image: lmsysorg/sglang:deepseek-v4-b300"
- "Model: deepseek-ai/DeepSeek-V4-Pro (FP4 MoE experts + FP8 attention/dense)"
- "Mirrors dsv4-fp4-b300-sglang's low-latency and balanced rows with EAGLE MTP enabled per https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4"
- "EAGLE 3 steps / 4 draft tokens on low-latency, 1 step / 2 draft tokens on balanced"
- "Max-throughput intentionally omitted: cookbook says MTP off at saturation because the verify step costs more than it saves"
- "SGLANG_ENABLE_SPEC_V2=1 required for MTP; --use-chat-template passed to bench_serving"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1151
Loading