Add B300 config: dsv4-fp4-sglang-mtp #1151
Closed
26 commits
26e540d feat: add DeepSeek-V4-Flash FP4 B300 SGLang benchmark (cquil11)
efdc8ba fix: switch dsv4-fp4-b300-sglang to Pro + Max-Throughput recipe (cquil11)
cc35a12 chore: sync launch_b200-dgxc-slurm.sh cache mount from claude/add-dsv… (cquil11)
404a097 fix: restore trailing whitespace stripped from glm5.1 changelog entry (cquil11)
97a488e chore: add flock-guarded squash import to B300 runner (cquil11)
106deea fix: drop ENROOT_CACHE_PATH override from B300 runner (cquil11)
4bb1f1a chore: point B300 runner at shared gharunners/{squash,hf-hub-cache} (cquil11)
744c5a0 fix: move enroot import out of srun to avoid pyxis namespace collision (cquil11)
d003c59 fix: wipe stale pyxis scratch dirs for this JOB_ID before benchmark srun (cquil11)
f00629f Revert: drop all B300 runner changes, mirror #1128's approach (cquil11)
570b0eb runner: add head-node flock-guarded squash import on B300 (cquil11)
864419d fix: mount at /ix and clear baked-in CUDA_VISIBLE_DEVICES (cquil11)
5d93913 Merge branch 'main' into chore/dsv4-sgl-b300 (cquil11)
9453676 runner: use /data/models pre-staged path for dsv4 on B300 (cquil11)
5db43b8 fix: switch B300 dsv4 sglang to bw-ultra-compiled image (cquil11)
c060c58 fix: switch B300 dsv4 sglang image to yhyang201/sglang-b300:v3 (cquil11)
08edf26 update b300 (cquil11)
a699ca0 feat(dsv4-fp4-b300-sglang): pick recipe by CONC; split search-space (cquil11)
d35696c update b300 (cquil11)
bc43672 feat(dsv4-fp4-b300-sglang): hardcode low-latency recipe at every CONC (cquil11)
87c8376 trigger test check (cquil11)
aa423f0 Merge branch 'main' into chore/dsv4-sgl-b300 (cquil11)
90e8f3d Revert "feat(dsv4-fp4-b300-sglang): hardcode low-latency recipe at ev… (cquil11)
8e3158d trigger test check (cquil11)
78c2dae Add B300 config: dsv4-fp4-sglang-mtp (cquil11)
eb35ba1 Tighten dsv4 b300 sglang yaml: drop MTP max-throughput rows, fix stal… (cquil11)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,129 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| source "$(dirname "$0")/../benchmark_lib.sh" | ||
|
|
||
| check_env_vars \ | ||
| MODEL \ | ||
| TP \ | ||
| CONC \ | ||
| ISL \ | ||
| OSL \ | ||
| RANDOM_RANGE_RATIO \ | ||
| RESULT_FILENAME | ||
|
|
||
| if [[ -n "$SLURM_JOB_ID" ]]; then | ||
| echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" | ||
| fi | ||
|
|
||
| # The B300 runner overrides MODEL to a pre-staged /data/models path, so skip | ||
| # `hf download`. Only fetch when MODEL looks like a HF repo ID. | ||
| if [[ "$MODEL" != /* ]]; then | ||
| hf download "$MODEL" | ||
| fi | ||
|
|
||
| nvidia-smi | ||
|
|
||
| export SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 | ||
|
|
||
| # The deepseek-v4 sglang images (lmsysorg/sglang:deepseek-v4-blackwell and its | ||
| # B300 forks) bake CUDA_VISIBLE_DEVICES=4,5,6,7 into their ENV, which masks half | ||
| # of the 8 GPUs Slurm allocates us. Clear it so TP=8 can bind to all ranks. | ||
| unset CUDA_VISIBLE_DEVICES | ||
|
|
||
| # TODO(Cam): the deepseek-v4 sglang images install sglang editable at | ||
| # /workspace/sglang/python; prior sglang tags used /sgl-workspace/sglang. | ||
| # The runner mounts our repo at a non-/workspace path for these images so the | ||
| # editable install stays visible. Paths in this script are $PWD-relative for | ||
| # that reason. Drop the runner conditional once lmsys moves sglang back out of | ||
| # /workspace. | ||
|
|
||
| SERVER_LOG="$PWD/server.log" | ||
| PORT=${PORT:-8888} | ||
|
|
||
| echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL" | ||
|
|
||
| EVAL_CONTEXT_ARGS="" | ||
| if [ "${EVAL_ONLY}" = "true" ]; then | ||
| setup_eval_context | ||
| EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN" | ||
| fi | ||
|
|
||
| start_gpu_monitor --output "$PWD/gpu_metrics.csv" | ||
|
|
||
| # Three recipes from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4 | ||
| # (spec-decoding / MTP and prefix-caching flags dropped for the baseline): | ||
| # - low-latency (CONC <= 32): TP-only, chunked-prefill, disable autotune | ||
| # - balanced (32 < CONC <= 128): + DP-attn, max-running-requests=128 | ||
| # - max-throughput (CONC > 128): + DP-attn, max-running-requests=256 | ||
| DEEPEP_CONFIG='{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}' | ||
|
|
||
| if [[ $CONC -le 32 ]]; then | ||
| RECIPE=low-latency | ||
| RECIPE_FLAGS=( | ||
| --moe-runner-backend flashinfer_mxfp4 | ||
| --chunked-prefill-size 4096 | ||
| --disable-flashinfer-autotune | ||
| --mem-fraction-static 0.82 | ||
| ) | ||
| elif [[ $CONC -le 128 ]]; then | ||
| RECIPE=balanced | ||
| export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256 | ||
| RECIPE_FLAGS=( | ||
| --dp-size "$TP" | ||
| --enable-dp-attention | ||
| --moe-a2a-backend deepep | ||
| --deepep-config "$DEEPEP_CONFIG" | ||
| --mem-fraction-static 0.82 | ||
| --cuda-graph-max-bs 64 | ||
| --max-running-requests 128 | ||
| ) | ||
| else | ||
| RECIPE=max-throughput | ||
| export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256 | ||
| RECIPE_FLAGS=( | ||
| --dp-size "$TP" | ||
| --enable-dp-attention | ||
| --moe-a2a-backend deepep | ||
| --deepep-config "$DEEPEP_CONFIG" | ||
| --mem-fraction-static 0.82 | ||
| --cuda-graph-max-bs 64 | ||
| --max-running-requests 256 | ||
| ) | ||
| fi | ||
| echo "Recipe: $RECIPE (CONC=$CONC)" | ||
|
|
||
| set -x | ||
| PYTHONNOUSERSITE=1 sglang serve \ | ||
| --model-path $MODEL \ | ||
| --host 0.0.0.0 \ | ||
| --port $PORT \ | ||
| --trust-remote-code \ | ||
| --tp $TP \ | ||
| --disable-radix-cache \ | ||
| "${RECIPE_FLAGS[@]}" $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 & | ||
|
|
||
| SERVER_PID=$! | ||
|
|
||
| wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" | ||
|
|
||
| pip install -q datasets pandas | ||
|
|
||
| run_benchmark_serving \ | ||
| --model "$MODEL" \ | ||
| --port "$PORT" \ | ||
| --backend vllm \ | ||
| --input-len "$ISL" \ | ||
| --output-len "$OSL" \ | ||
| --random-range-ratio "$RANDOM_RANGE_RATIO" \ | ||
| --num-prompts $((CONC * 10)) \ | ||
| --max-concurrency "$CONC" \ | ||
| --result-filename "$RESULT_FILENAME" \ | ||
| --result-dir "$PWD/" | ||
|
|
||
| if [ "${RUN_EVAL}" = "true" ]; then | ||
| run_eval --framework lm-eval --port "$PORT" | ||
| append_lm_eval_summary | ||
| fi | ||
|
|
||
| stop_gpu_monitor | ||
| set +x | ||
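The CONC thresholds used by the script above can be exercised on their own. The sketch below assumes a hypothetical helper, `pick_recipe`, which is not part of the PR and only mirrors the `if`/`elif` chain; it does not launch a server.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): maps a concurrency value to the recipe
# name using the same thresholds as the if/elif chain in the script.
pick_recipe() {
  local conc=$1
  if [[ $conc -le 32 ]]; then
    echo low-latency
  elif [[ $conc -le 128 ]]; then
    echo balanced
  else
    echo max-throughput
  fi
}

# Print the recipe chosen at a few boundary values.
for c in 4 32 64 128 256; do
  echo "CONC=$c -> $(pick_recipe "$c")"
done
```

Checking the boundaries (32 and 128) this way makes it easy to confirm that the search-space rows in the YAML line up with the recipe each row is expected to run.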
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,130 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| source "$(dirname "$0")/../benchmark_lib.sh" | ||
|
|
||
| check_env_vars \ | ||
| MODEL \ | ||
| TP \ | ||
| CONC \ | ||
| ISL \ | ||
| OSL \ | ||
| RANDOM_RANGE_RATIO \ | ||
| RESULT_FILENAME | ||
|
|
||
| if [[ -n "$SLURM_JOB_ID" ]]; then | ||
| echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" | ||
| fi | ||
|
|
||
| # The B300 runner overrides MODEL to a pre-staged /data/models path, so skip | ||
| # `hf download`. Only fetch when MODEL looks like a HF repo ID. | ||
| if [[ "$MODEL" != /* ]]; then | ||
| hf download "$MODEL" | ||
| fi | ||
|
|
||
| nvidia-smi | ||
|
|
||
| export SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 | ||
| # Cookbook note: "MTP currently requires SGLANG_ENABLE_SPEC_V2=1." | ||
| export SGLANG_ENABLE_SPEC_V2=1 | ||
|
|
||
| # The deepseek-v4 sglang images (lmsysorg/sglang:deepseek-v4-blackwell and its | ||
| # B300 forks) bake CUDA_VISIBLE_DEVICES=4,5,6,7 into their ENV, which masks half | ||
| # of the 8 GPUs Slurm allocates us. Clear it so TP=8 can bind to all ranks. | ||
| unset CUDA_VISIBLE_DEVICES | ||
|
|
||
| # TODO(Cam): the deepseek-v4 sglang images install sglang editable at | ||
| # /workspace/sglang/python; prior sglang tags used /sgl-workspace/sglang. | ||
| # The runner mounts our repo at a non-/workspace path for these images so the | ||
| # editable install stays visible. Paths in this script are $PWD-relative for | ||
| # that reason. Drop the runner conditional once lmsys moves sglang back out of | ||
| # /workspace. | ||
|
|
||
| SERVER_LOG="$PWD/server.log" | ||
| PORT=${PORT:-8888} | ||
|
|
||
| echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL" | ||
|
|
||
| EVAL_CONTEXT_ARGS="" | ||
| if [ "${EVAL_ONLY}" = "true" ]; then | ||
| setup_eval_context | ||
| EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN" | ||
| fi | ||
|
|
||
| start_gpu_monitor --output "$PWD/gpu_metrics.csv" | ||
|
|
||
| # Two recipes from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4 | ||
| # with EAGLE / MTP enabled per the cookbook (prefix-caching dropped): | ||
| # - low-latency (CONC <= 32): TP-only + EAGLE 3 steps / 4 draft tokens | ||
| # - balanced (32 < CONC <= 128): + DP-attn + EAGLE 1 step / 2 draft tokens | ||
| # Max-throughput is intentionally not handled here -- the cookbook says | ||
| # MTP off at saturation because the verify step costs more than it saves. | ||
| # dsv4-fp4-b300-sglang-mtp's search-space caps CONC at 128 to match. | ||
| DEEPEP_CONFIG='{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}' | ||
|
|
||
| if [[ $CONC -le 32 ]]; then | ||
| RECIPE=low-latency | ||
| RECIPE_FLAGS=( | ||
| --moe-runner-backend flashinfer_mxfp4 | ||
| --chunked-prefill-size 4096 | ||
| --disable-flashinfer-autotune | ||
| --mem-fraction-static 0.82 | ||
| --speculative-algo EAGLE | ||
| --speculative-num-steps 3 | ||
| --speculative-eagle-topk 1 | ||
| --speculative-num-draft-tokens 4 | ||
| ) | ||
| else | ||
| RECIPE=balanced | ||
| export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256 | ||
| RECIPE_FLAGS=( | ||
| --dp-size "$TP" | ||
| --enable-dp-attention | ||
| --moe-a2a-backend deepep | ||
| --deepep-config "$DEEPEP_CONFIG" | ||
| --mem-fraction-static 0.82 | ||
| --cuda-graph-max-bs 64 | ||
| --max-running-requests 128 | ||
| --speculative-algo EAGLE | ||
| --speculative-num-steps 1 | ||
| --speculative-eagle-topk 1 | ||
| --speculative-num-draft-tokens 2 | ||
| ) | ||
| fi | ||
| echo "Recipe: $RECIPE (CONC=$CONC)" | ||
|
|
||
| set -x | ||
| PYTHONNOUSERSITE=1 sglang serve \ | ||
| --model-path $MODEL \ | ||
| --host 0.0.0.0 \ | ||
| --port $PORT \ | ||
| --trust-remote-code \ | ||
| --tp $TP \ | ||
| --disable-radix-cache \ | ||
| "${RECIPE_FLAGS[@]}" $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 & | ||
|
|
||
| SERVER_PID=$! | ||
|
|
||
| wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" | ||
|
|
||
| pip install -q datasets pandas | ||
|
|
||
| run_benchmark_serving \ | ||
| --model "$MODEL" \ | ||
| --port "$PORT" \ | ||
| --backend vllm \ | ||
| --input-len "$ISL" \ | ||
| --output-len "$OSL" \ | ||
| --random-range-ratio "$RANDOM_RANGE_RATIO" \ | ||
| --num-prompts $((CONC * 10)) \ | ||
| --max-concurrency "$CONC" \ | ||
| --result-filename "$RESULT_FILENAME" \ | ||
| --result-dir "$PWD/" \ | ||
| --use-chat-template | ||
|
|
||
| if [ "${RUN_EVAL}" = "true" ]; then | ||
| run_eval --framework lm-eval --port "$PORT" | ||
| append_lm_eval_summary | ||
| fi | ||
|
|
||
| stop_gpu_monitor | ||
| set +x | ||
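One consequence of the two-branch dispatch above is that any CONC above 32 selects the balanced recipe, including values past the 128 cap that the comment attributes to the search-space. A minimal sketch of an explicit guard, assuming a hypothetical helper `mtp_recipe_for` that is not part of the PR:

```bash
#!/usr/bin/env bash
# Hypothetical guard (not in the PR): the MTP script's else branch treats every
# CONC above 32 as "balanced", so reject values beyond the 128 cap explicitly.
mtp_recipe_for() {
  local conc=$1
  if (( conc <= 32 )); then
    echo low-latency
  elif (( conc <= 128 )); then
    echo balanced
  else
    echo "CONC=$conc exceeds the MTP cap of 128" >&2
    return 1
  fi
}

# A value inside the cap resolves to a recipe name.
mtp_recipe_for 64
```

Whether failing fast is preferable to silently reusing the balanced recipe is a design choice; the guard only makes the cap visible in the script rather than leaving it to the YAML.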
🔴 The new `benchmarks/single_node/dsv4_fp4_b300.sh` is dead code for the `dsv4-fp4-b300-sglang` config: `benchmarks/single_node/dsv4_fp4_b300_sglang.sh` (added by PR #1146) still exists with the hardcoded low-latency recipe, and `runners/launch_b300-nv.sh:267-272` picks the framework-tagged file first and only falls back to the bare name when missing. The newly added balanced and max-throughput rows (tp=8 ep=8 dp-attn=true, conc 64-1024 / 64-512) will execute the stale low-latency recipe but produce result filenames tagged `ep=8`/`dpa=true`, mislabelling the data. Fix: rename `dsv4_fp4_b300.sh` → `dsv4_fp4_b300_sglang.sh` (overwriting), or delete the stale `dsv4_fp4_b300_sglang.sh`. The MTP variant is unaffected because `dsv4_fp4_b300_sglang_mtp.sh` does not exist.

Extended reasoning...
What goes wrong
This PR adds two new benchmark scripts:
- `benchmarks/single_node/dsv4_fp4_b300.sh`: recipe-per-CONC dispatch (low-latency / balanced / max-throughput)
- `benchmarks/single_node/dsv4_fp4_b300_mtp.sh`: MTP variant of the same dispatch

It also expands the `dsv4-fp4-b300-sglang` YAML from a single TP-only `(tp:8, ep:1, conc 4-1024/4-512)` row into three recipes per seq-len, with the inline comment "are selected inside benchmarks/single_node/dsv4_fp4_b300.sh by CONC".

The problem: `benchmarks/single_node/dsv4_fp4_b300_sglang.sh` already exists on `main`. It was created by PR #1146 (rename of the old `dsv4_fp4_b300.sh` to add the framework suffix), and was later edited to hardcode the low-latency recipe at every CONC (with a `TODO(Cam)` comment explicitly pointing at this branch `chore/dsv4-sgl-b300` as the place to restore CONC dispatch). The branch's revert commit `90e8f3d` recreated `dsv4_fp4_b300.sh` (the old, pre-rename path) but did not touch the framework-suffixed file, so after this PR merges, both files coexist on disk with different contents.

Why the new script never runs
`runners/launch_b300-nv.sh:267-272`: the framework-tagged path is preferred; the bare-name fallback only fires when the tagged file is missing.
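The runner lines themselves are not reproduced in this capture, so the following is only a sketch of the lookup order described here, not the actual contents of `runners/launch_b300-nv.sh`. The function name `resolve_bench_script` and the way the two candidate paths are built are assumptions, inferred from the `BENCH_BASE`, `FRAMEWORK` and `SPEC_SUFFIX` values this review walks through.

```bash
#!/usr/bin/env bash
# Sketch only: the real lines 267-272 of runners/launch_b300-nv.sh are not
# shown here; this mirrors the behaviour described in the review.
resolve_bench_script() {
  local base=$1 framework=$2 spec_suffix=$3
  local tagged="${base}_${framework}${spec_suffix}.sh"   # framework-tagged name
  local bare="${base}${spec_suffix}.sh"                  # bare-name fallback
  if [[ -f "$tagged" ]]; then
    echo "$tagged"
  else
    echo "$bare"
  fi
}

# Scratch directory reproducing the file layout discussed in this review.
demo=$(mktemp -d)
touch "$demo/dsv4_fp4_b300.sh" "$demo/dsv4_fp4_b300_sglang.sh" "$demo/dsv4_fp4_b300_mtp.sh"

resolve_bench_script "$demo/dsv4_fp4_b300" sglang ""      # tagged file wins
resolve_bench_script "$demo/dsv4_fp4_b300" sglang "_mtp"  # falls back to bare name
```

With both `dsv4_fp4_b300.sh` and `dsv4_fp4_b300_sglang.sh` present, the first call resolves to the framework-tagged file, which is why the bare-name script is never reached for the non-MTP config.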
Step-by-step proof for one row
Take the new balanced row `{ tp: 8, ep: 8, dp-attn: true, conc-start: 64, conc-end: 128 }` from the `dsv4-fp4-b300-sglang` 1k1k search-space:
1. `EXP_NAME=dsv4_1k1k` (model-prefix `dsv4`, seq tag `1k1k`), `PRECISION=fp4`, `FRAMEWORK=sglang`, no spec → `SPEC_SUFFIX=""`.
2. `BENCH_BASE = "benchmarks/single_node/dsv4_fp4_b300"`.
3. `BENCH_SCRIPT = "benchmarks/single_node/dsv4_fp4_b300_sglang.sh"`, and that file exists (verified on disk: 3008 bytes, blob c9fb238).
4. `dsv4_fp4_b300.sh` (the bare-name fallback) is not taken.
5. The `RECIPE=low-latency` block runs with `--tp 8` (no `--dp-size`, no `--enable-dp-attention`, no `--moe-a2a-backend deepep`), regardless of CONC.
6. The result filename still carries the `ep=8`/`dpa=true` tags, so the output is mislabelled: low-latency numbers presented as balanced numbers.

The same logic applies to the max-throughput rows (CONC 256-1024 / 256-512). Effectively the entire YAML expansion in this PR is a no-op for the non-MTP config: the low-latency-only recipe runs at all concurrency points, and three out of four search-space rows produce misleading data.
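A mismatch of this kind could in principle be caught before a run. The sketch below is a hypothetical pre-flight check, not something the PR or the runner contains: `script_supports_dp_attn` simply greps a script for `--enable-dp-attention`, and the two scratch files stand in for the stale and the new benchmark scripts.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check (not in the PR): a row tagged dp-attn=true
# should only run a script that can actually pass --enable-dp-attention.
script_supports_dp_attn() {
  grep -q -- '--enable-dp-attention' "$1"
}

# Stand-ins for the stale (low-latency only) and new (CONC dispatch) scripts.
scratch=$(mktemp -d)
printf '%s\n' 'RECIPE_FLAGS=(--chunked-prefill-size 4096)' > "$scratch/stale.sh"
printf '%s\n' 'RECIPE_FLAGS=(--dp-size "$TP" --enable-dp-attention)' > "$scratch/new.sh"

for f in "$scratch/stale.sh" "$scratch/new.sh"; do
  if script_supports_dp_attn "$f"; then
    echo "$(basename "$f"): ok for dp-attn=true rows"
  else
    echo "$(basename "$f"): would mislabel dp-attn=true rows"
  fi
done
```

A text match is a coarse signal, since it cannot tell whether the flag is reachable at a given CONC, but it would flag the case where the resolved script never mentions the flag at all.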
Why MTP is unaffected
For the new `dsv4-fp4-b300-sglang-mtp` config, `SPEC_SUFFIX=_mtp`, so the runner first checks for `dsv4_fp4_b300_sglang_mtp.sh` (which does not exist), then falls back to `dsv4_fp4_b300_mtp.sh`, which is the new MTP script added in this PR. So MTP works correctly; only the non-MTP config is broken.

Fix
Either:
- rename `benchmarks/single_node/dsv4_fp4_b300.sh` → `benchmarks/single_node/dsv4_fp4_b300_sglang.sh` (overwriting the stale file), or
- delete `benchmarks/single_node/dsv4_fp4_b300_sglang.sh` so the new bare-name file is found via fallback.

The first option is the more direct fix since it leaves the framework-tagged convention intact (matching what PR #1146 standardised).
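The first option can be illustrated in a scratch directory. This is a sketch under assumptions: the file contents are placeholders, and plain `mv -f` is used so the example runs anywhere; in a real checkout the equivalent commands would be `git mv -f` for the rename and `git rm` for the second option.

```bash
#!/usr/bin/env bash
# Scratch-directory demo of option 1; file contents are placeholders.
dir=$(mktemp -d)
printf 'new CONC dispatch\n' > "$dir/dsv4_fp4_b300.sh"
printf 'stale low-latency\n' > "$dir/dsv4_fp4_b300_sglang.sh"

# Option 1: rename the new script over the stale framework-tagged file.
mv -f "$dir/dsv4_fp4_b300.sh" "$dir/dsv4_fp4_b300_sglang.sh"

# Only the framework-tagged name remains, now holding the new dispatch logic.
ls "$dir"
cat "$dir/dsv4_fp4_b300_sglang.sh"

# Option 2 would instead be: rm "$dir/dsv4_fp4_b300_sglang.sh"
# leaving the bare-name file to be picked up via the fallback.
```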