Add B300 config: dsr1-fp8-sglang (non-MTP) #1050
**`benchmarks/single_node/dsr1_fp8_b300.sh`** (new file, `@@ -0,0 +1,113 @@`):

```bash
#!/usr/bin/env bash

# NOTE: At the time of submission, https://cookbook.sglang.io/autoregressive/DeepSeek/DeepSeek-R1
# does not have a B300-specific recipe, so this script reuses the existing
# DSR1 FP8 B200 SGLang recipe as-is until B300-specific tuning is available.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
  MODEL \
  TP \
  CONC \
  ISL \
  OSL \
  RANDOM_RANGE_RATIO \
  RESULT_FILENAME \
  EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
  echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
```
Check failure on line 28 in `benchmarks/single_node/dsr1_fp8_b300.sh`

**Comment on lines +27 to +28** (Contributor):

🔴 `dsr1_fp8_b300.sh` was copied verbatim from the B200 script and is missing two B300-specific adaptations that will cause suboptimal benchmark throughput on B300 hardware. First, lines 27-28 carry over the B200-specific environment variables.

**Extended reasoning**

**What the bugs are and how they manifest:**

Bug 1 — Missing

Bug 2 — B200-specific env vars carried into B300: lines 27-28 set `SGL_ENABLE_JIT_DEEPGEMM=false` and `SGLANG_ENABLE_FLASHINFER_GEMM=true`.

**Why the "reuse B200 recipe as-is" framing does not excuse these omissions:** One verifier argues these carry-overs are intentional because the PR description says it reuses the B200 recipe "as-is". However, this argument applies differently to the two issues. For Bug 1 (

**Step-by-step proof:**

**Fix:** Remove lines 27-28 (
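If one wanted a single script to serve both boards instead of removing the exports outright, the B200-tuned env vars could be gated on the detected GPU name. A minimal sketch; the helper `b200_gemm_env` and the `*B200*` name match are illustrative assumptions, not part of this PR (only the two flag names come from the script above):

```shell
# Hypothetical helper: emit the B200-tuned GEMM exports only when the GPU
# name (e.g. from `nvidia-smi --query-gpu=name --format=csv,noheader`)
# looks like a B200. On B300 and anything else, emit nothing.
b200_gemm_env() {
  case "$1" in
    *B200*)
      printf '%s\n' \
        "export SGL_ENABLE_JIT_DEEPGEMM=false" \
        "export SGLANG_ENABLE_FLASHINFER_GEMM=true"
      ;;
    *) : ;;  # no B200-specific overrides
  esac
}
```

The caller would then `eval "$(b200_gemm_env "$GPU_NAME")"`, keeping the B300 path free of B200 tuning.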
```bash
SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Default: recv every ~10 requests; if CONC ≥ 16, relax to ~30 requests between scheduler recv polls.
if [[ $TP -eq 8 ]]; then
  if [[ $CONC -ge 16 ]]; then
    SCHEDULER_RECV_INTERVAL=30
  else
    SCHEDULER_RECV_INTERVAL=10
  fi

  # Set these values (passed to --cuda-graph-max-bs and --max-running-requests)
  # to the maximum expected concurrency; this avoids unnecessary memory use.
  MAX_RUNNING_REQUESTS=128
  CUDA_GRAPH_MAX_BATCH_SIZE=128

  MEM_FRAC_STATIC=0.82
  CHUNKED_PREFILL_SIZE=32768
  MAX_PREFILL_TOKENS=32768
elif [[ $TP -eq 4 ]]; then
  if [[ $ISL -ne 8192 ]] || [[ $OSL -ne 1024 ]]; then
    echo "TP=4 not yet supported for ISL=$ISL OSL=$OSL!"
    exit 1
  fi

  # Set these values (passed to --cuda-graph-max-bs and --max-running-requests)
  # to the maximum expected concurrency; this avoids unnecessary memory use.
  MAX_RUNNING_REQUESTS=32
  CUDA_GRAPH_MAX_BATCH_SIZE=32

  MEM_FRAC_STATIC=0.95
  CHUNKED_PREFILL_SIZE=8192
  MAX_PREFILL_TOKENS=8192

  SCHEDULER_RECV_INTERVAL=10
else
  echo "Unrecognized TP size $TP!"
  exit 1
fi
echo "SCHEDULER_RECV_INTERVAL: $SCHEDULER_RECV_INTERVAL, CONC: $CONC, ISL: $ISL, OSL: $OSL"

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
  setup_eval_context
  EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
  --tensor-parallel-size=$TP --data-parallel-size=1 \
  --cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
  --mem-fraction-static $MEM_FRAC_STATIC --kv-cache-dtype fp8_e4m3 --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
  --enable-flashinfer-allreduce-fusion --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL --disable-radix-cache \
  --attention-backend trtllm_mla --stream-interval 30 --ep-size $EP_SIZE --moe-runner-backend flashinfer_trtllm --quantization fp8 $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
  --model "$MODEL" \
  --port "$PORT" \
  --backend vllm \
  --input-len "$ISL" \
  --output-len "$OSL" \
  --random-range-ratio "$RANDOM_RANGE_RATIO" \
  --num-prompts "$((CONC * 10))" \
  --max-concurrency "$CONC" \
  --result-filename "$RESULT_FILENAME" \
  --result-dir /workspace/

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
  run_eval --framework lm-eval --port "$PORT"
  append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
```
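The TP/CONC branching that picks `SCHEDULER_RECV_INTERVAL` can be factored into a small pure function so it can be checked without launching a server. A sketch; the function name `pick_recv_interval` is illustrative, and it mirrors only the interval selection (not the TP=4 ISL/OSL guard):

```shell
# pick_recv_interval TP CONC
# Echoes the scheduler recv interval the script would use:
#   TP=8, CONC>=16 -> 30; TP=8, CONC<16 -> 10; TP=4 -> 10; otherwise error.
pick_recv_interval() {
  local tp=$1 conc=$2
  if [ "$tp" -eq 8 ]; then
    if [ "$conc" -ge 16 ]; then echo 30; else echo 10; fi
  elif [ "$tp" -eq 4 ]; then
    echo 10
  else
    echo "Unrecognized TP size $tp!" >&2
    return 1
  fi
}
```

For example, `pick_recv_interval 8 32` prints `30`, while `pick_recv_interval 8 8` prints `10`.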
🟡 The new `dsr1_fp8_b300.sh` calls `hf download "$MODEL"` at line 25, but the B300 runner (`launch_b300-nv.sh`) overrides `MODEL` to a local filesystem path (e.g. `/scratch/models/DeepSeek-R1-0528`) before launching the container. This causes `hf download` to fail on every B300 CI run, producing error noise, though the script continues since there is no `set -e`. The fix is to remove the `hf download` call, as was done correctly in the only other B300 single-node script (`qwen3.5_fp8_b300_mtp.sh`).

**Extended reasoning**

**What the bug is:** `dsr1_fp8_b300.sh` was copied verbatim from the B200 script and includes a call to `hf download "$MODEL"` at line 25. On B200, `MODEL` remains a valid HuggingFace repo ID (e.g. `deepseek-ai/DeepSeek-R1-0528`), so the call succeeds. On B300, however, the runner overrides `MODEL` to a local path before this code runs.

**The specific code path that triggers it:** In `launch_b300-nv.sh`, the single-node (non-multinode) branch at line 220 executes `export MODEL="/scratch/models/${MODEL#*/}"`. This strips the HuggingFace org prefix and prepends the local scratch directory, so `deepseek-ai/DeepSeek-R1-0528` becomes `/scratch/models/DeepSeek-R1-0528` inside the container. When `dsr1_fp8_b300.sh` subsequently runs `hf download "/scratch/models/DeepSeek-R1-0528"`, it passes an absolute filesystem path as the repo ID, which is not a valid HuggingFace repository identifier.

**Why existing code doesn't prevent it:** The script has no `set -e`, so when `hf download` fails, execution continues. The SGLang server launch at the bottom uses `--model-path=$MODEL`, which correctly references the local path, so the benchmark itself runs fine. The download failure is silently swallowed, appearing only as error noise in CI logs.

**Impact:** Every B300 CI run for `dsr1-fp8-b300-sglang` will produce an error from the failed `hf download` call. While not functionally blocking (the model loads from the pre-cached local path), it pollutes CI logs, can mask real errors, and violates the established B300 pattern.

**How to fix it:** Remove line 25 (`hf download "$MODEL"`) from `dsr1_fp8_b300.sh`. This matches exactly what `qwen3.5_fp8_b300_mtp.sh` does; that script, the only other B300 single-node benchmark script, intentionally omits the download step because models are pre-cached at `/scratch/models/` on B300 runners.

**Step-by-step proof:**

1. CI triggers the `dsr1-fp8-b300-sglang` benchmark.
2. `launch_b300-nv.sh` sets `HF_HUB_CACHE_MOUNT="/scratch/models"` and runs `export MODEL="/scratch/models/${MODEL#*/}"`, so `MODEL` becomes `/scratch/models/DeepSeek-R1-0528`.
3. `dsr1_fp8_b300.sh` runs.
4. `hf download "/scratch/models/DeepSeek-R1-0528"` hands the HuggingFace CLI an absolute path instead of an `org/repo` identifier; the CLI rejects it as an invalid repo ID and exits with an error.
5. Without `set -e`, execution continues past the error.
6. The server launches with `--model-path=/scratch/models/DeepSeek-R1-0528` and loads the model correctly.
7. Net effect: a spurious `hf download` error on every run.