Add B300 config: dsr1-fp4-sglang (non-MTP) #1049
Changes from all commits: b14590a, ae5e0bb, 8ef6498, 20dbe26
dsr1_fp4_b300.sh (new file, +81 lines):

```bash
#!/usr/bin/env bash

# NOTE: At the time of submission, https://cookbook.sglang.io/autoregressive/DeepSeek/DeepSeek-R1
# does not have a B300-specific recipe, so this script reuses the existing
# DSR1 FP4 B200 SGLang recipe as-is until B300-specific tuning is available.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

hf download "$MODEL"

nvidia-smi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Default: recv every ~10 requests; if CONC >= 16, relax to ~30 requests between scheduler recv polls.
if [[ $CONC -ge 16 ]]; then
    SCHEDULER_RECV_INTERVAL=30
else
    SCHEDULER_RECV_INTERVAL=10
fi
echo "SCHEDULER_RECV_INTERVAL: $SCHEDULER_RECV_INTERVAL, CONC: $CONC, ISL: $ISL, OSL: $OSL"

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path $MODEL --host 0.0.0.0 --port $PORT --trust-remote-code \
    --tensor-parallel-size=$TP --data-parallel-size=1 \
    --cuda-graph-max-bs 256 --max-running-requests 256 --mem-fraction-static 0.85 --kv-cache-dtype fp8_e4m3 \
    --chunked-prefill-size 16384 \
    --ep-size $EP_SIZE --quantization modelopt_fp4 --enable-flashinfer-allreduce-fusion --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
    --enable-symm-mem --disable-radix-cache --attention-backend trtllm_mla --moe-runner-backend flashinfer_trtllm --stream-interval 10 $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts $((CONC * 10)) \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
```
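The concurrency-dependent scheduler interval in the script above can be exercised in isolation. This is a minimal sketch; the helper name `pick_recv_interval` is mine, not part of the PR, but its branch mirrors the `CONC`-based selection verbatim:

```shell
# Hypothetical helper mirroring the script's CONC-based selection:
# relax the scheduler recv poll interval once concurrency reaches 16.
pick_recv_interval() {
    local conc=$1
    if [[ $conc -ge 16 ]]; then
        echo 30
    else
        echo 10
    fi
}

pick_recv_interval 8    # -> 10 (low concurrency: poll frequently)
pick_recv_interval 16   # -> 30 (at the threshold, the relaxed interval applies)
pick_recv_interval 64   # -> 30
```

Note that the threshold is inclusive: `CONC=16` already gets the relaxed interval of 30, which matters because several configs in this repo benchmark at exactly that concurrency.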
runners/launch_b300-nv.sh:

```diff
@@ -216,10 +216,15 @@
 else
     HF_HUB_CACHE_MOUNT="/scratch/models"
-    export MODEL="/scratch/models/${MODEL#*/}"
+    # Qwen3.5-397B-A17B-FP8 is pre-staged under /scratch/models on the B300 cluster,
+    # so point MODEL at the local copy. Other models fall through and use `hf download`
+    # against the mounted cache from their benchmark script.
+    if [[ "$MODEL" == "Qwen/Qwen3.5-397B-A17B-FP8" ]]; then
+        export MODEL="/scratch/models/${MODEL#*/}"
+    fi
     SQUASH_FILE="/data/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
     FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
```
Check failure on line 227 in runners/launch_b300-nv.sh
Comment on lines 219 to 227 (Contributor):

🔴 The MODEL path rewrite in launch_b300-nv.sh was narrowed to apply only to Qwen/Qwen3.5-397B-A17B-FP8; for the new dsr1-fp4-b300-sglang config (model: nvidia/DeepSeek-R1-0528-FP4-V2), MODEL is never rewritten and remains the HuggingFace repo ID. Inside the B300 single-node container, HF_HUB_CACHE=/mnt/hf_hub_cache/ (set by CI) is exported via --export=ALL, but /mnt/hf_hub_cache/ is never mounted; only /scratch/models:/scratch/models is. As a result, `hf download nvidia/DeepSeek-R1-0528-FP4-V2` cannot find the pre-staged model and may attempt a ~600 GB internet download, and --model-path nvidia/DeepSeek-R1-0528-FP4-V2 will cause SGLang to fail to start. Fix: add nvidia/DeepSeek-R1-0528-FP4-V2 to the if condition in the single-node else branch so MODEL is rewritten to /scratch/models/DeepSeek-R1-0528-FP4-V2, matching the Qwen pattern and consistent with what the multinode branch already does for this model.

Extended reasoning:

What the bug is and how it manifests. PR #1035 introduced the B300 single-node runner and originally rewrote MODEL unconditionally: export MODEL="/scratch/models/${MODEL#*/}". This PR changed that to a conditional that only fires when MODEL == "Qwen/Qwen3.5-397B-A17B-FP8", with a comment saying other models "fall through and use hf download". For the new dsr1-fp4-b300-sglang config (model: nvidia/DeepSeek-R1-0528-FP4-V2), MODEL therefore remains the raw HuggingFace repo ID when the benchmark script is invoked.

The specific code path that triggers it. benchmark-tmpl.yml line 74 sets HF_HUB_CACHE=/mnt/hf_hub_cache/. The B300 single-node runner sets HF_HUB_CACHE_MOUNT="/scratch/models" and mounts it onto the same path in the container, i.e. /scratch/models:/scratch/models. It passes --export=ALL, which exports HF_HUB_CACHE=/mnt/hf_hub_cache/ into the container, but /mnt/hf_hub_cache/ is never mounted there. When dsr1_fp4_b300.sh runs inside the container, MODEL=nvidia/DeepSeek-R1-0528-FP4-V2, so line 23 (`hf download "$MODEL"`) resolves against HF_HUB_CACHE=/mnt/hf_hub_cache/, which does not exist in the container. The model cannot be found, and huggingface-cli will attempt a ~600 GB internet download or fail. Similarly, --model-path nvidia/DeepSeek-R1-0528-FP4-V2 on the SGLang launch command hits the same broken HF resolution path, preventing server startup entirely.

Why existing code does not prevent it. The PR comment says other models can use hf download against the mounted cache. However, the cache mount is at /scratch/models:/scratch/models, while HF_HUB_CACHE (exported from the host) points to /mnt/hf_hub_cache/. These paths are misaligned, so the HF cache lookup fails inside the container. B200 handles this correctly by mounting the host cache at the path HF_HUB_CACHE points to (i.e. /raid/hf_hub_cache/:/mnt/hf_hub_cache/), making the mount point match the env var. B300 does not: it mounts at /scratch/models but exports HF_HUB_CACHE=/mnt/hf_hub_cache/.

What the impact would be. The dsr1-fp4-b300-sglang benchmark fails entirely: SGLang cannot locate the pre-staged model weights, so the server never starts and all runs produce no results. At best, a ~600 GB download is attempted and times out; at worst the job fails immediately. This completely blocks the new config from producing any benchmark data.
```diff
     SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')

     # Pin to one of the known-good B300 nodes; others have hardware/network
```
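The fix the reviewer asks for can be sketched as follows. This is my illustration of extending the single-node else branch, not the PR's actual patch; a `case` statement is used here in place of the PR's `if`, and the default assignment exists only to make the snippet self-contained (in the real runner, MODEL is set by CI):

```shell
#!/usr/bin/env bash
# Sketch: rewrite MODEL to the pre-staged local copy for every checkpoint
# known to live under /scratch/models, instead of special-casing only Qwen.
MODEL=${MODEL:-nvidia/DeepSeek-R1-0528-FP4-V2}   # normally set by CI

case "$MODEL" in
    Qwen/Qwen3.5-397B-A17B-FP8|nvidia/DeepSeek-R1-0528-FP4-V2)
        # ${MODEL#*/} strips the org prefix ("nvidia/"), leaving the bare repo name.
        export MODEL="/scratch/models/${MODEL#*/}"
        ;;
esac

echo "$MODEL"   # -> /scratch/models/DeepSeek-R1-0528-FP4-V2
```

Any model not listed keeps its HuggingFace repo ID, which only works if the container's HF cache mount actually lines up with HF_HUB_CACHE; per the review comment above, on B300 it currently does not.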