Switch dsv4 B200 SGLang to DeepEP dp-attention recipe #1140
yhyang201 wants to merge 15 commits into SemiAnalysisAI:main from
Conversation
Adds the DeepSeek-V4-Flash B200 SGLang recipe from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4. Prefix caching and speculative decoding are disabled for baseline numbers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Uses deepseek-ai/DeepSeek-V4-Pro with tp=8, ep=8, dp-attention enabled and sweep concurrency ranges aligned with dsv4-fp4-b200-vllm (4-1024 at 1k/1k, 4-512 at 8k/1k). Script now passes --enable-dp-attention when DP_ATTENTION=true and sets --mem-fraction-static per the Pro recipe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
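The `DP_ATTENTION` toggle described in this commit can be sketched roughly as below. The helper name `build_server_args` is hypothetical; the flag values (`tp=8`, `ep=8`, `--mem-fraction-static`, `--enable-dp-attention`) are taken from the commit messages in this PR.

```shell
# Hypothetical sketch of how the benchmark script could assemble server
# flags: DP_ATTENTION=true appends --enable-dp-attention.
build_server_args() {
  local args="--tp 8 --ep 8 --mem-fraction-static 0.82"
  if [ "${DP_ATTENTION:-false}" = "true" ]; then
    args="$args --enable-dp-attention"
  fi
  echo "$args"
}
```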
Server launch now mirrors the DeepSeek-V4-Pro command from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4: --tp N, --moe-runner-backend flashinfer_mxfp4, --mem-fraction-static 0.82, SGLANG_JIT_DEEPGEMM_PRECOMPILE=0. Speculative decoding omitted and --disable-radix-cache added per the no-spec / no-prefix-cache baseline. YAML search-space drops ep/dp-attn to tp=8, ep=1. Also syncs runners/launch_b200-dgxc-slurm.sh with the HF cache mount path from origin/claude/add-dsv4-fp4-b200-vllm so both PRs stay in agreement on runner layout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
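A rough sketch of the baseline launch command this commit describes, echoed as a dry run. Only the flags named in the commit message are from the PR; `$MODEL` and `$PORT` are placeholders, and the exact argument order is an assumption.

```shell
# Sketch of the no-spec / no-prefix-cache baseline launch. Echoed as a
# dry run rather than executed (the real script starts the server).
MODEL=deepseek-ai/DeepSeek-V4-Pro
PORT=30000
cmd="SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 python3 -m sglang.launch_server \
  --model-path $MODEL --port $PORT --tp 8 \
  --moe-runner-backend flashinfer_mxfp4 \
  --mem-fraction-static 0.82 --disable-radix-cache"
echo "$cmd"
```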
The deepseek-v4-blackwell image doesn't expose sglang via the system python3, so the module import fails: `/usr/bin/python3: Error while finding module specification for 'sglang.launch_server' (ModuleNotFoundError: No module named 'sglang')`. Switch to the `sglang serve` entrypoint that the cookbook uses; the CLI resolves the correct interpreter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The lmsysorg/sglang:deepseek-v4-blackwell image installs sglang editable at /workspace/sglang/python — unlike every prior sglang tag which uses /sgl-workspace/sglang. Our $GITHUB_WORKSPACE:/workspace/ bind-mount masks that directory, breaking `import sglang`. Conditionally mount at /ix for this image only and make the dsv4 benchmark script use $PWD for server/metrics/result paths so it works regardless of the mount target. All other configs still mount at /workspace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
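The conditional mount-target selection described here could be sketched as a small helper. The function name `pick_mount` is hypothetical; the image name and the `/ix` vs `/workspace/` targets are taken from the commit message.

```shell
# Hypothetical sketch of the conditional bind-mount: only the
# deepseek-v4-blackwell image gets the /ix target, so its
# /workspace/sglang editable install is not masked by the repo mount.
pick_mount() {
  local image="$1"
  case "$image" in
    *deepseek-v4-blackwell*) echo "$GITHUB_WORKSPACE:/ix" ;;
    *)                       echo "$GITHUB_WORKSPACE:/workspace/" ;;
  esac
}
```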
The lmsysorg/sglang:deepseek-v4-blackwell image installs sglang editable at /workspace/sglang/python, which our $GITHUB_WORKSPACE:/workspace/ bind-mount masks. Temporary one-line workaround: `pip install --no-deps sglang` in the benchmark script to restore a non-editable copy in site-packages. Runner reverted to the standard /workspace mount. Marked with a TODO(Cam) for the proper fix once lmsys publishes an image that doesn't editable-install under /workspace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`pip install --no-deps sglang` is a no-op when sglang is already registered in site-packages (even if the underlying editable path is missing), so the prior workaround never actually swapped in a working install. Uninstall the broken egg-link first, then reinstall. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
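The uninstall-then-reinstall sequence described above, sketched as a dry run that echoes the commands instead of executing them. The function name `fix_sglang_install` is hypothetical.

```shell
# Sketch of the fixed workaround: pip skips a package it already sees in
# site-packages, so the broken editable registration must be removed
# before reinstalling. Dry run: echo instead of execute.
fix_sglang_install() {
  echo "pip uninstall -y sglang"       # drop the stale egg-link first
  echo "pip install --no-deps sglang"  # then restore a non-editable copy
}
fix_sglang_install
```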
Back to the proper mount fix so we use the same 'PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...' invocation as every other sglang single_node script. Conditional mount target keeps the blast radius to this one config. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The image ENV pins CUDA_VISIBLE_DEVICES=4,5,6,7 (leftover from lmsys's internal testing). With --no-container-entrypoint it isn't cleared, so the container only sees 4 GPUs and TP=8 fails with `torch.AcceleratorError: CUDA error: invalid device ordinal`. Unset it at the top of the script so Slurm's 8-GPU allocation is visible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
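The fix itself is a one-liner; the sketch below simulates the image's leftover ENV pin and then clears it, as the commit describes.

```shell
# Simulate the stale pin baked into the image ENV...
export CUDA_VISIBLE_DEVICES=4,5,6,7
# ...and the one-line fix at the top of the script: clear it so the
# process sees every GPU in Slurm's allocation.
unset CUDA_VISIBLE_DEVICES
```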
Only patched launch_b200-dgxc-slurm.sh last time; the b200-nb runner still had the default $GITHUB_WORKSPACE:/workspace/ mount, which masks the deepseek-v4-blackwell image's /workspace/sglang editable install. Most B200 jobs in this repo run on b200-nb. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ding Only replace the sglang launch command, keep all surrounding logic intact. Add PYTHONNOUSERSITE=1, SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1, SGLANG_OPT_USE_TOPK_V2=1 env prefixes. Switch to sglang serve with EAGLE speculative decoding (3 steps, topk=1, 4 draft tokens), chunked prefill 4096, and disable-flashinfer-autotune. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace single-node TP8 flashinfer_mxfp4 recipe with TP8/DP8 dp-attention + DeepEP MoE backend. EAGLE spec decoding reduced to 1 step / 2 draft tokens. Adds mega-MoE optimizations and DeepEP dispatch/combine config. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
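Assembled from the two commits above, the final recipe might look like the dry-run sketch below. The env prefixes, backend, parallelism, and speculative-decoding settings are from the commit messages; the exact speculative flag spellings follow SGLang's CLI conventions and are assumptions, not quoted from the diff.

```shell
# Hedged sketch of the DeepEP dp-attention + EAGLE recipe; echoed as a
# dry run rather than executed. Flag spellings for the speculative
# options are assumptions.
cmd="PYTHONNOUSERSITE=1 SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1 \
SGLANG_OPT_USE_TOPK_V2=1 sglang serve deepseek-ai/DeepSeek-V4-Pro \
  --tp 8 --dp 8 --enable-dp-attention --moe-a2a-backend deepep \
  --speculative-algorithm EAGLE --speculative-num-steps 1 \
  --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
  --chunked-prefill-size 4096 --disable-flashinfer-autotune"
echo "$cmd"
```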
EAGLE speculative decoding is enabled in the benchmark script, so the YAML search-space entries need spec-decoding: "mtp" to ensure correct classification in config generation and eval selection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
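The YAML change this commit describes might look like the fragment below. Only `spec-decoding: "mtp"` is from the commit message; the config key and surrounding fields are illustrative assumptions about the search-space layout.

```yaml
# Illustrative search-space entry; spec-decoding: "mtp" is from the
# commit, the surrounding keys are assumptions.
dsv4-fp4-b200-sglang:
  tp: 8
  dp: 8
  spec-decoding: "mtp"
```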
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b200-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24895270642
Copy of dsv4_fp4_b200.sh with --use-chat-template added to run_benchmark_serving, as required by AGENTS.md for MTP scripts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b200-sglang

@yhyang201 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24900476610

closing in favor of #1187
Summary
`--moe-a2a-backend deepep` and `--enable-dp-attention`

Test plan
🤖 Generated with Claude Code