Merged
73 commits
93db2e2
Day 0 DeepSeek V4 Pro FP4 GB200 disaggregated SGLang benchmarks
Oseltamivir Apr 25, 2026
1bc4c2e
Drop unsupported backend.connector field from sglang recipes
Oseltamivir Apr 25, 2026
c0d477d
Merge branch 'main' into dsv4-fp4-gb200-dynamo-sglang-disagg
Oseltamivir Apr 25, 2026
65b8b17
Drop dynamo: version: 0.8.1 — incompatible with deepseek-v4-grace-bla…
Oseltamivir Apr 25, 2026
9d883ba
Add dynamo: install: false — srtctl default is install=True
Oseltamivir Apr 25, 2026
1b75dd7
Pin dynamo to v1.2.0-sglang-deepseek-v4-dev.1 tag (hash 21f135f5)
Oseltamivir Apr 25, 2026
eb3f62c
Force deepep-mode: low_latency to work around mxfp4+DeepEP normal-dis…
Oseltamivir Apr 25, 2026
6c608df
Drop DeepEP / DP-attn / EP — fork-only mxfp4_deepseek bug, both dispa…
Oseltamivir Apr 25, 2026
2bb3ef0
Add moe-dense-tp-size: 1 — fix shared-experts FP8 block-quant divisib…
Oseltamivir Apr 25, 2026
d34d894
Set SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024 in all env bl…
Oseltamivir Apr 25, 2026
c24f25b
Switch to TP=4 single-node — match PR #75 verbatim, fix FP8 block-quant
Oseltamivir Apr 25, 2026
c0aec93
Merge branch 'main' into dsv4-fp4-gb200-dynamo-sglang-disagg
Oseltamivir Apr 25, 2026
8316d3f
Restore mi355x retry changelog entries clobbered by merge
Oseltamivir Apr 25, 2026
f089567
Switch back to TP=8: enable-dp-attention + moe-dense-tp-size: 1, no m…
Oseltamivir Apr 26, 2026
34e4a92
Merge branch 'main' into dsv4-fp4-gb200-dynamo-sglang-disagg
Oseltamivir Apr 26, 2026
5b6eb2f
Scope sweep to high-conc DeepEP only — temporarily comment 1p1d blocks
Oseltamivir Apr 26, 2026
b913586
tep fix + dep for high conc
Oseltamivir Apr 26, 2026
bca99eb
sike no dpa
Oseltamivir Apr 26, 2026
6c09973
Merge branch 'main' into dsv4-fp4-gb200-dynamo-sglang-disagg
Oseltamivir Apr 26, 2026
5866658
Cap SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK at 1024 — sglang L…
Oseltamivir Apr 26, 2026
c0fc3bb
Revert 3p1d-dep8-dep16 to no-DeepEP TP-only; uncomment full 1k/1k + 8…
Oseltamivir Apr 26, 2026
0526fa0
Merge branch 'main' into dsv4-fp4-gb200-dynamo-sglang-disagg
Oseltamivir Apr 26, 2026
30c2512
Merge branch 'main' into dsv4-fp4-gb200-dynamo-sglang-disagg
Oseltamivir Apr 27, 2026
bc9fccf
Try moe-a2a-backend: flashinfer on 3p1d-dep8-dep16 for high-conc EP
Oseltamivir Apr 27, 2026
8ea8e77
Merge branch 'main' into dsv4-fp4-gb200-dynamo-sglang-disagg
Oseltamivir Apr 27, 2026
e6d8943
Revert flashinfer EP attempt — accept TP-only pareto, every EP backen…
Oseltamivir Apr 27, 2026
90304df
Merge branch 'main' into dsv4-fp4-gb200-dynamo-sglang-disagg
Oseltamivir Apr 27, 2026
1d27533
fix(sglang): bump 8k1k prefill max-running-requests from 4 to 8
Oseltamivir Apr 27, 2026
a172069
Merge branch 'main' into dsv4-fp4-gb200-dynamo-sglang-disagg
Oseltamivir Apr 27, 2026
df1c783
ports
Oseltamivir Apr 28, 2026
513cbef
Dsv4 fp4 gb200 dynamo sglang disagg (#1213)
ch-wan Apr 28, 2026
fa876e3
Merge branch 'main' into dsv4-fp4-gb200-dynamo-sglang-disagg
Oseltamivir Apr 28, 2026
b27c8da
adapt for model path, etc
Oseltamivir Apr 28, 2026
0dbc9a4
dev
ch-wan Apr 28, 2026
ba72558
upd
ch-wan Apr 28, 2026
7c81fe9
fix
ch-wan Apr 28, 2026
7a1daaf
fix
ch-wan Apr 28, 2026
8ce4965
Merge branch 'main' into dsv4-fp4-gb200-dynamo-sglang-disagg
ch-wan Apr 28, 2026
c454ad3
test
ch-wan Apr 28, 2026
bac301d
add gb300
ch-wan Apr 28, 2026
1167f64
upd
ch-wan Apr 28, 2026
cfae9ae
fix
ch-wan Apr 28, 2026
8aa71cd
Merge commit '06596136c1e0115106ed051af12ca630796b228e' into dsv4-fp4…
ch-wan Apr 28, 2026
0443a1f
fix
ch-wan Apr 28, 2026
387726d
fix
ch-wan Apr 29, 2026
fe6815c
fix(launch_gb300-cw): register deepseek-v4-pro alias in model_paths
ch-wan Apr 29, 2026
b4d6c19
fix(launch_gb300-cw): pull arm64 squash and force fresh import per ru…
ch-wan Apr 29, 2026
cad94c9
fix(launch_gb300-cw): use enroot --arch aarch64, not arm64
ch-wan Apr 29, 2026
d6fc0e7
fix(launch_gb300-cw): use pre-staged arm64 sqsh, drop in-CI enroot im…
ch-wan Apr 29, 2026
da6f892
fix(launch_gb300-cw): persist dynamo wheel cache and ulimit preamble
ch-wan Apr 29, 2026
28d03e8
fix(sglang/dsv4/8k1k recipes): set cpus-per-task=144 for dynamo build
ch-wan Apr 29, 2026
16113f8
fix(sglang/dsv4/8k1k recipes): set cpus-per-task=144 and mem=0
ch-wan Apr 29, 2026
ade5488
fix(launch_gb300-cw): pin srt-slurm fork with parallel sa-bench
ch-wan Apr 29, 2026
b19eb9a
merge: origin/main into dsv4-fp4-gb200-dynamo-sglang-disagg
ch-wan Apr 29, 2026
152a059
fix(launch_gb300-cw): bump srt-slurm fork pin to minimal multiproc patch
fzyzcjy Apr 29, 2026
c435a65
ci: temporarily comment out conc-list:[64] 2p1d entry
fzyzcjy Apr 29, 2026
be12dba
ci(eval): temporarily skip dsv4-fp4-gb300 dynamo-sglang eval-only entry
fzyzcjy Apr 29, 2026
38acd77
bench(7p1d-dep4-dep8): swap sa-bench default for yangminl's gb300-cw …
fzyzcjy Apr 29, 2026
22c5e67
config(7p1d-dep4-dep8): align with job 564 — multi-frontend, sbatch d…
fzyzcjy Apr 29, 2026
15423f1
config(7p1d-dep4-dep8): keep PR name field, revert to original
fzyzcjy Apr 29, 2026
cba5297
Merge remote-tracking branch 'origin/main' into dsv4-fp4-gb200-dynamo…
fzyzcjy Apr 29, 2026
a1a6f8d
upd
ch-wan Apr 29, 2026
b146b86
fix
ch-wan Apr 29, 2026
f521e2e
Merge commit '3cfb0b9620ad1f11f9d9412409fb2f67a757c3d7' into dsv4-fp4…
ch-wan Apr 29, 2026
c843c0d
fix
ch-wan Apr 29, 2026
927edfe
middle
ch-wan Apr 29, 2026
c14d06d
fi
ch-wan Apr 29, 2026
7d977cf
Merge commit '182c80aaecb80fc79a074cc38876235a32013bcd' into dsv4-fp4…
ch-wan Apr 29, 2026
5e86ffc
fix
ch-wan Apr 29, 2026
5776fd5
upd
ch-wan Apr 30, 2026
b472c78
Merge commit '49651ae6b535c4df02e132d2a9877eb2a5c6ca30' into dsv4-fp4…
ch-wan Apr 30, 2026
fce13d0
fix
ch-wan Apr 30, 2026
484763a
upd
ch-wan Apr 30, 2026
98 changes: 98 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -7748,3 +7748,101 @@ dsv4-fp4-gb200-dynamo-vllm:
tp: 8
ep: 8
dp-attn: true

dsv4-fp4-gb300-dynamo-sglang:
image: lmsysorg/sglang:deepseek-v4-grace-blackwell
model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: gb300-cw
precision: fp4
framework: dynamo-sglang
multinode: true
disagg: true
seq-len-configs:
- isl: 8192
osl: 1024
search-space:
# WideEP TP=16 decode: 1p1d-dep4-dep16. 5 nodes (4P + 16D = 20 GPUs).
- conc-list: [512]
prefill:
num-worker: 1
tp: 4
ep: 4
dp-attn: true
additional-settings:
- "CONFIG_FILE=recipes/sglang/deepseek-v4/8k1k/conc512-20.yaml"
decode:
num-worker: 1
tp: 16
ep: 16
dp-attn: true
# DP-attn wideep: 1p1d-dep4-dep8. 3 nodes.
- conc-list: [512]
prefill:
num-worker: 1
tp: 4
ep: 4
dp-attn: true
additional-settings:
- "CONFIG_FILE=recipes/sglang/deepseek-v4/8k1k/conc512.yaml"
decode:
num-worker: 1
tp: 8
ep: 8
dp-attn: true
# DP-attn wideep: 2p1d-dep4-dep8. 4 nodes.
- conc-list: [1024]
prefill:
num-worker: 2
tp: 4
ep: 4
dp-attn: true
additional-settings:
- "CONFIG_FILE=recipes/sglang/deepseek-v4/8k1k/conc1024.yaml"
decode:
num-worker: 1
tp: 8
ep: 8
dp-attn: true
# Low concurrency
- conc-list: [1]
prefill:
num-worker: 1
tp: 4
ep: 1
dp-attn: false
additional-settings:
- "CONFIG_FILE=recipes/sglang/deepseek-v4/8k1k/conc1.yaml"
decode:
num-worker: 1
tp: 4
ep: 1
dp-attn: false
# Mid concurrency
- conc-list: [2048]
prefill:
num-worker: 4
tp: 4
ep: 4
dp-attn: true
additional-settings:
- "CONFIG_FILE=recipes/sglang/deepseek-v4/8k1k/conc2048.yaml"
decode:
num-worker: 1
tp: 8
ep: 8
dp-attn: true
# Max concurrency
- conc-list: [16384]
prefill:
num-worker: 14
tp: 4
ep: 4
dp-attn: true
additional-settings:
- "CONFIG_FILE=recipes/sglang/deepseek-v4/8k1k/conc16384.yaml"
decode:
num-worker: 1
tp: 16
ep: 16
dp-attn: true
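The node accounting in the comments above ("5 nodes (4P + 16D = 20 GPUs)", "3 nodes", "4 nodes") can be reproduced with a short script. A minimal sketch, assuming gb300 nodes carry 4 GPUs each (as in the recipe's `resources` block) and that each worker occupies num-worker × tp GPUs on dedicated nodes:

```python
import math

GPUS_PER_NODE = 4  # gb300: 4 GPUs per node, per the resources block in these recipes

def footprint(prefill, decode):
    """Return (total_gpus, total_nodes) for one search-space entry."""
    p_gpus = prefill["num-worker"] * prefill["tp"]
    d_gpus = decode["num-worker"] * decode["tp"]
    # Prefill and decode workers are scheduled on separate nodes.
    nodes = math.ceil(p_gpus / GPUS_PER_NODE) + math.ceil(d_gpus / GPUS_PER_NODE)
    return p_gpus + d_gpus, nodes

# 1p1d-dep4-dep16 entry: 1 prefill worker TP=4, 1 decode worker TP=16
print(footprint({"num-worker": 1, "tp": 4}, {"num-worker": 1, "tp": 16}))  # (20, 5)
```

Applied to the other entries it reproduces the 3-node (1p1d-dep4-dep8) and 4-node (2p1d-dep4-dep8) figures as well.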
5 changes: 5 additions & 0 deletions .github/configs/runners.yaml
@@ -139,3 +139,8 @@ gb300:
- 'gb300-nv_0'
- 'gb300-nv_1'
- 'gb300-nv_2'
gb300-cw:
- 'gb300-cw_0'
- 'gb300-cw_1'
- 'gb300-cw_2'
- 'gb300-cw_3'
@@ -0,0 +1,167 @@
name: "conc1"

# 8k/1k recipe for the wideep DSV4-Pro setup (this file: the conc=1,
# TP-only low-concurrency point of the sweep).
#
# Schema/values come from PR #1213 (513cbef) — that PR introduced the
# `dsv4-pro-gb300-fp4` upstream-style recipe with two `zip_override`
# variants (wideep [0] / narrow_ep [1]) and `backend.benchmark`. Our
# pinned srtctl (NVIDIA/srt-slurm @ sa-submission-q2-2026) doesn't
# support either: `zip_override_*_hightpt` rejects with `Unknown field`
# and `benchmark` only validates at top level. So this file inlines the
# wideep [0] override and lifts `benchmark` back out — same operational
# values, schema the pinned srtctl will accept.
#
# Other adjustments back to the InferenceX cluster shape: container &
# model.path restored to the aliases mapped in launch_gb300.sh's
# srtslurm.yaml (`lmsysorg/sglang:deepseek-v4-grace-blackwell` and
# `deepseek-v4-pro`); `dynamo.install: true` added so the container
# (which has no dynamo baked in) installs from the pinned hash.
#
# Cluster-specific items NOT inlined (require InferenceX-side equivalents):
# - slurm.partition (yangminl's gb300-cw uses `hpc-mid`)
# - frontend.nginx_container (yangminl's `nginx-1.27.4.sqsh` path)
# - extra_mount: yangminl/sglang-patched/sglang. Earlier diff analysis
# showed only `expert_location_dispatch.py` topk_ids int32 cast is an
# active runtime diff vs container sglang; other patched files are
# env-gated dead code under the same SGLANG_OPT_* flags this yaml
# already sets.
#
# DG-related env intentionally diverged (DG cache path is host-specific):
# - SGLANG_DG_CACHE_DIR=/configs/deepgemm_cache (yangminl host)
# - SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 (yangminl uses prebuilt cache)
# This yaml uses SGLANG_JIT_DEEPGEMM_FAST_WARMUP=1 instead.

model:
path: "deepseek-v4-pro"
container: "lmsysorg/sglang:deepseek-v4-grace-blackwell"
precision: "fp4"

# See ../1k1k/disagg-gb200-1p1d-dep8-tep8.yaml for the dynamo pin
# rationale. Hash bumped from PR #1213 to track the dynamo-sglang dsv4
# dev branch.
dynamo:
hash: "9d3c913d300eb368cda28b3f98a23a5762621e0d"
install: true

slurm:
time_limit: "03:00:00"

# Match yangminl's working all-dynamo.yaml on the same gb300-cw cluster:
# cpus-per-task=144 — without it slurm hands out 1 CPU/task, turning
# the dynamo `hash:` cold source build (~500 rust crates, the
# ravif/exr/zip/pyo3 stack) into a 30+ min serial compile; with 144,
# cargo finishes in ~5 min.
# mem=0 — slurm's "allocate the whole node's memory"; needed so sglang
# can load the 671B FP4 weights while the dynamo build runs, without
# OOM.
sbatch_directives:
cpus-per-task: "144"
mem: "0"
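These directives presumably end up as `#SBATCH` preamble lines in the generated job script; a hypothetical rendering helper sketching that mapping (the actual srtctl emission logic is not part of this diff):

```python
def render_sbatch(directives: dict) -> str:
    # Each key/value pair becomes one `#SBATCH --key=value` preamble line.
    return "\n".join(f"#SBATCH --{k}={v}" for k, v in directives.items())

print(render_sbatch({"cpus-per-task": "144", "mem": "0"}))
# #SBATCH --cpus-per-task=144
# #SBATCH --mem=0
```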

# Topology: 1 prefill worker (TP=4, 1 node) + 1 decode worker (TP=4,
# 1 node). 2 nodes total.
resources:
gpu_type: "gb300"
gpus_per_node: 4
prefill_nodes: 1
prefill_workers: 1
gpus_per_prefill: 4
decode_nodes: 1
decode_workers: 1
gpus_per_decode: 4

frontend:
type: dynamo
enable_multiple_frontends: true
num_additional_frontends: 8

backend:
type: sglang

prefill_environment:
PYTHONUNBUFFERED: "1"
SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
SGLANG_ENABLE_THINKING: "1"
SGLANG_REASONING_EFFORT: "max"
SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: "1"
SGLANG_OPT_USE_JIT_NORM: "1"
SGLANG_OPT_USE_JIT_INDEXER_METADATA: "1"
SGLANG_OPT_USE_TOPK_V2: "1"
NCCL_MNNVL_ENABLE: "1"
NCCL_CUMEM_ENABLE: "1"
SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
MC_FORCE_MNNVL: "1"
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1"

decode_environment:
PYTHONUNBUFFERED: "1"
SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
SGLANG_ENABLE_THINKING: "1"
SGLANG_REASONING_EFFORT: "max"
SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: "1"
SGLANG_OPT_USE_JIT_NORM: "1"
SGLANG_OPT_USE_JIT_INDEXER_METADATA: "1"
SGLANG_OPT_USE_TOPK_V2: "1"
NCCL_MNNVL_ENABLE: "1"
NCCL_CUMEM_ENABLE: "1"
SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
MC_FORCE_MNNVL: "1"
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1"
# SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2
# is single-node only and corrupts results in 2-node decode setups.
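The prefill and decode environment blocks above are intentionally identical (only the CAR_V2 comment differs), so the two literal copies can silently drift apart when one is hand-edited. A small parity check, sketched with hypothetical, abbreviated env dicts:

```python
# Representative subset of the ~15 vars both blocks carry (abbreviated here).
prefill_env = {
    "PYTHONUNBUFFERED": "1",
    "NCCL_MNNVL_ENABLE": "1",
    "SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT": "100000",
}
decode_env = dict(prefill_env)

def env_drift(a: dict, b: dict) -> dict:
    """Keys whose values differ between the two blocks, or exist on one side only."""
    return {k: (a.get(k), b.get(k))
            for k in a.keys() | b.keys() if a.get(k) != b.get(k)}

print(env_drift(prefill_env, decode_env))  # {} — the blocks match
```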

sglang_config:
prefill:
served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
model-path: "/model/"
trust-remote-code: true
disable-radix-cache: true

disaggregation-mode: "prefill"
disaggregation-transfer-backend: mooncake

tensor-parallel-size: 4
data-parallel-size: 1
expert-parallel-size: 1

moe-runner-backend: "flashinfer_mxfp4"
disable-flashinfer-autotune: true

mem-fraction-static: 0.90
max-running-requests: 512
cuda-graph-max-bs: 512
chunked-prefill-size: 32768

decode:
served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
model-path: "/model/"
trust-remote-code: true
disable-radix-cache: true

disaggregation-mode: "decode"
disaggregation-transfer-backend: mooncake

tensor-parallel-size: 4
data-parallel-size: 1
expert-parallel-size: 1

moe-runner-backend: "flashinfer_mxfp4"
disable-flashinfer-autotune: true

mem-fraction-static: 0.9
max-running-requests: 1024
cuda-graph-max-bs: 512
swa-full-tokens-ratio: 0.1
context-length: 16384

benchmark:
type: "sa-bench"
isl: 8192
osl: 1024
concurrencies: "1"
req_rate: "inf"
use_chat_template: false