diff --git a/.github/configs/CONFIGS.md b/.github/configs/CONFIGS.md
index 9d3c24309..302605fbb 100644
--- a/.github/configs/CONFIGS.md
+++ b/.github/configs/CONFIGS.md
@@ -47,6 +47,58 @@ Notes:
 - No extra fields besides the ones listed may be specified, or else the benchmarks will fail to run.
 - Setting the fields above, particularly `ep` and `dp-attn`, only guarantees that the respective values will be passed as environment variables to the benchmark scripts! Actually using those environment variables is an implementation detail at the level of the benchmark Bash script.
+
+## Multi-node srt-slurm recipes
+
+Multi-node configs that dispatch via `srt-slurm` (i.e. `srtctl apply -f …`) reference their recipe as a first-class field on the search-space entry:
+
+```yaml
+search-space:
+- spec-decoding: "mtp"
+  conc-list: [1214]
+  recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch64_eplb0_mtp2.yaml"
+  prefill:
+    num-worker: 1
+    tp: 4
+    ep: 4
+    dp-attn: true
+  decode:
+    num-worker: 2
+    tp: 8
+    ep: 8
+    dp-attn: true
+```
+
+- `recipe` is a path **relative to `benchmarks/multi_node/srt-slurm-recipes/`** in this repo. The schema validator rejects entries whose recipe file does not exist on disk, so adding a new entry requires upstreaming the recipe yaml here first.
+- The path may carry an `:<override-name>[N]` suffix to select a named override section inside an sglang-style recipe yaml (e.g. `"dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml:zip_override_mtp_lowlat[0]"`). The launcher strips this suffix before reading the file but passes the full string to `srtctl`.
+- `recipe` is optional: multi-node entries that do *not* go through srt-slurm (e.g. dynamo-sglang aggregated topologies that drive their own bash) leave it unset.
+- Recipes live under `benchmarks/multi_node/srt-slurm-recipes/` organized as `<model>/<framework>/<gpu>-<precision>/<isl><osl>/<serving-mode>/<spec-mode>/<recipe>.yaml` — e.g. `dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch64_eplb0_mtp2.yaml`.
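The path and override-suffix rules above can be sketched roughly as follows. This is a hypothetical illustration only — `resolve_recipe`, its signature, and `RECIPE_ROOT` are invented names for this example, not the launcher's actual API:

```python
import os

# Repo-relative root the `recipe` field is resolved against (per the docs above).
RECIPE_ROOT = "benchmarks/multi_node/srt-slurm-recipes"


def resolve_recipe(recipe: str, root: str = RECIPE_ROOT) -> tuple[str, str]:
    """Return (absolute path for CONFIG_FILE, full string forwarded to srtctl).

    An optional ':<override-name>[N]' suffix is stripped only for the on-disk
    existence check; srtctl still receives the untouched string.
    """
    path_part = recipe.split(":", 1)[0]  # drop e.g. ':zip_override_mtp_lowlat[0]'
    abs_path = os.path.abspath(os.path.join(root, path_part))
    if not os.path.isfile(abs_path):
        # Mirrors the schema validator rejecting entries whose recipe is missing.
        raise FileNotFoundError(f"recipe not found: {abs_path}")
    return abs_path, recipe
```

The key property is that the override suffix never reaches the filesystem check, while `srtctl` sees the recipe reference exactly as written in the master yaml.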
+A handful of sglang-style files that carry override sections spanning both stp and mtp are parked one level shallower (the trailing `mtp`/`stp` segment is omitted). The benchmark template resolves `recipe` to an absolute path and passes it to the launcher as `CONFIG_FILE`, so launchers do not see the relative form.
+
+### Custom-script benchmarking
+
+Recipes are migrating from srt-slurm's bundled `benchmark.type: sa-bench` to `benchmark.type: custom`, so that the benchmark client lives in this repo (`utils/bench_serving/benchmark_serving.py`) instead of being maintained twice. New shape:
+
+```yaml
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"  # per prefill worker (filename component)
+    DECODE_GPUS: "8"   # per decode worker (filename component)
+    TOTAL_GPUS: "20"   # sum across workers (filename component)
+    # MODEL_NAME: "..."          # only when server's served-model-name
+    #                            # differs from master-yaml's `model:`
+    # USE_CHAT_TEMPLATE: "false" # only when overriding default (true)
+```
+
+`MODEL`, `ISL`, `OSL`, `CONC_LIST`, `DISAGG`, and `RANDOM_RANGE_RATIO` are exported by `benchmark-multinode-tmpl.yml` at the workflow step and propagate through the launcher → `srtctl` → `srun` (default `--export=ALL`) → pyxis into the benchmark container, so they don't need to be re-declared in `benchmark.env`. The recipe only carries per-recipe topology knobs (`PREFILL_GPUS`/`DECODE_GPUS`/`TOTAL_GPUS`, used in the result filename) plus the rare overrides (`MODEL_NAME` when the server's served-model-name diverges from `model:`, `USE_CHAT_TEMPLATE: false` for tokenizers that have no chat template, etc.).
+
+`benchmarks/multi_node/srt_bench.sh` is a thin wrapper around `run_benchmark_serving()` in `benchmarks/benchmark_lib.sh` (the same shim every single-node bench script uses).
+It loops once per concurrency in `$CONC_LIST` and writes results to `/logs/sa-bench_isl_<ISL>_osl_<OSL>/results_concurrency_<concurrency>_gpus_<TOTAL_GPUS>_ctx_<PREFILL_GPUS>_gen_<DECODE_GPUS>.json` so existing launcher result-harvesters pick them up unchanged. The tokenizer is loaded from `/model` — `srtctl`'s `RuntimeContext.create` auto-mounts the model dir at that path in every container, so we don't need any HF Hub egress.
+
+The `container_mounts` block bind-mounts the host-side `$INFMAX_WORKSPACE` (set by the launcher to `$GITHUB_WORKSPACE`) at `/infmax-workspace` inside srt-slurm's benchmark container, so the wrapper and bench client are reachable at known paths. `srtctl` resolves `$INFMAX_WORKSPACE` via `os.path.expandvars` at submission time.
+
 ## Runners
 
 The `runners.yaml` config represents the available runners in the repository. The keys are the runner *types* (i.e., the GPUs as well as some specific combinations like `b200-trt`) whereas the value is a list of *runner nodes*. This config is used to verify the master configs.
diff --git a/.github/configs/nvidia-master.yaml b/.github/configs/nvidia-master.yaml index f13b8b6dd..eb8ad8678 100644 --- a/.github/configs/nvidia-master.yaml +++ b/.github/configs/nvidia-master.yaml @@ -13,14 +13,12 @@ dsr1-fp4-b200-dynamo-trt: search-space: - spec-decoding: "mtp" conc-list: [1214] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch64_eplb0_mtp2.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/mtp/ctx1_gen2_dep8_batch64_eplb0_mtp2.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/mtp/ctx1_gen2_dep8_batch64_eplb0_mtp2.yaml" decode: num-worker: 2 tp: 8 @@ -28,14 +26,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [875] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - #
"CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp3.yaml" decode: num-worker: 5 tp: 8 @@ -43,14 +39,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [6] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml" decode: num-worker: 5 tp: 8 @@ -58,14 +52,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [10, 15, 25, 45, 90, 180] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml" decode: num-worker: 5 tp: 8 @@ -73,14 +65,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [ 4968 ] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx3_gen4_dep8_batch128_eplb0_mtp1.yaml" prefill: num-worker: 3 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/mtp/ctx3_gen4_dep8_batch128_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/mtp/ctx3_gen4_dep8_batch128_eplb0_mtp1.yaml" decode: num-worker: 4 tp: 8 @@ -88,14 +78,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [10860] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx3_gen5_dep4_batch512_eplb0_mtp1.yaml" prefill: num-worker: 3 tp: 4 ep: 4 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/mtp/ctx3_gen5_dep4_batch512_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/mtp/ctx3_gen5_dep4_batch512_eplb0_mtp1.yaml" decode: num-worker: 5 tp: 4 @@ -104,84 +92,72 @@ dsr1-fp4-b200-dynamo-trt: # Non-MTP configurations - conc-list: [4096] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [2192] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen2_dep8_batch128_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen2_dep8_batch128_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen2_dep8_batch128_eplb0_mtp0.yaml" decode: num-worker: 2 tp: 8 ep: 8 dp-attn: true - conc-list: [1365] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_dep8_batch32_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen5_dep8_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen5_dep8_batch32_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 8 ep: 8 dp-attn: true - conc-list: [6] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 8 ep: 8 dp-attn: false - conc-list: [10, 15, 25, 45, 90, 180] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch32_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen5_tep8_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen5_tep8_batch32_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 8 ep: 8 dp-attn: false - conc-list: [450] + recipe: "dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen6_tep8_batch64_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen6_tep8_batch64_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/1k1k/stp/ctx1_gen6_tep8_batch64_eplb0_mtp0.yaml" decode: num-worker: 6 tp: 8 @@ -193,14 +169,12 @@ dsr1-fp4-b200-dynamo-trt: search-space: - spec-decoding: "mtp" conc-list: [90] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen1_dep8_batch8_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/mtp/ctx1_gen1_dep8_batch8_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/mtp/ctx1_gen1_dep8_batch8_eplb0_mtp3.yaml" decode: num-worker: 1 tp: 8 @@ -208,14 +182,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [66] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen3_tep8_batch16_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/mtp/ctx1_gen3_tep8_batch16_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/mtp/ctx1_gen3_tep8_batch16_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 8 @@ -223,14 +195,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [6] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml" decode: num-worker: 5 tp: 8 @@ -238,14 +208,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [10, 15, 30, 60] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen5_tep8_batch8_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/mtp/ctx1_gen5_tep8_batch8_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/mtp/ctx1_gen5_tep8_batch8_eplb0_mtp3.yaml" decode: num-worker: 5 tp: 8 @@ -253,14 +221,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [548] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3.yaml" prefill: num-worker: 3 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3.yaml" decode: num-worker: 1 tp: 8 @@ -268,14 +234,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [1096, 1691] + recipe: 
"dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx5_gen1_dep8_batch192_eplb0_mtp1.yaml" prefill: num-worker: 5 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/mtp/ctx5_gen1_dep8_batch192_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/mtp/ctx5_gen1_dep8_batch192_eplb0_mtp1.yaml" decode: num-worker: 1 tp: 8 @@ -283,14 +247,12 @@ dsr1-fp4-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [658] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx5_gen2_dep8_batch32_eplb0_mtp3.yaml" prefill: num-worker: 5 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/mtp/ctx5_gen2_dep8_batch32_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/mtp/ctx5_gen2_dep8_batch32_eplb0_mtp3.yaml" decode: num-worker: 2 tp: 8 @@ -299,84 +261,72 @@ dsr1-fp4-b200-dynamo-trt: # Non-MTP configurations - conc-list: [6] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 8 ep: 8 dp-attn: false - conc-list: [10, 15, 25, 50, 100] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx1_gen5_tep8_batch8_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/stp/ctx1_gen5_tep8_batch8_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/stp/ctx1_gen5_tep8_batch8_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 8 ep: 8 dp-attn: false - conc-list: [370] + recipe: 
"dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx2_gen5_tep8_batch64_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/stp/ctx2_gen5_tep8_batch64_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/stp/ctx2_gen5_tep8_batch64_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 8 ep: 8 dp-attn: false - conc-list: [1606] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx4_gen1_dep8_batch192_eplb0_mtp0.yaml" prefill: num-worker: 4 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/stp/ctx4_gen1_dep8_batch192_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/stp/ctx4_gen1_dep8_batch192_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [837] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx4_gen3_dep8_batch32_eplb0_mtp0.yaml" prefill: num-worker: 4 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/stp/ctx4_gen3_dep8_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/stp/ctx4_gen3_dep8_batch32_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 8 ep: 8 dp-attn: true - conc-list: [2222] + recipe: "dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx7_gen2_dep8_batch128_eplb0_mtp0.yaml" prefill: num-worker: 7 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/8k1k/stp/ctx7_gen2_dep8_batch128_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp4/8k1k/stp/ctx7_gen2_dep8_batch128_eplb0_mtp0.yaml" decode: num-worker: 2 tp: 8 @@ -399,14 +349,12 @@ dsr1-fp8-b200-dynamo-trt: # MTP configurations - Low latency (TP attention) - spec-decoding: "mtp" conc-list: [8] + recipe: 
"dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_8.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_8.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_8.yaml" decode: num-worker: 8 tp: 8 @@ -414,14 +362,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [32] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch4_eplb0_mtp3_32.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen8_tp8_batch4_eplb0_mtp3_32.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen8_tp8_batch4_eplb0_mtp3_32.yaml" decode: num-worker: 8 tp: 8 @@ -429,14 +375,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [64] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch8_eplb0_mtp3_64.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen8_tp8_batch8_eplb0_mtp3_64.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen8_tp8_batch8_eplb0_mtp3_64.yaml" decode: num-worker: 8 tp: 8 @@ -444,14 +388,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [256] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch32_eplb0_mtp3_256.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen8_tp8_batch32_eplb0_mtp3_256.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen8_tp8_batch32_eplb0_mtp3_256.yaml" decode: num-worker: 8 tp: 8 @@ -460,14 +402,12 
@@ dsr1-fp8-b200-dynamo-trt: # MTP configurations - High throughput (DP attention) - spec-decoding: "mtp" conc-list: [896] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen7_dep8_batch128_eplb0_mtp3_896.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen7_dep8_batch128_eplb0_mtp3_896.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen7_dep8_batch128_eplb0_mtp3_896.yaml" decode: num-worker: 7 tp: 8 @@ -475,14 +415,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [1024] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen4_dep8_batch256_eplb0_mtp3_1024.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen4_dep8_batch256_eplb0_mtp3_1024.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen4_dep8_batch256_eplb0_mtp3_1024.yaml" decode: num-worker: 4 tp: 8 @@ -490,14 +428,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [1184] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen3_dep8_batch384_eplb0_mtp3_1184.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen3_dep8_batch384_eplb0_mtp3_1184.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen3_dep8_batch384_eplb0_mtp3_1184.yaml" decode: num-worker: 3 tp: 8 @@ -505,14 +441,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [1600] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen2_dep8_batch768_eplb0_mtp2_1600.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen2_dep8_batch768_eplb0_mtp2_1600.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/mtp/ctx1_gen2_dep8_batch768_eplb0_mtp2_1600.yaml" decode: num-worker: 2 tp: 8 @@ -521,42 +455,36 @@ dsr1-fp8-b200-dynamo-trt: # Non-MTP (STP) configurations - Low latency (TP attention) - conc-list: [4] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_4.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_4.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_4.yaml" decode: num-worker: 3 tp: 8 ep: 1 dp-attn: false - conc-list: [32] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_32.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_32.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_32.yaml" decode: num-worker: 3 tp: 8 ep: 1 dp-attn: false - conc-list: [128] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_128.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_128.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_128.yaml" decode: num-worker: 3 tp: 8 @@ -564,42 +492,36 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: false # Non-MTP (STP) configurations - High throughput (DP attention) - conc-list: [1920] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen5_dep8_batch48_eplb0_mtp0_1920.yaml" 
prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen5_dep8_batch48_eplb0_mtp0_1920.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen5_dep8_batch48_eplb0_mtp0_1920.yaml" decode: num-worker: 5 tp: 8 ep: 8 dp-attn: true - conc-list: [4096] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4096.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4096.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4096.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [5152] + recipe: "dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx2_gen5_dep8_batch128_eplb0_mtp0_5152.yaml" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/1k1k/stp/ctx2_gen5_dep8_batch128_eplb0_mtp0_5152.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/1k1k/stp/ctx2_gen5_dep8_batch128_eplb0_mtp0_5152.yaml" decode: num-worker: 5 tp: 8 @@ -612,14 +534,12 @@ dsr1-fp8-b200-dynamo-trt: # MTP configurations - Low latency (TP attention) - spec-decoding: "mtp" conc-list: [8] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_8.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_8.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_8.yaml" decode: num-worker: 6 tp: 8 @@ -627,14 +547,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [8] + recipe: 
"dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen2_tp8_batch32_eplb0_mtp3_8.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/mtp/ctx1_gen2_tp8_batch32_eplb0_mtp3_8.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/mtp/ctx1_gen2_tp8_batch32_eplb0_mtp3_8.yaml" decode: num-worker: 2 tp: 8 @@ -642,14 +560,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [48] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_48.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_48.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_48.yaml" decode: num-worker: 6 tp: 8 @@ -657,14 +573,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [64] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch16_eplb0_mtp3_64.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/mtp/ctx1_gen4_tp8_batch16_eplb0_mtp3_64.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/mtp/ctx1_gen4_tp8_batch16_eplb0_mtp3_64.yaml" decode: num-worker: 4 tp: 8 @@ -673,14 +587,12 @@ dsr1-fp8-b200-dynamo-trt: # MTP configurations - High throughput (DP attention) - spec-decoding: "mtp" conc-list: [224] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx2_gen3_dep8_batch8_eplb0_mtp3_224.yaml" prefill: num-worker: 2 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/mtp/ctx2_gen3_dep8_batch8_eplb0_mtp3_224.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/mtp/ctx2_gen3_dep8_batch8_eplb0_mtp3_224.yaml" 
decode: num-worker: 3 tp: 8 @@ -688,14 +600,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [288] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx2_gen1_dep8_batch32_eplb0_mtp3_288.yaml" prefill: num-worker: 2 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/mtp/ctx2_gen1_dep8_batch32_eplb0_mtp3_288.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/mtp/ctx2_gen1_dep8_batch32_eplb0_mtp3_288.yaml" decode: num-worker: 1 tp: 8 @@ -703,14 +613,12 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [1088] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx4_gen1_dep8_batch128_eplb0_mtp2_1088.yaml" prefill: num-worker: 4 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/mtp/ctx4_gen1_dep8_batch128_eplb0_mtp2_1088.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/mtp/ctx4_gen1_dep8_batch128_eplb0_mtp2_1088.yaml" decode: num-worker: 1 tp: 8 @@ -719,56 +627,48 @@ dsr1-fp8-b200-dynamo-trt: # Non-MTP (STP) configurations - Low latency (TP attention) - conc-list: [1] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_tp8_batch1_eplb0_mtp0_1.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen1_tp8_batch1_eplb0_mtp0_1.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen1_tp8_batch1_eplb0_mtp0_1.yaml" decode: num-worker: 1 tp: 8 ep: 1 dp-attn: false - conc-list: [32] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_32.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_32.yaml - - 
"CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_32.yaml" decode: num-worker: 4 tp: 8 ep: 1 dp-attn: false - conc-list: [128] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_128.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_128.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_128.yaml" decode: num-worker: 4 tp: 8 ep: 1 dp-attn: false - conc-list: [96] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen6_tp8_batch16_eplb0_mtp0_96.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen6_tp8_batch16_eplb0_mtp0_96.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen6_tp8_batch16_eplb0_mtp0_96.yaml" decode: num-worker: 6 tp: 8 @@ -776,56 +676,48 @@ dsr1-fp8-b200-dynamo-trt: dp-attn: false # Non-MTP (STP) configurations - High throughput (DP attention) - conc-list: [128] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_dep8_batch128_eplb0_mtp0_128.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen1_dep8_batch128_eplb0_mtp0_128.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen1_dep8_batch128_eplb0_mtp0_128.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [128] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0_128.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0_128.yaml - - 
"CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0_128.yaml" decode: num-worker: 2 tp: 8 ep: 8 dp-attn: true - conc-list: [256] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_256.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_256.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_256.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [640] + recipe: "dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx2_gen1_dep8_batch640_eplb0_mtp0_640.yaml" prefill: num-worker: 2 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp8/8k1k/stp/ctx2_gen1_dep8_batch640_eplb0_mtp0_640.yaml - - "CONFIG_FILE=recipes/trtllm/b200-fp8/8k1k/stp/ctx2_gen1_dep8_batch640_eplb0_mtp0_640.yaml" decode: num-worker: 1 tp: 8 @@ -848,14 +740,12 @@ dsr1-fp4-b300-dynamo-trt: search-space: - spec-decoding: "mtp" conc-list: [654] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep8_batch64_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/mtp/ctx1_gen1_dep8_batch64_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/mtp/ctx1_gen1_dep8_batch64_eplb0_mtp3.yaml" decode: num-worker: 1 tp: 8 @@ -863,14 +753,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [271] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch16_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/mtp/ctx1_gen2_dep8_batch16_eplb0_mtp3.yaml - - 
"CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/mtp/ctx1_gen2_dep8_batch16_eplb0_mtp3.yaml" decode: num-worker: 2 tp: 8 @@ -878,14 +766,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [11] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml" decode: num-worker: 5 tp: 8 @@ -893,14 +779,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [10, 20, 25, 60, 120, 200] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml" decode: num-worker: 5 tp: 8 @@ -908,14 +792,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [2342] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx2_gen1_dep8_batch256_eplb0_mtp1.yaml" prefill: num-worker: 2 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/mtp/ctx2_gen1_dep8_batch256_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/mtp/ctx2_gen1_dep8_batch256_eplb0_mtp1.yaml" decode: num-worker: 1 tp: 8 @@ -923,14 +805,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [8609] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx5_gen2_dep8_batch512_eplb0_mtp1.yaml" prefill: num-worker: 5 tp: 2 ep: 2 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/mtp/ctx5_gen2_dep8_batch512_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/mtp/ctx5_gen2_dep8_batch512_eplb0_mtp1.yaml" decode: num-worker: 2 tp: 8 @@ -938,14 +818,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [12926] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx5_gen2_dep8_batch768_eplb0_mtp1.yaml" prefill: num-worker: 5 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/mtp/ctx5_gen2_dep8_batch768_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/mtp/ctx5_gen2_dep8_batch768_eplb0_mtp1.yaml" decode: num-worker: 2 tp: 8 @@ -954,98 +832,84 @@ dsr1-fp4-b300-dynamo-trt: # Non-MTP configurations - conc-list: [1176] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0.yaml" decode: num-worker: 2 tp: 8 ep: 8 dp-attn: true - conc-list: [6] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml" decode: num-worker: 4 tp: 8 ep: 8 dp-attn: false - conc-list: [5, 10, 15, 25] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 4 ep: 4 dp-attn: false - conc-list: [60, 110, 195, 395] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch64_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/stp/ctx1_gen5_tep8_batch64_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/stp/ctx1_gen5_tep8_batch64_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 8 ep: 8 dp-attn: false - conc-list: [4405] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx2_gen1_dep8_batch512_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/stp/ctx2_gen1_dep8_batch512_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/stp/ctx2_gen1_dep8_batch512_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [8192] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx3_gen1_dep8_batch1024_eplb0_mtp0.yaml" prefill: num-worker: 3 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/stp/ctx3_gen1_dep8_batch1024_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/stp/ctx3_gen1_dep8_batch1024_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [4611] + recipe: "dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx3_gen2_dep8_batch256_eplb0_mtp0.yaml" prefill: num-worker: 3 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/1k1k/stp/ctx3_gen2_dep8_batch256_eplb0_mtp0.yaml - - 
"CONFIG_FILE=recipes/trtllm/b300-fp4/1k1k/stp/ctx3_gen2_dep8_batch256_eplb0_mtp0.yaml" decode: num-worker: 2 tp: 8 @@ -1057,14 +921,12 @@ dsr1-fp4-b300-dynamo-trt: search-space: - spec-decoding: "mtp" conc-list: [2198] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml" prefill: num-worker: 10 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml" decode: num-worker: 1 tp: 8 @@ -1072,14 +934,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [52] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep4_batch8_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/mtp/ctx1_gen4_tep4_batch8_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/mtp/ctx1_gen4_tep4_batch8_eplb0_mtp3.yaml" decode: num-worker: 4 tp: 4 @@ -1087,14 +947,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [8] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml" decode: num-worker: 4 tp: 8 @@ -1102,14 +960,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [32] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml" decode: num-worker: 4 tp: 8 @@ -1117,14 +973,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [181] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx3_gen1_dep8_batch16_eplb0_mtp3.yaml" prefill: num-worker: 3 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/mtp/ctx3_gen1_dep8_batch16_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/mtp/ctx3_gen1_dep8_batch16_eplb0_mtp3.yaml" decode: num-worker: 1 tp: 8 @@ -1132,14 +986,12 @@ dsr1-fp4-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [1197] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx9_gen1_dep8_batch128_eplb0_mtp1.yaml" prefill: num-worker: 9 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/mtp/ctx9_gen1_dep8_batch128_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/mtp/ctx9_gen1_dep8_batch128_eplb0_mtp1.yaml" decode: num-worker: 1 tp: 8 @@ -1148,98 +1000,84 @@ dsr1-fp4-b300-dynamo-trt: # Non-MTP configurations - conc-list: [105] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep4_batch32_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/stp/ctx1_gen3_tep4_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/stp/ctx1_gen3_tep4_batch32_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 4 ep: 4 dp-attn: false - conc-list: [63] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - 
additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 8 ep: 8 dp-attn: false - conc-list: [4] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 8 ep: 8 dp-attn: false - conc-list: [12] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen4_tep4_batch2_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/stp/ctx1_gen4_tep4_batch2_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/stp/ctx1_gen4_tep4_batch2_eplb0_mtp0.yaml" decode: num-worker: 4 tp: 4 ep: 4 dp-attn: false - conc-list: [589] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx5_gen2_dep8_batch32_eplb0_mtp0.yaml" prefill: num-worker: 5 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/stp/ctx5_gen2_dep8_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/stp/ctx5_gen2_dep8_batch32_eplb0_mtp0.yaml" decode: num-worker: 2 tp: 8 ep: 8 dp-attn: true - conc-list: [1093] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx6_gen1_dep8_batch128_eplb0_mtp0.yaml" prefill: num-worker: 6 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/stp/ctx6_gen1_dep8_batch128_eplb0_mtp0.yaml - - 
"CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/stp/ctx6_gen1_dep8_batch128_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [2048] + recipe: "dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx8_gen1_dep8_batch256_eplb0_mtp0.yaml" prefill: num-worker: 8 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp4/8k1k/stp/ctx8_gen1_dep8_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp4/8k1k/stp/ctx8_gen1_dep8_batch256_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 @@ -1262,14 +1100,12 @@ dsr1-fp8-b300-dynamo-trt: search-space: - spec-decoding: "mtp" conc-list: [10] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_10.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_10.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_10.yaml" decode: num-worker: 8 tp: 8 @@ -1277,14 +1113,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [160] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch16_eplb0_mtp3_160.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen8_tp8_batch16_eplb0_mtp3_160.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen8_tp8_batch16_eplb0_mtp3_160.yaml" decode: num-worker: 8 tp: 8 @@ -1292,14 +1126,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [3072] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml" decode: num-worker: 1 tp: 8 @@ -1307,14 +1139,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [2560] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen2_dep8_batch128_eplb0_mtp1_2560.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen2_dep8_batch128_eplb0_mtp1_2560.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen2_dep8_batch128_eplb0_mtp1_2560.yaml" decode: num-worker: 2 tp: 8 @@ -1322,14 +1152,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [720] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp2_720.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp2_720.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp2_720.yaml" decode: num-worker: 5 tp: 8 @@ -1337,14 +1165,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [11264] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx3_gen2_dp8_batch512_eplb0_mtp1_11264.yaml" prefill: num-worker: 3 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/mtp/ctx3_gen2_dp8_batch512_eplb0_mtp1_11264.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/mtp/ctx3_gen2_dp8_batch512_eplb0_mtp1_11264.yaml" decode: num-worker: 2 tp: 8 @@ -1355,98 +1181,84 @@ dsr1-fp8-b300-dynamo-trt: osl: 1024 search-space: - conc-list: [2112] + recipe: 
"dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_2112.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_2112.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_2112.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [3072] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen2_dp8_batch128_eplb0_mtp0_3072.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen2_dp8_batch128_eplb0_mtp0_3072.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen2_dp8_batch128_eplb0_mtp0_3072.yaml" decode: num-worker: 2 tp: 8 ep: 1 dp-attn: true - conc-list: [1280] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen3_dp8_batch48_eplb0_mtp0_1280.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen3_dp8_batch48_eplb0_mtp0_1280.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen3_dp8_batch48_eplb0_mtp0_1280.yaml" decode: num-worker: 3 tp: 8 ep: 1 dp-attn: true - conc-list: [12] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_12.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_12.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_12.yaml" decode: num-worker: 8 tp: 8 ep: 1 dp-attn: false - conc-list: [128] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_128.yaml" prefill: num-worker: 1 tp: 4 ep: 1 
dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_128.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_128.yaml" decode: num-worker: 8 tp: 8 ep: 1 dp-attn: false - conc-list: [384] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_384.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_384.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_384.yaml" decode: num-worker: 8 tp: 8 ep: 1 dp-attn: false - conc-list: [16384] + recipe: "dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx2_gen1_dp8_batch1024_eplb0_mtp0_16384.yaml" prefill: num-worker: 2 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/1k1k/stp/ctx2_gen1_dp8_batch1024_eplb0_mtp0_16384.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/1k1k/stp/ctx2_gen1_dp8_batch1024_eplb0_mtp0_16384.yaml" decode: num-worker: 1 tp: 8 @@ -1458,14 +1270,12 @@ dsr1-fp8-b300-dynamo-trt: search-space: - spec-decoding: "mtp" conc-list: [40] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen2_tp8_batch16_eplb0_mtp3_40.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/mtp/ctx1_gen2_tp8_batch16_eplb0_mtp3_40.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/mtp/ctx1_gen2_tp8_batch16_eplb0_mtp3_40.yaml" decode: num-worker: 2 tp: 8 @@ -1473,14 +1283,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [8] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch1_eplb0_mtp3_8.yaml" prefill: num-worker: 1 tp: 4 ep: 1 
dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/mtp/ctx1_gen4_tp8_batch1_eplb0_mtp3_8.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/mtp/ctx1_gen4_tp8_batch1_eplb0_mtp3_8.yaml" decode: num-worker: 4 tp: 8 @@ -1488,14 +1296,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [20] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch4_eplb0_mtp3_20.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/mtp/ctx1_gen4_tp8_batch4_eplb0_mtp3_20.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/mtp/ctx1_gen4_tp8_batch4_eplb0_mtp3_20.yaml" decode: num-worker: 4 tp: 8 @@ -1503,14 +1309,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [72] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen1_dp8_batch8_eplb0_mtp3_72.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/mtp/ctx1_gen1_dp8_batch8_eplb0_mtp3_72.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/mtp/ctx1_gen1_dp8_batch8_eplb0_mtp3_72.yaml" decode: num-worker: 1 tp: 8 @@ -1518,14 +1322,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [144] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx2_gen1_dp8_batch16_eplb0_mtp3_144.yaml" prefill: num-worker: 2 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/mtp/ctx2_gen1_dp8_batch16_eplb0_mtp3_144.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/mtp/ctx2_gen1_dp8_batch16_eplb0_mtp3_144.yaml" decode: num-worker: 1 tp: 8 @@ -1533,14 +1335,12 @@ dsr1-fp8-b300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [512] + recipe: 
"dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx4_gen1_dp8_batch64_eplb0_mtp2_512.yaml" prefill: num-worker: 4 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/mtp/ctx4_gen1_dp8_batch64_eplb0_mtp2_512.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/mtp/ctx4_gen1_dp8_batch64_eplb0_mtp2_512.yaml" decode: num-worker: 1 tp: 8 @@ -1551,98 +1351,84 @@ dsr1-fp8-b300-dynamo-trt: osl: 1024 search-space: - conc-list: [64] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch16_eplb0_mtp0_64.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/stp/ctx1_gen4_tp8_batch16_eplb0_mtp0_64.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/stp/ctx1_gen4_tp8_batch16_eplb0_mtp0_64.yaml" decode: num-worker: 4 tp: 8 ep: 1 dp-attn: false - conc-list: [16] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx1_gen8_tp8_batch2_eplb0_mtp0_16.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/stp/ctx1_gen8_tp8_batch2_eplb0_mtp0_16.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/stp/ctx1_gen8_tp8_batch2_eplb0_mtp0_16.yaml" decode: num-worker: 8 tp: 8 ep: 1 dp-attn: false - conc-list: [256] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx2_gen1_dp8_batch32_eplb0_mtp0_256.yaml" prefill: num-worker: 2 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/stp/ctx2_gen1_dp8_batch32_eplb0_mtp0_256.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/stp/ctx2_gen1_dp8_batch32_eplb0_mtp0_256.yaml" decode: num-worker: 1 tp: 8 ep: 1 dp-attn: true - conc-list: [512] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx3_gen1_dp8_batch64_eplb0_mtp0_512.yaml" prefill: 
num-worker: 3 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/stp/ctx3_gen1_dp8_batch64_eplb0_mtp0_512.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/stp/ctx3_gen1_dp8_batch64_eplb0_mtp0_512.yaml" decode: num-worker: 1 tp: 8 ep: 1 dp-attn: true - conc-list: [256] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx3_gen5_tp8_batch64_eplb0_mtp0_256.yaml" prefill: num-worker: 3 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/stp/ctx3_gen5_tp8_batch64_eplb0_mtp0_256.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/stp/ctx3_gen5_tp8_batch64_eplb0_mtp0_256.yaml" decode: num-worker: 5 tp: 8 ep: 1 dp-attn: false - conc-list: [1075] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx5_gen1_dp8_batch128_eplb0_mtp0_1075.yaml" prefill: num-worker: 5 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/stp/ctx5_gen1_dp8_batch128_eplb0_mtp0_1075.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/stp/ctx5_gen1_dp8_batch128_eplb0_mtp0_1075.yaml" decode: num-worker: 1 tp: 8 ep: 1 dp-attn: true - conc-list: [3072] + recipe: "dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx7_gen1_dep8_batch384_eplb0_mtp0_3072.yaml" prefill: num-worker: 7 tp: 4 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b300-fp8/8k1k/stp/ctx7_gen1_dep8_batch384_eplb0_mtp0_3072.yaml - - "CONFIG_FILE=recipes/trtllm/b300-fp8/8k1k/stp/ctx7_gen1_dep8_batch384_eplb0_mtp0_3072.yaml" decode: num-worker: 1 tp: 8 @@ -2676,14 +2462,12 @@ dsr1-fp8-h200-dynamo-trt: # MTP configurations - spec-decoding: "mtp" conc-list: [1] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c1_ctx1_gen11_tep8_batch1_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - 
additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/mtp/c1_ctx1_gen11_tep8_batch1_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/mtp/c1_ctx1_gen11_tep8_batch1_eplb0_mtp3.yaml" decode: num-worker: 11 tp: 8 @@ -2691,14 +2475,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [4] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c4_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/mtp/c4_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/mtp/c4_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml" decode: num-worker: 11 tp: 8 @@ -2706,14 +2488,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [8] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c8_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/mtp/c8_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/mtp/c8_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml" decode: num-worker: 11 tp: 8 @@ -2721,14 +2501,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [16] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c16_ctx1_gen9_tep8_batch128_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/mtp/c16_ctx1_gen9_tep8_batch128_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/mtp/c16_ctx1_gen9_tep8_batch128_eplb0_mtp3.yaml" decode: num-worker: 9 tp: 8 @@ -2736,14 +2514,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [32] + recipe: 
"dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c32_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/mtp/c32_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/mtp/c32_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml" decode: num-worker: 11 tp: 8 @@ -2751,14 +2527,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [64] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c64_ctx1_gen8_dep8_batch128_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/mtp/c64_ctx1_gen8_dep8_batch128_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/mtp/c64_ctx1_gen8_dep8_batch128_eplb0_mtp3.yaml" decode: num-worker: 8 tp: 8 @@ -2766,14 +2540,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [128] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c128_ctx1_gen7_dep8_batch128_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/mtp/c128_ctx1_gen7_dep8_batch128_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/mtp/c128_ctx1_gen7_dep8_batch128_eplb0_mtp3.yaml" decode: num-worker: 7 tp: 8 @@ -2781,14 +2553,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [256] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c256_ctx1_gen4_dep8_batch128_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/mtp/c256_ctx1_gen4_dep8_batch128_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/mtp/c256_ctx1_gen4_dep8_batch128_eplb0_mtp3.yaml" decode: num-worker: 
4 tp: 8 @@ -2796,14 +2566,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [512] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c512_ctx1_gen2_dep8_batch256_eplb0_mtp1.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/mtp/c512_ctx1_gen2_dep8_batch256_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/mtp/c512_ctx1_gen2_dep8_batch256_eplb0_mtp1.yaml" decode: num-worker: 2 tp: 8 @@ -2811,126 +2579,108 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: true # Non-MTP configurations (STP) - conc-list: [1] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c1_ctx1_gen9_tep8_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/stp/c1_ctx1_gen9_tep8_batch1_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/stp/c1_ctx1_gen9_tep8_batch1_eplb0_mtp0.yaml" decode: num-worker: 9 tp: 8 ep: 8 dp-attn: false - conc-list: [4] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c4_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/stp/c4_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/stp/c4_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" decode: num-worker: 9 tp: 8 ep: 8 dp-attn: false - conc-list: [8] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c8_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/stp/c8_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/stp/c8_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" decode: num-worker: 9 tp: 8 ep: 8 
dp-attn: false - conc-list: [16] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c16_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/stp/c16_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/stp/c16_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" decode: num-worker: 9 tp: 8 ep: 8 dp-attn: false - conc-list: [32] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c32_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/stp/c32_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/stp/c32_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" decode: num-worker: 9 tp: 8 ep: 8 dp-attn: false - conc-list: [64] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c64_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/stp/c64_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/stp/c64_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml" decode: num-worker: 9 tp: 8 ep: 8 dp-attn: false - conc-list: [128] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c128_ctx1_gen9_dep8_batch512_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/stp/c128_ctx1_gen9_dep8_batch512_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/stp/c128_ctx1_gen9_dep8_batch512_eplb0_mtp0.yaml" decode: num-worker: 9 tp: 8 ep: 8 dp-attn: true - conc-list: [256] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c256_ctx1_gen6_dep8_batch512_eplb0_mtp0.yaml" prefill: 
num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/stp/c256_ctx1_gen6_dep8_batch512_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/stp/c256_ctx1_gen6_dep8_batch512_eplb0_mtp0.yaml" decode: num-worker: 6 tp: 8 ep: 8 dp-attn: true - conc-list: [512] + recipe: "dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c512_ctx2_gen7_dep8_batch512_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/1k1k/stp/c512_ctx2_gen7_dep8_batch512_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/1k1k/stp/c512_ctx2_gen7_dep8_batch512_eplb0_mtp0.yaml" decode: num-worker: 7 tp: 8 @@ -2942,14 +2692,12 @@ dsr1-fp8-h200-dynamo-trt: # MTP configurations - spec-decoding: "mtp" conc-list: [1] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/mtp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/mtp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp3.yaml" decode: num-worker: 7 tp: 8 @@ -2957,14 +2705,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [4] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/mtp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/mtp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp3.yaml" decode: num-worker: 7 tp: 8 @@ -2972,14 +2718,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [8] + recipe: 
"dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c8_ctx1_gen6_tep8_batch32_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/mtp/c8_ctx1_gen6_tep8_batch32_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/mtp/c8_ctx1_gen6_tep8_batch32_eplb0_mtp3.yaml" decode: num-worker: 6 tp: 8 @@ -2987,14 +2731,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [16] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp2.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/mtp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp2.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/mtp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp2.yaml" decode: num-worker: 3 tp: 8 @@ -3002,14 +2744,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [32] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c32_ctx3_gen5_tep8_batch32_eplb0_mtp3.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/mtp/c32_ctx3_gen5_tep8_batch32_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/mtp/c32_ctx3_gen5_tep8_batch32_eplb0_mtp3.yaml" decode: num-worker: 5 tp: 8 @@ -3017,14 +2757,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [64] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c64_ctx1_gen1_dep8_batch32_eplb0_mtp2.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/mtp/c64_ctx1_gen1_dep8_batch32_eplb0_mtp2.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/mtp/c64_ctx1_gen1_dep8_batch32_eplb0_mtp2.yaml" decode: num-worker: 1 tp: 8 @@ -3032,14 
+2770,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [128] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c128_ctx2_gen1_dep8_batch32_eplb0_mtp2.yaml" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/mtp/c128_ctx2_gen1_dep8_batch32_eplb0_mtp2.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/mtp/c128_ctx2_gen1_dep8_batch32_eplb0_mtp2.yaml" decode: num-worker: 1 tp: 8 @@ -3047,14 +2783,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [256] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c256_ctx3_gen1_dep8_batch32_eplb0_mtp2.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/mtp/c256_ctx3_gen1_dep8_batch32_eplb0_mtp2.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/mtp/c256_ctx3_gen1_dep8_batch32_eplb0_mtp2.yaml" decode: num-worker: 1 tp: 8 @@ -3062,14 +2796,12 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [512] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c512_ctx3_gen1_dep8_batch64_eplb0_mtp1.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/mtp/c512_ctx3_gen1_dep8_batch64_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/mtp/c512_ctx3_gen1_dep8_batch64_eplb0_mtp1.yaml" decode: num-worker: 1 tp: 8 @@ -3077,126 +2809,108 @@ dsr1-fp8-h200-dynamo-trt: dp-attn: true # Non-MTP configurations (STP) - conc-list: [1] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/stp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp0.yaml - - 
"CONFIG_FILE=recipes/trtllm/h200/8k1k/stp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp0.yaml" decode: num-worker: 7 tp: 8 ep: 8 dp-attn: false - conc-list: [4] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/stp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/stp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp0.yaml" decode: num-worker: 7 tp: 8 ep: 8 dp-attn: false - conc-list: [8] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c8_ctx1_gen6_tep8_batch16_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/stp/c8_ctx1_gen6_tep8_batch16_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/stp/c8_ctx1_gen6_tep8_batch16_eplb0_mtp0.yaml" decode: num-worker: 6 tp: 8 ep: 8 dp-attn: false - conc-list: [16] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/stp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/stp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 8 ep: 8 dp-attn: false - conc-list: [32] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c32_ctx2_gen5_tep8_batch128_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/stp/c32_ctx2_gen5_tep8_batch128_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/stp/c32_ctx2_gen5_tep8_batch128_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 8 ep: 8 dp-attn: false - conc-list: [64] + recipe: 
"dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c64_ctx2_gen3_dep8_batch128_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/stp/c64_ctx2_gen3_dep8_batch128_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/stp/c64_ctx2_gen3_dep8_batch128_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 8 ep: 8 dp-attn: true - conc-list: [128] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c128_ctx1_gen1_dep8_batch256_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/stp/c128_ctx1_gen1_dep8_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/stp/c128_ctx1_gen1_dep8_batch256_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [256] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c256_ctx5_gen3_dep8_batch256_eplb0_mtp0.yaml" prefill: num-worker: 5 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/stp/c256_ctx5_gen3_dep8_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/stp/c256_ctx5_gen3_dep8_batch256_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 8 ep: 8 dp-attn: true - conc-list: [512] + recipe: "dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c512_ctx3_gen1_dep8_batch512_eplb0_mtp0.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h200/8k1k/stp/c512_ctx3_gen1_dep8_batch512_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h200/8k1k/stp/c512_ctx3_gen1_dep8_batch512_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 @@ -3219,14 +2933,12 @@ dsr1-fp8-h100-dynamo-trt: # MTP configurations - spec-decoding: "mtp" conc-list: [6] + recipe: 
"dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3234,14 +2946,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [9] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3249,14 +2959,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [30] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3264,14 +2972,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [60] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch16_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch16_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch16_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3279,14 
+2985,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [117] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch32_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch32_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch32_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3294,14 +2998,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [231] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_dep16_batch4_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_dep16_batch4_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_dep16_batch4_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3309,14 +3011,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [462] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch128_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch128_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen3_tep16_batch128_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3324,14 +3024,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [615] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp2.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp2.yaml - - 
"CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp2.yaml" decode: num-worker: 1 tp: 16 @@ -3339,14 +3037,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [1229] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1.yaml" decode: num-worker: 1 tp: 16 @@ -3354,126 +3050,108 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: true # Non-MTP configurations (STP) - conc-list: [6] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: false - conc-list: [9] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: false - conc-list: [30] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml - - 
"CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: false - conc-list: [60] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch16_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_tep16_batch16_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_tep16_batch16_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: false - conc-list: [231] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch4_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_dep16_batch4_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_dep16_batch4_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: true - conc-list: [462] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch8_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_dep16_batch8_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_dep16_batch8_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: true - conc-list: [924] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch16_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_dep16_batch16_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_dep16_batch16_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: true - conc-list: 
[1845] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch32_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_dep16_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/stp/ctx1_gen3_dep16_batch32_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: true - conc-list: [4916] + recipe: "dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx2_gen1_dep16_batch256_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/1k1k/stp/ctx2_gen1_dep16_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/1k1k/stp/ctx2_gen1_dep16_batch256_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 16 @@ -3485,14 +3163,12 @@ dsr1-fp8-h100-dynamo-trt: # MTP configurations (6 points) - spec-decoding: "mtp" conc-list: [6] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3500,14 +3176,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [9] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3515,14 +3189,12 @@ 
dsr1-fp8-h100-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [30] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml" decode: num-worker: 3 tp: 16 @@ -3530,14 +3202,12 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [77] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen1_dep16_batch4_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen1_dep16_batch4_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen1_dep16_batch4_eplb0_mtp3.yaml" decode: num-worker: 1 tp: 16 @@ -3547,14 +3217,12 @@ dsr1-fp8-h100-dynamo-trt: # https://github.com/InferenceMAX/InferenceMAX/actions/runs/21769314582/job/62813105509 # - spec-decoding: "mtp" # conc-list: [78] + # recipe: "trtllm/h100-fp8/8k1k/mtp/ctx1_gen2_tep16_batch32_eplb0_mtp3.yaml" # prefill: # num-worker: 1 # tp: 16 # ep: 16 # dp-attn: true - # additional-settings: - # # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen2_tep16_batch32_eplb0_mtp3.yaml - # - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/mtp/ctx1_gen2_tep16_batch32_eplb0_mtp3.yaml" # decode: # num-worker: 2 # tp: 16 @@ -3562,14 +3230,12 @@ dsr1-fp8-h100-dynamo-trt: # dp-attn: false - spec-decoding: "mtp" conc-list: [154] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx2_gen1_dep16_batch8_eplb0_mtp3.yaml" prefill: num-worker: 2 tp: 16 ep: 16 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/mtp/ctx2_gen1_dep16_batch8_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/mtp/ctx2_gen1_dep16_batch8_eplb0_mtp3.yaml" decode: num-worker: 1 tp: 16 @@ -3577,70 +3243,60 @@ dsr1-fp8-h100-dynamo-trt: dp-attn: true # STP configurations (5 points) - conc-list: [6] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: false - conc-list: [9] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: false - conc-list: [30] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml" decode: num-worker: 3 tp: 16 ep: 16 dp-attn: false - conc-list: [154] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen2_tep16_batch64_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 16 ep: 16 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/stp/ctx1_gen2_tep16_batch64_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/stp/ctx1_gen2_tep16_batch64_eplb0_mtp0.yaml" decode: num-worker: 2 tp: 16 ep: 16 dp-attn: false - conc-list: [308] + recipe: "dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx2_gen1_dep16_batch16_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 16 ep: 16 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/h100-fp8/8k1k/stp/ctx2_gen1_dep16_batch16_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/h100-fp8/8k1k/stp/ctx2_gen1_dep16_batch16_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 16 @@ -3860,13 +3516,12 @@ dsr1-fp8-h100-dynamo-sglang: search-space: # # STP: Max throughput TEP (1 prefill, 2 decode) # - conc-list: [1, 2, 4, 8, 16, 32, 64, 128] + # recipe: "h100/1k1k/stp/h100-fp8-1p2d-max-tp.yaml" # prefill: # num-worker: 1 # tp: 16 # ep: 1 # dp-attn: false - # additional-settings: - # - "CONFIG_FILE=recipes/h100/1k1k/stp/h100-fp8-1p2d-max-tp.yaml" # decode: # num-worker: 2 # tp: 16 @@ -3874,13 +3529,12 @@ dsr1-fp8-h100-dynamo-sglang: # dp-attn: false # # STP: Max throughput DEP (1 prefill, 1 decode, dp-attention) # - conc-list: [1, 2, 4, 8, 16, 32, 64] + # recipe: "h100/1k1k/stp/h100-fp8-1p1d-max-dep.yaml" # prefill: # num-worker: 1 # tp: 16 # ep: 1 # dp-attn: false - # additional-settings: - # - "CONFIG_FILE=recipes/h100/1k1k/stp/h100-fp8-1p1d-max-dep.yaml" # decode: # num-worker: 1 # tp: 16 @@ -3889,13 +3543,12 @@ dsr1-fp8-h100-dynamo-sglang: # MTP: Max throughput TEP (1 prefill, 2 decode) - spec-decoding: "mtp" conc-list: [1, 2, 4, 8, 16, 32, 64, 128] + recipe: "dsr1/sglang/h100-fp8/1k1k/disagg/mtp/h100-fp8-1p2d-max-tp-mtp.yaml" prefill: num-worker: 1 tp: 16 ep: 1 dp-attn: false - additional-settings: - - "CONFIG_FILE=recipes/h100/1k1k/mtp/h100-fp8-1p2d-max-tp-mtp.yaml" decode: num-worker: 2 tp: 16 @@ -3904,13 +3557,12 @@ 
dsr1-fp8-h100-dynamo-sglang: # MTP: Max throughput DEP (1 prefill, 1 decode, dp-attention) - spec-decoding: "mtp" conc-list: [1, 2, 4, 8, 16, 32, 64] + recipe: "dsr1/sglang/h100-fp8/1k1k/disagg/mtp/h100-fp8-1p1d-max-dep-mtp.yaml" prefill: num-worker: 1 tp: 16 ep: 1 dp-attn: false - additional-settings: - - "CONFIG_FILE=recipes/h100/1k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml" decode: num-worker: 1 tp: 16 @@ -3921,13 +3573,12 @@ dsr1-fp8-h100-dynamo-sglang: search-space: # # STP: Max throughput TEP (1 prefill, 1 decode) # - conc-list: [1, 2, 4, 8, 16, 32, 64, 128] + # recipe: "h100/8k1k/stp/h100-fp8-1p1d-max-tp.yaml" # prefill: # num-worker: 1 # tp: 16 # ep: 1 # dp-attn: false - # additional-settings: - # - "CONFIG_FILE=recipes/h100/8k1k/stp/h100-fp8-1p1d-max-tp.yaml" # decode: # num-worker: 1 # tp: 16 @@ -3935,13 +3586,12 @@ dsr1-fp8-h100-dynamo-sglang: # dp-attn: false # # STP: Max throughput DEP (1 prefill, 1 decode, dp-attention) # - conc-list: [1, 2, 4, 8, 16, 32, 64] + # recipe: "h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml" # prefill: # num-worker: 1 # tp: 16 # ep: 1 # dp-attn: false - # additional-settings: - # - "CONFIG_FILE=recipes/h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml" # decode: # num-worker: 1 # tp: 16 @@ -3950,13 +3600,12 @@ dsr1-fp8-h100-dynamo-sglang: # MTP: Max throughput TEP (1 prefill, 1 decode) - spec-decoding: "mtp" conc-list: [1, 2, 4, 8, 16, 32, 64, 128] + recipe: "dsr1/sglang/h100-fp8/8k1k/disagg/mtp/h100-fp8-1p1d-max-tp-mtp.yaml" prefill: num-worker: 1 tp: 16 ep: 1 dp-attn: false - additional-settings: - - "CONFIG_FILE=recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-tp-mtp.yaml" decode: num-worker: 1 tp: 16 @@ -3965,13 +3614,12 @@ dsr1-fp8-h100-dynamo-sglang: # MTP: Max throughput DEP (1 prefill, 1 decode, dp-attention) - spec-decoding: "mtp" conc-list: [1, 2, 4, 8, 16, 32, 64] + recipe: "dsr1/sglang/h100-fp8/8k1k/disagg/mtp/h100-fp8-1p1d-max-dep-mtp.yaml" prefill: num-worker: 1 tp: 16 ep: 1 dp-attn: false - additional-settings: - - 
"CONFIG_FILE=recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml" decode: num-worker: 1 tp: 16 @@ -4061,14 +3709,12 @@ dsr1-fp4-gb200-dynamo-trt: # MTP configurations (spec_decoding="mtp") - spec-decoding: "mtp" conc-list: [ 180 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3.yaml" decode: num-worker: 1 tp: 32 @@ -4076,14 +3722,12 @@ dsr1-fp4-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [ 4, 8, 12, 24, 48 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml" decode: num-worker: 4 tp: 8 @@ -4091,14 +3735,12 @@ dsr1-fp4-gb200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [ 4301 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx2_gen1_dep16_batch256_eplb256_mtp1.yaml" prefill: num-worker: 2 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/mtp/ctx2_gen1_dep16_batch256_eplb256_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/mtp/ctx2_gen1_dep16_batch256_eplb256_mtp1.yaml" decode: num-worker: 1 tp: 16 @@ -4106,14 +3748,12 @@ dsr1-fp4-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [ 2253 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx3_gen1_dep32_batch64_eplb288_mtp1.yaml" prefill: num-worker: 3 tp: 4 ep: 4 dp-attn: true - 
additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/mtp/ctx3_gen1_dep32_batch64_eplb288_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/mtp/ctx3_gen1_dep32_batch64_eplb288_mtp1.yaml" decode: num-worker: 1 tp: 32 @@ -4121,14 +3761,12 @@ dsr1-fp4-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [ 16130 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx3_gen5_dep4_batch768_eplb0_mtp1.yaml" prefill: num-worker: 3 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/mtp/ctx3_gen5_dep4_batch768_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/mtp/ctx3_gen5_dep4_batch768_eplb0_mtp1.yaml" decode: num-worker: 5 tp: 4 @@ -4138,98 +3776,84 @@ dsr1-fp4-gb200-dynamo-trt: # Non-MTP configurations (default spec_decoding="none") - conc-list: [ 4301 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [ 666 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen1_dep32_batch16_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen1_dep32_batch16_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen1_dep32_batch16_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 32 ep: 32 dp-attn: true - conc-list: [ 6144 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen2_dep4_batch768_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 
ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen2_dep4_batch768_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen2_dep4_batch768_eplb0_mtp0.yaml" decode: num-worker: 2 tp: 4 ep: 4 dp-attn: true - conc-list: [ 12, 24, 48, 96, 192 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml" decode: num-worker: 4 tp: 8 ep: 8 dp-attn: false - conc-list: [ 5 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml" decode: num-worker: 4 tp: 8 ep: 8 dp-attn: false - conc-list: [ 4301 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2_gen1_dep16_batch256_eplb256_mtp0.yaml" prefill: num-worker: 2 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/stp/ctx2_gen1_dep16_batch256_eplb256_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/stp/ctx2_gen1_dep16_batch256_eplb256_mtp0.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [ 2253 ] + recipe: "dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2_gen1_dep32_batch64_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 4 ep: 4 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/1k1k/stp/ctx2_gen1_dep32_batch64_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/1k1k/stp/ctx2_gen1_dep32_batch64_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 32 @@ -4242,14 +3866,12 @@ dsr1-fp4-gb200-dynamo-trt: # MTP configurations (spec_decoding="mtp") - spec-decoding: "mtp" conc-list: [ 4, 8, 12, 24, 48 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml" decode: num-worker: 4 tp: 8 @@ -4257,14 +3879,12 @@ dsr1-fp4-gb200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [ 180 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx3_gen1_dep32_batch4_eplb0_mtp3.yaml" prefill: num-worker: 3 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/mtp/ctx3_gen1_dep32_batch4_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/mtp/ctx3_gen1_dep32_batch4_eplb0_mtp3.yaml" decode: num-worker: 1 tp: 32 @@ -4272,14 +3892,12 @@ dsr1-fp4-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [ 1229 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx7_gen1_dep16_batch64_eplb256_mtp1.yaml" prefill: num-worker: 7 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/mtp/ctx7_gen1_dep16_batch64_eplb256_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/mtp/ctx7_gen1_dep16_batch64_eplb256_mtp1.yaml" decode: num-worker: 1 tp: 16 @@ -4287,14 +3905,12 @@ dsr1-fp4-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [ 666 ] + 
recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx8_gen1_dep32_batch16_eplb0_mtp3.yaml" prefill: num-worker: 8 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/mtp/ctx8_gen1_dep32_batch16_eplb0_mtp3.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/mtp/ctx8_gen1_dep32_batch16_eplb0_mtp3.yaml" decode: num-worker: 1 tp: 32 @@ -4302,14 +3918,12 @@ dsr1-fp4-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [ 4301 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx11_gen1_dep16_batch256_eplb256_mtp1.yaml" prefill: num-worker: 11 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/mtp/ctx11_gen1_dep16_batch256_eplb256_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/mtp/ctx11_gen1_dep16_batch256_eplb256_mtp1.yaml" decode: num-worker: 1 tp: 16 @@ -4318,84 +3932,72 @@ dsr1-fp4-gb200-dynamo-trt: # Non-MTP configurations (default spec_decoding="none") - conc-list: [ 12, 44, 76 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0.yaml" decode: num-worker: 4 tp: 8 ep: 8 dp-attn: false - conc-list: [ 5 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml" decode: num-worker: 4 tp: 8 ep: 8 dp-attn: false 
- conc-list: [ 333 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 32 ep: 32 dp-attn: true - conc-list: [ 1229 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx7_gen1_dep32_batch32_eplb0_mtp0.yaml" prefill: num-worker: 7 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/stp/ctx7_gen1_dep32_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/stp/ctx7_gen1_dep32_batch32_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 32 ep: 32 dp-attn: true - conc-list: [ 2253 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx8_gen1_dep16_batch128_eplb0_mtp0.yaml" prefill: num-worker: 8 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/stp/ctx8_gen1_dep16_batch128_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/stp/ctx8_gen1_dep16_batch128_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [ 4096 ] + recipe: "dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx10_gen1_dep16_batch256_eplb256_mtp0.yaml" prefill: num-worker: 10 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp4/8k1k/stp/ctx10_gen1_dep16_batch256_eplb256_mtp0.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp4/8k1k/stp/ctx10_gen1_dep16_batch256_eplb256_mtp0.yaml" decode: num-worker: 1 tp: 16 @@ -4419,14 +4021,12 @@ dsr1-fp8-gb200-dynamo-trt: search-space: - spec-decoding: "mtp" conc-list: [4301] + recipe: 
"dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep8_batch512_eplb0_mtp1_4301.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen1_dep8_batch512_eplb0_mtp1_4301.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen1_dep8_batch512_eplb0_mtp1_4301.yaml" decode: num-worker: 1 tp: 8 @@ -4434,14 +4034,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [2151] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep8_batch256_eplb0_mtp1_2151.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen1_dep8_batch256_eplb0_mtp1_2151.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen1_dep8_batch256_eplb0_mtp1_2151.yaml" decode: num-worker: 1 tp: 8 @@ -4449,14 +4047,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [1229] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1_1229.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1_1229.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1_1229.yaml" decode: num-worker: 1 tp: 16 @@ -4464,14 +4060,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [615] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep32_batch16_eplb0_mtp3_615.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen1_dep32_batch16_eplb0_mtp3_615.yaml - - 
"CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen1_dep32_batch16_eplb0_mtp3_615.yaml" decode: num-worker: 1 tp: 32 @@ -4479,14 +4073,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [36] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3_36.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3_36.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3_36.yaml" decode: num-worker: 3 tp: 8 @@ -4494,14 +4086,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [18] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_18.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_18.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_18.yaml" decode: num-worker: 3 tp: 8 @@ -4509,14 +4099,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [9] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_9.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_9.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_9.yaml" decode: num-worker: 3 tp: 8 @@ -4524,98 +4112,84 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: false # 1k1k STP configs - conc-list: [6144] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch768_eplb0_mtp0_6144.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep8_batch768_eplb0_mtp0_6144.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep8_batch768_eplb0_mtp0_6144.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [4301] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4301.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4301.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4301.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [2151] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep16_batch128_eplb0_mtp0_2151.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep16_batch128_eplb0_mtp0_2151.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep16_batch128_eplb0_mtp0_2151.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [1127] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep32_batch32_eplb0_mtp0_1127.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep32_batch32_eplb0_mtp0_1127.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep32_batch32_eplb0_mtp0_1127.yaml" decode: num-worker: 1 tp: 32 ep: 32 dp-attn: true - conc-list: [256] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep32_batch8_eplb0_mtp0_256.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep32_batch8_eplb0_mtp0_256.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen1_dep32_batch8_eplb0_mtp0_256.yaml" decode: num-worker: 1 tp: 32 ep: 32 dp-attn: true - conc-list: [27] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen3_tep8_batch8_eplb0_mtp0_27.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen3_tep8_batch8_eplb0_mtp0_27.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen3_tep8_batch8_eplb0_mtp0_27.yaml" decode: num-worker: 3 tp: 8 ep: 8 dp-attn: false - conc-list: [3] + recipe: "dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_3.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_3.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/1k1k/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_3.yaml" decode: num-worker: 3 tp: 8 @@ -4627,14 +4201,12 @@ dsr1-fp8-gb200-dynamo-trt: search-space: - spec-decoding: "mtp" conc-list: [666] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3_666.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3_666.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3_666.yaml" decode: num-worker: 1 tp: 8 @@ -4642,14 +4214,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [666] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx5_gen1_dep16_batch32_eplb0_mtp3_666.yaml" prefill: num-worker: 5 tp: 8 ep: 8 dp-attn: true - additional-settings: - 
# https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/mtp/ctx5_gen1_dep16_batch32_eplb0_mtp3_666.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/mtp/ctx5_gen1_dep16_batch32_eplb0_mtp3_666.yaml" decode: num-worker: 1 tp: 16 @@ -4657,14 +4227,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [333] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx3_gen1_dep16_batch16_eplb0_mtp3_333.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/mtp/ctx3_gen1_dep16_batch16_eplb0_mtp3_333.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/mtp/ctx3_gen1_dep16_batch16_eplb0_mtp3_333.yaml" decode: num-worker: 1 tp: 16 @@ -4672,14 +4240,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [333] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx4_gen1_dep32_batch8_eplb0_mtp3_333.yaml" prefill: num-worker: 4 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/mtp/ctx4_gen1_dep32_batch8_eplb0_mtp3_333.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/mtp/ctx4_gen1_dep32_batch8_eplb0_mtp3_333.yaml" decode: num-worker: 1 tp: 32 @@ -4687,14 +4253,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [90] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx2_gen1_dep32_batch2_eplb0_mtp3_90.yaml" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/mtp/ctx2_gen1_dep32_batch2_eplb0_mtp3_90.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/mtp/ctx2_gen1_dep32_batch2_eplb0_mtp3_90.yaml" decode: num-worker: 1 tp: 32 @@ -4702,14 +4266,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [15] + recipe: 
"dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_15.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_15.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_15.yaml" decode: num-worker: 3 tp: 8 @@ -4717,14 +4279,12 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: false - spec-decoding: "mtp" conc-list: [6] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_6.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_6.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_6.yaml" decode: num-worker: 3 tp: 8 @@ -4732,98 +4292,84 @@ dsr1-fp8-gb200-dynamo-trt: dp-attn: false # 8k1k STP configs - conc-list: [1229] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx5_gen1_dep16_batch64_eplb0_mtp0_1229.yaml" prefill: num-worker: 5 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/stp/ctx5_gen1_dep16_batch64_eplb0_mtp0_1229.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/stp/ctx5_gen1_dep16_batch64_eplb0_mtp0_1229.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [666] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx4_gen1_dep32_batch16_eplb0_mtp0_666.yaml" prefill: num-worker: 4 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/stp/ctx4_gen1_dep32_batch16_eplb0_mtp0_666.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/stp/ctx4_gen1_dep32_batch16_eplb0_mtp0_666.yaml" decode: num-worker: 1 tp: 32 ep: 32 dp-attn: true - conc-list: 
[615] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx3_gen1_dep16_batch32_eplb0_mtp0_615.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/stp/ctx3_gen1_dep16_batch32_eplb0_mtp0_615.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/stp/ctx3_gen1_dep16_batch32_eplb0_mtp0_615.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [333] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0_333.yaml" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0_333.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0_333.yaml" decode: num-worker: 1 tp: 32 ep: 32 dp-attn: true - conc-list: [63] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0_63.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0_63.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0_63.yaml" decode: num-worker: 3 tp: 8 ep: 8 dp-attn: false - conc-list: [18] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch4_eplb0_mtp0_18.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/stp/ctx1_gen3_tep8_batch4_eplb0_mtp0_18.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/stp/ctx1_gen3_tep8_batch4_eplb0_mtp0_18.yaml" decode: num-worker: 3 tp: 8 ep: 8 dp-attn: false - conc-list: [6] + recipe: "dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_6.yaml" prefill: num-worker: 1 
tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb200-fp8/8k1k/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_6.yaml - - "CONFIG_FILE=recipes/trtllm/gb200-fp8/8k1k/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_6.yaml" decode: num-worker: 3 tp: 8 @@ -4846,14 +4392,12 @@ dsr1-fp8-gb200-dynamo-sglang: search-space: # "Low latency" (1 prefill worker at TP4 and 1 decode worker at TP4) - conc-list: [4, 8] + recipe: "dsr1/sglang/gb200-fp8/1k1k/disagg/stp/low-latency.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb200-fp8/1k1k/low-latency.yaml - - "CONFIG_FILE=recipes/gb200-fp8/1k1k/low-latency.yaml" decode: num-worker: 1 tp: 4 @@ -4862,14 +4406,12 @@ dsr1-fp8-gb200-dynamo-sglang: # "Mid curve" (3 prefill workers at DEP8 and 1 decode worker at DEP48) - conc-list: [1024, 2048, 4096] + recipe: "dsr1/sglang/gb200-fp8/1k1k/disagg/stp/mid-curve.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb200-fp8/1k1k/mid-curve.yaml - - "CONFIG_FILE=recipes/gb200-fp8/1k1k/mid-curve.yaml" decode: num-worker: 1 tp: 48 @@ -4878,14 +4420,12 @@ dsr1-fp8-gb200-dynamo-sglang: # "Max throughput" (2 prefill workers at DEP8 and 1 decode worker at DEP32) - conc-list: [1024, 2048, 4096, 6144] + recipe: "dsr1/sglang/gb200-fp8/1k1k/disagg/stp/max-tpt.yaml" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb200-fp8/1k1k/max-tpt.yaml - - "CONFIG_FILE=recipes/gb200-fp8/1k1k/max-tpt.yaml" decode: num-worker: 1 tp: 32 @@ -4894,14 +4434,12 @@ dsr1-fp8-gb200-dynamo-sglang: # "Ultra throughput" (1 prefill workers at DEP8 and 1 decode worker at DEP8) - conc-list: [4096] + recipe: "dsr1/sglang/gb200-fp8/1k1k/disagg/stp/ultra-tpt.yaml" prefill: 
num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb200-fp8/1k1k/ultra-tpt.yaml - - "CONFIG_FILE=recipes/gb200-fp8/1k1k/ultra-tpt.yaml" decode: num-worker: 1 tp: 8 @@ -4913,14 +4451,12 @@ dsr1-fp8-gb200-dynamo-sglang: search-space: # "Low latency" (1 prefill worker at TP8 and 1 decode worker at TP8) - conc-list: [4, 8, 16] + recipe: "dsr1/sglang/gb200-fp8/8k1k/disagg/stp/low-latency.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb200-fp8/8k1k/low-latency.yaml - - "CONFIG_FILE=recipes/gb200-fp8/8k1k/low-latency.yaml" decode: num-worker: 1 tp: 8 @@ -4929,14 +4465,12 @@ dsr1-fp8-gb200-dynamo-sglang: # "Mid curve" (5 prefill workers at DEP8 and 1 decode worker at DEP32) - conc-list: [512, 1024, 2048, 6144] + recipe: "dsr1/sglang/gb200-fp8/8k1k/disagg/stp/mid-curve.yaml" prefill: num-worker: 5 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb200-fp8/8k1k/mid-curve.yaml - - "CONFIG_FILE=recipes/gb200-fp8/8k1k/mid-curve.yaml" decode: num-worker: 1 tp: 32 @@ -4945,14 +4479,12 @@ dsr1-fp8-gb200-dynamo-sglang: # "Max throughput" (6 prefill workers at DEP8 and 1 decode worker at DEP24) - conc-list: [2048, 4096, 6144] + recipe: "dsr1/sglang/gb200-fp8/8k1k/disagg/stp/max_tpt.yaml" prefill: num-worker: 6 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb200-fp8/8k1k/max_tpt.yaml - - "CONFIG_FILE=recipes/gb200-fp8/8k1k/max_tpt.yaml" decode: num-worker: 1 tp: 24 @@ -4974,14 +4506,12 @@ dsr1-fp8-gb300-dynamo-sglang: search-space: # "Low latency" (1 prefill worker at TP4 and 4 decode workers at TP4) - conc-list: [4, 8, 16, 32] + recipe: "dsr1/sglang/gb300-fp8/1k1k/disagg/stp/low-latency.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: 
false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb300-fp8/1k1k/stp/low-latency.yaml - - "CONFIG_FILE=recipes/gb300-fp8/1k1k/stp/low-latency.yaml" decode: num-worker: 4 tp: 4 @@ -4990,14 +4520,12 @@ dsr1-fp8-gb300-dynamo-sglang: # "Mid curve" (2 prefill workers at DEP8 and 1 decode worker at DEP32) - conc-list: [1024, 2048, 4096, 6144] + recipe: "dsr1/sglang/gb300-fp8/1k1k/disagg/stp/mid.yaml" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb300-fp8/1k1k/stp/mid.yaml - - "CONFIG_FILE=recipes/gb300-fp8/1k1k/stp/mid.yaml" decode: num-worker: 1 tp: 32 @@ -5006,14 +4534,12 @@ dsr1-fp8-gb300-dynamo-sglang: # "Max throughput" (1 prefill worker at DEP8 and 1 decode worker at DEP8) - conc-list: [4096, 7168, 7680] + recipe: "dsr1/sglang/gb300-fp8/1k1k/disagg/stp/max.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb300-fp8/1k1k/stp/max.yaml - - "CONFIG_FILE=recipes/gb300-fp8/1k1k/stp/max.yaml" decode: num-worker: 1 tp: 8 @@ -5025,14 +4551,12 @@ dsr1-fp8-gb300-dynamo-sglang: search-space: # "Low latency" (1 prefill worker at TP4 and 1 decode worker at TP4) - conc-list: [4, 8] + recipe: "dsr1/sglang/gb300-fp8/8k1k/disagg/stp/low-latency.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb300-fp8/8k1k/stp/low-latency.yaml - - "CONFIG_FILE=recipes/gb300-fp8/8k1k/stp/low-latency.yaml" decode: num-worker: 1 tp: 4 @@ -5041,14 +4565,12 @@ dsr1-fp8-gb300-dynamo-sglang: # "Mid curve" (5 prefill workers at DEP8 and 1 decode worker at DEP32) - conc-list: [128, 256, 512, 1024] + recipe: "dsr1/sglang/gb300-fp8/8k1k/disagg/stp/mid.yaml" prefill: num-worker: 5 tp: 8 ep: 8 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb300-fp8/8k1k/stp/mid.yaml - - "CONFIG_FILE=recipes/gb300-fp8/8k1k/stp/mid.yaml" decode: num-worker: 1 tp: 32 @@ -5057,14 +4579,12 @@ dsr1-fp8-gb300-dynamo-sglang: # "Max throughput" (6 prefill workers at DEP8 and 1 decode worker at DEP24) - conc-list: [2048, 4096] + recipe: "dsr1/sglang/gb300-fp8/8k1k/disagg/stp/max.yaml" prefill: num-worker: 6 tp: 8 ep: 8 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/gb300-fp8/8k1k/stp/max.yaml - - "CONFIG_FILE=recipes/gb300-fp8/8k1k/stp/max.yaml" decode: num-worker: 1 tp: 24 @@ -5088,13 +4608,12 @@ dsr1-fp4-gb200-dynamo-sglang: # Low latency (1 prefill node, 2 decode nodes) - spec-decoding: "none" conc-list: [ 4, 8, 32 ] + recipe: "dsr1/sglang/gb200-fp4/1k1k/disagg/stp/low-latency.yaml" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: false - additional-settings: - - "CONFIG_FILE=recipes/gb200-fp4/1k1k/low-latency.yaml" decode: num-worker: 2 tp: 4 @@ -5104,13 +4623,12 @@ dsr1-fp4-gb200-dynamo-sglang: # Mid curve (4 prefill nodes, 8 decode nodes) - spec-decoding: "none" conc-list: [ 512, 2048, 4096, 8192 ] + recipe: "dsr1/sglang/gb200-fp4/1k1k/disagg/stp/mid-curve.yaml" prefill: num-worker: 4 tp: 4 ep: 4 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/gb200-fp4/1k1k/mid-curve.yaml" decode: num-worker: 1 tp: 32 @@ -5120,13 +4638,12 @@ dsr1-fp4-gb200-dynamo-sglang: # Max throughput (4 prefill nodes, 12 decode nodes) - spec-decoding: "none" conc-list: [ 2048, 4096 ] + recipe: "dsr1/sglang/gb200-fp4/1k1k/disagg/stp/max-tpt.yaml" prefill: num-worker: 4 tp: 4 ep: 4 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/gb200-fp4/1k1k/max-tpt.yaml" decode: num-worker: 1 tp: 48 @@ -5140,13 +4657,12 @@ dsr1-fp4-gb200-dynamo-sglang: # Low latency (1 prefill node, 4 decode nodes) - spec-decoding: "none" conc-list: [ 4, 8 ] + recipe: "dsr1/sglang/gb200-fp4/8k1k/disagg/stp/low-latency.yaml" 
prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: false - additional-settings: - - "CONFIG_FILE=recipes/gb200-fp4/8k1k/low-latency.yaml" decode: num-worker: 4 tp: 4 @@ -5156,13 +4672,12 @@ dsr1-fp4-gb200-dynamo-sglang: # Mid curve (6 prefill nodes, 12 decode nodes) - spec-decoding: "none" conc-list: [ 512, 2048, 4096 ] + recipe: "dsr1/sglang/gb200-fp4/8k1k/disagg/stp/mid-curve.yaml" prefill: num-worker: 6 tp: 4 ep: 1 dp-attn: false - additional-settings: - - "CONFIG_FILE=recipes/gb200-fp4/8k1k/mid-curve.yaml" decode: num-worker: 1 tp: 48 @@ -5172,13 +4687,12 @@ dsr1-fp4-gb200-dynamo-sglang: # Max throughput (10 prefill nodes, 8 decode nodes) - spec-decoding: "none" conc-list: [ 2048 ] + recipe: "dsr1/sglang/gb200-fp4/8k1k/disagg/stp/max-tpt.yaml" prefill: num-worker: 10 tp: 4 ep: 1 dp-attn: false - additional-settings: - - "CONFIG_FILE=recipes/gb200-fp4/8k1k/max-tpt.yaml" decode: num-worker: 1 tp: 32 @@ -5201,14 +4715,12 @@ dsr1-fp4-gb300-dynamo-trt: # MTP configurations - spec-decoding: "mtp" conc-list: [3226] + recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep4_batch768_eplb0_mtp1.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/mtp/ctx1_gen1_dep4_batch768_eplb0_mtp1.yaml - - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/mtp/ctx1_gen1_dep4_batch768_eplb0_mtp1.yaml" decode: num-worker: 1 tp: 4 @@ -5216,14 +4728,12 @@ dsr1-fp4-gb300-dynamo-trt: dp-attn: true - spec-decoding: "mtp" conc-list: [333] + recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep32_batch8_eplb0_mtp.yaml" prefill: num-worker: 1 tp: 2 ep: 2 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/mtp/ctx1_gen1_dep32_batch8_eplb0_mtp.yaml - - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/mtp/ctx1_gen1_dep32_batch8_eplb0_mtp.yaml" decode: num-worker: 1 tp: 32 @@ -5231,14 +4741,12 @@ 
dsr1-fp4-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [5]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml"
     decode:
       num-worker: 4
       tp: 8
@@ -5246,14 +4754,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: false
   - spec-decoding: "mtp"
     conc-list: [8, 12, 24, 48]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml"
     decode:
       num-worker: 4
       tp: 8
@@ -5261,14 +4767,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: false
   - spec-decoding: "mtp"
     conc-list: [2253]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx3_gen1_dep16_batch128_eplb256_mtp1.yaml"
     prefill:
       num-worker: 3
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/mtp/ctx3_gen1_dep16_batch128_eplb256_mtp1.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/mtp/ctx3_gen1_dep16_batch128_eplb256_mtp1.yaml"
     decode:
       num-worker: 1
       tp: 16
@@ -5276,14 +4780,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [1229]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx3_gen1_dep32_batch32_eplb288_mtp3.yaml"
     prefill:
       num-worker: 3
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/mtp/ctx3_gen1_dep32_batch32_eplb288_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/mtp/ctx3_gen1_dep32_batch32_eplb288_mtp3.yaml"
     decode:
       num-worker: 1
       tp: 32
@@ -5291,84 +4793,72 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: true
   # Non-MTP configurations (default spec_decoding="none")
   - conc-list: [5]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [12, 48, 96, 192]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [8192]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx2_gen1_dep8_batch1024_eplb0_mtp0.yaml"
     prefill:
       num-worker: 2
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/stp/ctx2_gen1_dep8_batch1024_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/stp/ctx2_gen1_dep8_batch1024_eplb0_mtp0.yaml"
     decode:
       num-worker: 1
       tp: 8
       ep: 8
       dp-attn: true
   - conc-list: [1229]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0.yaml"
     prefill:
       num-worker: 2
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0.yaml"
     decode:
       num-worker: 1
       tp: 32
       ep: 32
       dp-attn: true
   - conc-list: [4301]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx3_gen1_dep16_batch256_eplb256_mtp0.yaml"
     prefill:
       num-worker: 3
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/stp/ctx3_gen1_dep16_batch256_eplb256_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/stp/ctx3_gen1_dep16_batch256_eplb256_mtp0.yaml"
     decode:
       num-worker: 1
       tp: 16
       ep: 16
       dp-attn: true
   - conc-list: [2253]
+    recipe: "dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx3_gen1_dep32_batch64_eplb0_mtp0.yaml"
     prefill:
       num-worker: 3
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/1k1k/stp/ctx3_gen1_dep32_batch64_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/1k1k/stp/ctx3_gen1_dep32_batch64_eplb0_mtp0.yaml"
     decode:
       num-worker: 1
       tp: 32
@@ -5380,14 +4870,12 @@ dsr1-fp4-gb300-dynamo-trt:
   # MTP configurations (spec_decoding="mtp")
   - spec-decoding: "mtp"
     conc-list: [33]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3.yaml"
     decode:
       num-worker: 3
       tp: 8
@@ -5395,14 +4883,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: false
   - spec-decoding: "mtp"
     conc-list: [5]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml"
     decode:
       num-worker: 4
       tp: 8
@@ -5410,14 +4896,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: false
   - spec-decoding: "mtp"
     conc-list: [12, 24]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml"
     decode:
       num-worker: 4
       tp: 8
@@ -5425,14 +4909,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: false
   - spec-decoding: "mtp"
     conc-list: [180]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx4_gen1_dep32_batch4_eplb0_mtp3.yaml"
     prefill:
       num-worker: 4
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/mtp/ctx4_gen1_dep32_batch4_eplb0_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/mtp/ctx4_gen1_dep32_batch4_eplb0_mtp3.yaml"
     decode:
       num-worker: 1
       tp: 32
@@ -5440,14 +4922,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [308]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx8_gen1_dep32_batch8_eplb0_mtp3.yaml"
     prefill:
       num-worker: 8
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/mtp/ctx8_gen1_dep32_batch8_eplb0_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/mtp/ctx8_gen1_dep32_batch8_eplb0_mtp3.yaml"
     decode:
       num-worker: 1
       tp: 32
@@ -5455,14 +4935,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [2253]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml"
     prefill:
       num-worker: 10
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml"
     decode:
       num-worker: 1
       tp: 8
@@ -5470,14 +4948,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [666]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep16_batch32_eplb0_mtp3.yaml"
     prefill:
       num-worker: 10
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/mtp/ctx10_gen1_dep16_batch32_eplb0_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/mtp/ctx10_gen1_dep16_batch32_eplb0_mtp3.yaml"
     decode:
       num-worker: 1
       tp: 16
@@ -5485,14 +4961,12 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [1127]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx13_gen1_dep16_batch64_eplb256_mtp3.yaml"
     prefill:
       num-worker: 13
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/mtp/ctx13_gen1_dep16_batch64_eplb256_mtp3.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/mtp/ctx13_gen1_dep16_batch64_eplb256_mtp3.yaml"
     decode:
       num-worker: 1
       tp: 16
@@ -5500,112 +4974,96 @@ dsr1-fp4-gb300-dynamo-trt:
       dp-attn: true
   # Non-MTP configurations (default spec_decoding="none")
   - conc-list: [72]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml"
     decode:
       num-worker: 3
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [5]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [12]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch2_eplb0_mtp0.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/stp/ctx1_gen4_tep8_batch2_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/stp/ctx1_gen4_tep8_batch2_eplb0_mtp0.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [5, 15, 30]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml"
     prefill:
       num-worker: 1
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml"
     decode:
       num-worker: 5
       tp: 4
       ep: 4
       dp-attn: false
   - conc-list: [666]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx7_gen1_dep32_batch16_eplb0_mtp0.yaml"
     prefill:
       num-worker: 7
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/stp/ctx7_gen1_dep32_batch16_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/stp/ctx7_gen1_dep32_batch16_eplb0_mtp0.yaml"
     decode:
       num-worker: 1
       tp: 32
       ep: 32
       dp-attn: true
   - conc-list: [1229]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx9_gen1_dep16_batch64_eplb0_mtp0.yaml"
     prefill:
       num-worker: 9
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/stp/ctx9_gen1_dep16_batch64_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/stp/ctx9_gen1_dep16_batch64_eplb0_mtp0.yaml"
     decode:
       num-worker: 1
       tp: 16
       ep: 16
       dp-attn: true
   - conc-list: [3228]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx11_gen3_dep4_batch256_eplb0_mtp0.yaml"
     prefill:
       num-worker: 11
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/stp/ctx11_gen3_dep4_batch256_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/stp/ctx11_gen3_dep4_batch256_eplb0_mtp0.yaml"
     decode:
       num-worker: 3
       tp: 4
       ep: 4
       dp-attn: true
   - conc-list: [2253]
+    recipe: "dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx14_gen1_dep16_batch128_eplb0_mtp0.yaml"
     prefill:
       num-worker: 14
       tp: 2
       ep: 2
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp4/8k1k/stp/ctx14_gen1_dep16_batch128_eplb0_mtp0.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp4/8k1k/stp/ctx14_gen1_dep16_batch128_eplb0_mtp0.yaml"
     decode:
       num-worker: 1
       tp: 16
@@ -5629,13 +5087,12 @@ dsr1-fp4-gb300-dynamo-sglang:
   # Low latency (1 prefill node, 2 decode nodes)
   - spec-decoding: "none"
     conc-list: [ 4, 8, 32 ]
+    recipe: "dsr1/sglang/gb300-fp4/1k1k/disagg/stp/low_latency.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/gb300-fp4/1k1k/low_latency.yaml"
     decode:
       num-worker: 2
       tp: 4
@@ -5645,13 +5102,12 @@ dsr1-fp4-gb300-dynamo-sglang:
   # Mid curve (4 prefill nodes, 8 decode nodes)
   - spec-decoding: "none"
     conc-list: [ 512, 2048, 4096, 8192 ]
+    recipe: "dsr1/sglang/gb300-fp4/1k1k/disagg/stp/mid_curve.yaml"
     prefill:
       num-worker: 4
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/gb300-fp4/1k1k/mid_curve.yaml"
     decode:
       num-worker: 1
       tp: 32
@@ -5661,13 +5117,12 @@ dsr1-fp4-gb300-dynamo-sglang:
   # Max throughput (4 prefill nodes, 12 decode nodes)
   - spec-decoding: "none"
     conc-list: [ 512, 2048, 4096, 8192 ]
+    recipe: "dsr1/sglang/gb300-fp4/1k1k/disagg/stp/max_tpt.yaml"
     prefill:
       num-worker: 4
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/gb300-fp4/1k1k/max_tpt.yaml"
     decode:
       num-worker: 1
       tp: 48
@@ -5681,13 +5136,12 @@ dsr1-fp4-gb300-dynamo-sglang:
   # Low latency (1 prefill node, 4 decode nodes)
   - spec-decoding: "none"
     conc-list: [ 4, 8, 32, 64 ]
+    recipe: "dsr1/sglang/gb300-fp4/8k1k/disagg/stp/low_latency.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/gb300-fp4/8k1k/low_latency.yaml"
     decode:
       num-worker: 4
       tp: 4
@@ -5697,13 +5151,12 @@ dsr1-fp4-gb300-dynamo-sglang:
   # Mid curve (6 prefill nodes, 12 decode nodes)
   - spec-decoding: "none"
     conc-list: [ 512, 2048, 4096 ]
+    recipe: "dsr1/sglang/gb300-fp4/8k1k/disagg/stp/mid_curve.yaml"
     prefill:
       num-worker: 6
       tp: 4
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/gb300-fp4/8k1k/mid_curve.yaml"
     decode:
       num-worker: 1
       tp: 48
@@ -5713,13 +5166,12 @@ dsr1-fp4-gb300-dynamo-sglang:
   # Max throughput (10 prefill nodes, 8 decode nodes)
   - spec-decoding: "none"
     conc-list: [ 2048 ]
+    recipe: "dsr1/sglang/gb300-fp4/8k1k/disagg/stp/max_tpt.yaml"
     prefill:
       num-worker: 10
       tp: 4
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/gb300-fp4/8k1k/max_tpt.yaml"
     decode:
       num-worker: 1
       tp: 32
@@ -5742,14 +5194,12 @@ dsr1-fp8-gb300-dynamo-trt:
   # MTP configurations (spec_decoding="mtp")
   - spec-decoding: "mtp"
     conc-list: [8]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml"
     decode:
       num-worker: 4
       tp: 8
@@ -5757,14 +5207,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: false
   - spec-decoding: "mtp"
     conc-list: [24]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml"
     decode:
       num-worker: 4
       tp: 8
@@ -5772,14 +5220,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: false
   - spec-decoding: "mtp"
     conc-list: [180]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3_180.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3_180.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3_180.yaml"
     decode:
       num-worker: 1
       tp: 32
@@ -5787,14 +5233,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [564]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx2_gen1_dep32_batch16_eplb0_mtp3_564.yaml"
     prefill:
       num-worker: 2
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/mtp/ctx2_gen1_dep32_batch16_eplb0_mtp3_564.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/mtp/ctx2_gen1_dep32_batch16_eplb0_mtp3_564.yaml"
     decode:
       num-worker: 1
       tp: 32
@@ -5802,14 +5246,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [666]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp3_666.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp3_666.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp3_666.yaml"
     decode:
       num-worker: 1
       tp: 16
@@ -5817,14 +5259,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [2253]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx2_gen1_dep16_batch128_eplb0_mtp1_2253.yaml"
     prefill:
       num-worker: 2
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/mtp/ctx2_gen1_dep16_batch128_eplb0_mtp1_2253.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/mtp/ctx2_gen1_dep16_batch128_eplb0_mtp1_2253.yaml"
     decode:
       num-worker: 1
       tp: 16
@@ -5832,14 +5272,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [8192]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx3_gen2_dep8_batch512_eplb0_mtp1_8192.yaml"
     prefill:
       num-worker: 3
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/mtp/ctx3_gen2_dep8_batch512_eplb0_mtp1_8192.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/mtp/ctx3_gen2_dep8_batch512_eplb0_mtp1_8192.yaml"
     decode:
       num-worker: 2
       tp: 8
@@ -5847,98 +5285,84 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: true
   # STP configurations (no spec_decoding)
   - conc-list: [4]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [24]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [84]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0_84.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0_84.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0_84.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [1229]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0_1229.yaml"
     prefill:
       num-worker: 2
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0_1229.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0_1229.yaml"
     decode:
       num-worker: 1
       tp: 32
       ep: 32
       dp-attn: true
   - conc-list: [2253]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx2_gen1_dep16_batch128_eplb0_mtp0_2253.yaml"
     prefill:
       num-worker: 2
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/stp/ctx2_gen1_dep16_batch128_eplb0_mtp0_2253.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/stp/ctx2_gen1_dep16_batch128_eplb0_mtp0_2253.yaml"
     decode:
       num-worker: 1
       tp: 16
       ep: 16
       dp-attn: true
   - conc-list: [8602]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx3_gen2_dep8_batch512_eplb0_mtp0_8602.yaml"
     prefill:
       num-worker: 3
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/stp/ctx3_gen2_dep8_batch512_eplb0_mtp0_8602.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/stp/ctx3_gen2_dep8_batch512_eplb0_mtp0_8602.yaml"
     decode:
       num-worker: 2
       tp: 8
       ep: 8
       dp-attn: true
   - conc-list: [12288]
+    recipe: "dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx3_gen2_dep8_batch768_eplb0_mtp0_12288.yaml"
     prefill:
       num-worker: 3
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/1k1k/stp/ctx3_gen2_dep8_batch768_eplb0_mtp0_12288.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/1k1k/stp/ctx3_gen2_dep8_batch768_eplb0_mtp0_12288.yaml"
     decode:
       num-worker: 2
       tp: 8
@@ -5950,14 +5374,12 @@ dsr1-fp8-gb300-dynamo-trt:
   # MTP configurations (spec_decoding="mtp")
   - spec-decoding: "mtp"
     conc-list: [8]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml"
     decode:
       num-worker: 4
       tp: 8
@@ -5965,14 +5387,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: false
   - spec-decoding: "mtp"
     conc-list: [24]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml"
     decode:
       num-worker: 4
       tp: 8
@@ -5980,14 +5400,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: false
   - spec-decoding: "mtp"
     conc-list: [333]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx6_gen1_dep32_batch8_eplb0_mtp3_333.yaml"
     prefill:
       num-worker: 6
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/mtp/ctx6_gen1_dep32_batch8_eplb0_mtp3_333.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/mtp/ctx6_gen1_dep32_batch8_eplb0_mtp3_333.yaml"
     decode:
       num-worker: 1
       tp: 32
@@ -5995,14 +5413,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [666]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx8_gen1_dep16_batch32_eplb0_mtp3_666.yaml"
     prefill:
       num-worker: 8
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/mtp/ctx8_gen1_dep16_batch32_eplb0_mtp3_666.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/mtp/ctx8_gen1_dep16_batch32_eplb0_mtp3_666.yaml"
     decode:
       num-worker: 1
       tp: 16
@@ -6010,14 +5426,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [1229]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx10_gen1_dep16_batch64_eplb0_mtp1_1229.yaml"
     prefill:
       num-worker: 10
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/mtp/ctx10_gen1_dep16_batch64_eplb0_mtp1_1229.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/mtp/ctx10_gen1_dep16_batch64_eplb0_mtp1_1229.yaml"
     decode:
       num-worker: 1
       tp: 16
@@ -6025,14 +5439,12 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: true
   - spec-decoding: "mtp"
     conc-list: [1229]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx7_gen1_dep8_batch128_eplb0_mtp1_1229.yaml"
     prefill:
       num-worker: 7
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/mtp/ctx7_gen1_dep8_batch128_eplb0_mtp1_1229.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/mtp/ctx7_gen1_dep8_batch128_eplb0_mtp1_1229.yaml"
     decode:
       num-worker: 1
       tp: 8
@@ -6040,98 +5452,84 @@ dsr1-fp8-gb300-dynamo-trt:
       dp-attn: true
   # STP configurations (no spec_decoding)
   - conc-list: [4]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [24]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [36]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch8_eplb0_mtp0_36.yaml"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/stp/ctx1_gen4_tep8_batch8_eplb0_mtp0_36.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/stp/ctx1_gen4_tep8_batch8_eplb0_mtp0_36.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [512]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx6_gen1_dep32_batch16_eplb0_mtp0_512.yaml"
     prefill:
       num-worker: 6
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/stp/ctx6_gen1_dep32_batch16_eplb0_mtp0_512.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/stp/ctx6_gen1_dep32_batch16_eplb0_mtp0_512.yaml"
     decode:
       num-worker: 1
       tp: 32
       ep: 32
       dp-attn: true
   - conc-list: [666]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx4_gen1_dep16_batch32_eplb0_mtp0_666.yaml"
     prefill:
       num-worker: 4
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/stp/ctx4_gen1_dep16_batch32_eplb0_mtp0_666.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/stp/ctx4_gen1_dep16_batch32_eplb0_mtp0_666.yaml"
     decode:
       num-worker: 1
       tp: 16
       ep: 16
       dp-attn: true
   - conc-list: [1229]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx7_gen1_dep16_batch64_eplb0_mtp0_1229.yaml"
     prefill:
       num-worker: 7
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/stp/ctx7_gen1_dep16_batch64_eplb0_mtp0_1229.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/stp/ctx7_gen1_dep16_batch64_eplb0_mtp0_1229.yaml"
     decode:
       num-worker: 1
       tp: 16
       ep: 16
       dp-attn: true
   - conc-list: [2151]
+    recipe: "dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx7_gen1_dep8_batch256_eplb0_mtp0_2151.yaml"
     prefill:
       num-worker: 7
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/gb300-fp8/8k1k/stp/ctx7_gen1_dep8_batch256_eplb0_mtp0_2151.yaml
-    - "CONFIG_FILE=recipes/trtllm/gb300-fp8/8k1k/stp/ctx7_gen1_dep8_batch256_eplb0_mtp0_2151.yaml"
     decode:
       num-worker: 1
       tp: 8
@@ -6424,13 +5822,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # STP: Low latency (1 prefill, 9 decode, TEP)
   - spec-decoding: "none"
     conc-list: [1, 4, 8, 16, 32, 64, 128, 256]
+    recipe: "dsr1/sglang/h200-fp8/1k1k/disagg/stp/low-latency-1p9d.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/1k1k/low-latency-1p9d.yaml"
     decode:
       num-worker: 9
       tp: 8
@@ -6439,13 +5836,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # STP: High throughput TEP (1 prefill, 6 decode)
   - spec-decoding: "none"
     conc-list: [512, 1024, 2048]
+    recipe: "dsr1/sglang/h200-fp8/1k1k/disagg/stp/bs256-1p6d-tp.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/1k1k/bs256-1p6d-tp.yaml"
     decode:
       num-worker: 6
       tp: 8
@@ -6454,13 +5850,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # STP: High throughput DEP (1 prefill, 6 decode, dp-attention)
   - spec-decoding: "none"
     conc-list: [128, 256, 512, 1024, 2048]
+    recipe: "dsr1/sglang/h200-fp8/1k1k/disagg/stp/bs256-1p6d-dep.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 8
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/1k1k/bs256-1p6d-dep.yaml"
     decode:
       num-worker: 6
       tp: 8
@@ -6469,13 +5864,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # MTP: Low latency (1 prefill, 9 decode, TEP)
   - spec-decoding: "mtp"
     conc-list: [1, 4, 8, 16, 32, 64, 128, 256]
+    recipe: "dsr1/sglang/h200-fp8/1k1k/disagg/mtp/low-latency-1p9d-mtp.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/1k1k/low-latency-1p9d-mtp.yaml"
     decode:
       num-worker: 9
       tp: 8
@@ -6484,13 +5878,12 @@ dsr1-fp8-h200-dynamo-sglang:
  # MTP: High throughput TEP (1 prefill, 6 decode)
   - spec-decoding: "mtp"
     conc-list: [512, 1024, 2048]
+    recipe: "dsr1/sglang/h200-fp8/1k1k/disagg/mtp/bs256-1p6d-tp-mtp.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/1k1k/bs256-1p6d-tp-mtp.yaml"
     decode:
       num-worker: 6
       tp: 8
@@ -6499,13 +5892,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # MTP: High throughput DEP (1 prefill, 6 decode, dp-attention)
   - spec-decoding: "mtp"
     conc-list: [128, 256, 512, 1024, 2048]
+    recipe: "dsr1/sglang/h200-fp8/1k1k/disagg/mtp/bs256-1p6d-dep-mtp.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 8
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/1k1k/bs256-1p6d-dep-mtp.yaml"
     decode:
       num-worker: 6
       tp: 8
@@ -6517,13 +5909,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # STP: Low latency TEP (1 prefill, 7 decode)
   - spec-decoding: "none"
     conc-list: [1, 4, 8]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs4-1p7d.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs4-1p7d.yaml"
     decode:
       num-worker: 7
       tp: 8
@@ -6532,13 +5923,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # STP: TEP (1 prefill, 6 decode)
   - spec-decoding: "none"
     conc-list: [4, 8, 16]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs8-1p6d.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs8-1p6d.yaml"
     decode:
       num-worker: 6
       tp: 8
@@ -6547,13 +5937,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # STP: TEP (1 prefill, 3 decode)
   - spec-decoding: "none"
     conc-list: [8, 16, 32]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs16-1p3d.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs16-1p3d.yaml"
     decode:
       num-worker: 3
       tp: 8
@@ -6562,13 +5951,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # STP: TEP (2 prefill, 3 decode)
   - spec-decoding: "none"
     conc-list: [32, 64, 128]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs64-2p3d.yaml"
     prefill:
       num-worker: 2
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs64-2p3d.yaml"
     decode:
       num-worker: 3
       tp: 8
@@ -6577,13 +5965,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # STP: High throughput DEP (1 prefill, 1 decode, dp-attention)
   - spec-decoding: "none"
     conc-list: [64, 128, 256]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs128-1p1d-dep.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs128-1p1d-dep.yaml"
     decode:
       num-worker: 1
       tp: 8
@@ -6592,13 +5979,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # MTP: Low latency TEP (1 prefill, 7 decode)
   - spec-decoding: "mtp"
     conc-list: [1, 4, 8]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs4-1p7d-mtp.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs4-1p7d-mtp.yaml"
     decode:
       num-worker: 7
       tp: 8
@@ -6607,13 +5993,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # MTP: TEP (1 prefill, 6 decode)
   - spec-decoding: "mtp"
     conc-list: [2, 4, 8, 16, 32]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs8-1p6d-mtp.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs8-1p6d-mtp.yaml"
     decode:
       num-worker: 6
       tp: 8
@@ -6622,13 +6007,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # MTP: TEP (1 prefill, 3 decode)
   - spec-decoding: "mtp"
     conc-list: [4, 8, 16, 32, 64]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs16-1p3d-mtp.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs16-1p3d-mtp.yaml"
     decode:
       num-worker: 3
       tp: 8
@@ -6637,13 +6021,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # MTP: TEP (2 prefill, 3 decode)
   - spec-decoding: "mtp"
     conc-list: [32, 64, 128]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs64-2p3d-mtp.yaml"
     prefill:
       num-worker: 2
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs64-2p3d-mtp.yaml"
     decode:
       num-worker: 3
       tp: 8
@@ -6652,13 +6035,12 @@ dsr1-fp8-h200-dynamo-sglang:
   # MTP: High throughput DEP (1 prefill, 1 decode, dp-attention)
   - spec-decoding: "mtp"
     conc-list: [32, 64, 128, 256, 512]
+    recipe: "dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs128-1p1d-dep-mtp.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/h200/8k1k/bs128-1p1d-dep-mtp.yaml"
     decode:
       num-worker: 1
       tp: 8
@@ -6680,52 +6062,48 @@ dsr1-fp4-b200-dynamo-sglang:
   search-space:
   # Non-MTP configurations
   - conc-list: [16, 128]
+    recipe: "dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml:zip_override_stp_lowlat[0]"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp4/1k1k.yaml:zip_override_stp_lowlat[0]"
     decode:
       num-worker: 5
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [32, 64, 256]
+    recipe: "dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml:zip_override_stp_lowlat[1]"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp4/1k1k.yaml:zip_override_stp_lowlat[1]"
     decode:
       num-worker: 6
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [512]
+    recipe: "dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml:zip_override_stp_maxtpt[0]"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp4/1k1k.yaml:zip_override_stp_maxtpt[0]"
     decode:
       num-worker: 1
       tp: 8
       ep: 8
       dp-attn: true
   - conc-list: [512]
+    recipe: "dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml:zip_override_stp_maxtpt[1]"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp4/1k1k.yaml:zip_override_stp_maxtpt[1]"
     decode:
       num-worker: 2
       tp: 8
@@ -6736,65 +6114,60 @@ dsr1-fp4-b200-dynamo-sglang:
   search-space:
   # Non-MTP configurations
   - conc-list: [64, 128]
+    recipe: "dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml:zip_override_stp_lowlat[0]"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp4/8k1k.yaml:zip_override_stp_lowlat[0]"
     decode:
       num-worker: 1
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [8]
+    recipe: "dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml:zip_override_stp_lowlat[1]"
     prefill:
       num-worker: 1
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp4/8k1k.yaml:zip_override_stp_lowlat[1]"
     decode:
       num-worker: 5
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [4, 128]
+    recipe: "dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml:zip_override_stp_lowlat[2]"
     prefill:
       num-worker: 2
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp4/8k1k.yaml:zip_override_stp_lowlat[2]"
     decode:
       num-worker: 5
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [4, 8, 16, 64]
+    recipe: "dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml:override_stp_tp4"
     prefill:
       num-worker: 1
       tp: 4
       ep: 1
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp4/8k1k.yaml:override_stp_tp4"
     decode:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: false
   - conc-list: [1024, 2048]
+    recipe: "dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml:override_stp_maxtpt_7p2d"
     prefill:
       num-worker: 7
       tp: 4
       ep: 4
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp4/8k1k.yaml:override_stp_maxtpt_7p2d"
     decode:
       num-worker: 2
       tp: 8
@@ -6816,52 +6189,48 @@ dsr1-fp8-b200-dynamo-sglang:
   search-space:
   # Non-MTP configurations
   - conc-list: [4]
+    recipe: "dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml:zip_override_stp_lowlat[0]"
     prefill:
       num-worker: 1
       tp: 8
       ep: 8
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp8/1k1k.yaml:zip_override_stp_lowlat[0]"
     decode:
       num-worker: 1
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [16, 32, 64, 128, 256]
+    recipe: "dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml:zip_override_stp_lowlat[1]"
     prefill:
       num-worker: 1
       tp: 8
       ep: 8
       dp-attn: false
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp8/1k1k.yaml:zip_override_stp_lowlat[1]"
     decode:
       num-worker: 3
       tp: 8
       ep: 8
       dp-attn: false
   - conc-list: [1024, 2048, 4096]
+    recipe: "dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml:zip_override_stp_maxtpt[0]"
     prefill:
       num-worker: 1
       tp: 8
       ep: 8
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp8/1k1k.yaml:zip_override_stp_maxtpt[0]"
     decode:
       num-worker: 5
       tp: 8
       ep: 8
       dp-attn: true
   - conc-list: [2048, 4096]
+    recipe: "dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml:zip_override_stp_maxtpt[1]"
     prefill:
       num-worker: 2
       tp: 8
       ep: 8
       dp-attn: true
-    additional-settings:
-    - "CONFIG_FILE=recipes/b200-fp8/1k1k.yaml:zip_override_stp_maxtpt[1]"
     decode:
       num-worker: 5
       tp: 8
@@ -6872,42 +6241,36 @@ dsr1-fp8-b200-dynamo-sglang:
   search-space:
   # STP low-latency: resolved from 8k1k.yaml zip_override_stp_lowlat
   - conc-list: [128]
+    recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_0.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_stp_lowlat_0.yaml
-    - "CONFIG_FILE=recipes/b200-fp8/8k1k_stp_lowlat_0.yaml"
     decode:
       num-worker: 3
       tp: 8
       ep: 1
       dp-attn: false
   - conc-list: [128]
+    recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_1.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_stp_lowlat_1.yaml
-    - "CONFIG_FILE=recipes/b200-fp8/8k1k_stp_lowlat_1.yaml"
     decode:
       num-worker: 4
       tp: 8
       ep: 1
       dp-attn: false
   - conc-list: [8, 16, 32, 64, 128]
+    recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_2.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_stp_lowlat_2.yaml
-    - "CONFIG_FILE=recipes/b200-fp8/8k1k_stp_lowlat_2.yaml"
     decode:
       num-worker: 6
       tp: 8
@@ -6915,56 +6278,48 @@ dsr1-fp8-b200-dynamo-sglang:
       dp-attn: false
   # STP max-throughput: resolved from 8k1k.yaml zip_override_stp_maxtpt
   - conc-list: [288]
+    recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_0.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_stp_maxtpt_0.yaml
-    - "CONFIG_FILE=recipes/b200-fp8/8k1k_stp_maxtpt_0.yaml"
     decode:
       num-worker: 2
       tp: 8
       ep: 8
       dp-attn: true
   - conc-list: [160, 288]
+    recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_1.yaml"
     prefill:
       num-worker: 1
       tp: 8
       ep: 1
       dp-attn: true
-    additional-settings:
-    # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_stp_maxtpt_1.yaml
-    - "CONFIG_FILE=recipes/b200-fp8/8k1k_stp_maxtpt_1.yaml"
     decode:
       num-worker: 1
       tp: 8
       ep: 8
dp-attn: true - conc-list: [512] + recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_2.yaml" prefill: num-worker: 2 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_stp_maxtpt_2.yaml - - "CONFIG_FILE=recipes/b200-fp8/8k1k_stp_maxtpt_2.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [1024] + recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_3.yaml" prefill: num-worker: 3 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_stp_maxtpt_3.yaml - - "CONFIG_FILE=recipes/b200-fp8/8k1k_stp_maxtpt_3.yaml" decode: num-worker: 1 tp: 8 @@ -6987,13 +6342,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: # MTP low-latency: 1P1D - spec-decoding: "mtp" conc-list: [4, 64] + recipe: "dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml:zip_override_mtp_lowlat[0]" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - - "CONFIG_FILE=recipes/b200-fp8/1k1k.yaml:zip_override_mtp_lowlat[0]" decode: num-worker: 1 tp: 8 @@ -7002,13 +6356,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: # MTP low-latency: 1P3D - spec-decoding: "mtp" conc-list: [4, 8, 16, 32, 128] + recipe: "dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml:zip_override_mtp_lowlat[1]" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: false - additional-settings: - - "CONFIG_FILE=recipes/b200-fp8/1k1k.yaml:zip_override_mtp_lowlat[1]" decode: num-worker: 3 tp: 8 @@ -7017,13 +6370,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: # MTP max-tpt: 1P5D - spec-decoding: "mtp" conc-list: [512, 4096] + recipe: "dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml:zip_override_mtp_maxtpt[1]" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/b200-fp8/1k1k.yaml:zip_override_mtp_maxtpt[1]" decode: num-worker: 5 tp: 8 @@ -7032,13 +6384,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: # MTP max-tpt: 2P5D - spec-decoding: "mtp" 
conc-list: [1024, 2048, 4096] + recipe: "dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml:zip_override_mtp_maxtpt[2]" prefill: num-worker: 2 tp: 8 ep: 8 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/b200-fp8/1k1k.yaml:zip_override_mtp_maxtpt[2]" decode: num-worker: 5 tp: 8 @@ -7047,13 +6398,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: # MTP max-tpt: 1P2D - spec-decoding: "mtp" conc-list: [512, 1024, 2048] + recipe: "dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml:override_mtp_maxtpt_1p2d" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/b200-fp8/1k1k.yaml:override_mtp_maxtpt_1p2d" decode: num-worker: 2 tp: 8 @@ -7065,14 +6415,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: # MTP low-latency: resolved from 8k1k.yaml zip_override_mtp_lowlat - spec-decoding: "mtp" conc-list: [128] + recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_0.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_mtp_lowlat_0.yaml - - "CONFIG_FILE=recipes/b200-fp8/8k1k_mtp_lowlat_0.yaml" decode: num-worker: 3 tp: 8 @@ -7080,14 +6428,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: dp-attn: false - spec-decoding: "mtp" conc-list: [128] + recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_1.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_mtp_lowlat_1.yaml - - "CONFIG_FILE=recipes/b200-fp8/8k1k_mtp_lowlat_1.yaml" decode: num-worker: 4 tp: 8 @@ -7095,14 +6441,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: dp-attn: false - spec-decoding: "mtp" conc-list: [8, 16, 32, 64, 128] + recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_2.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_mtp_lowlat_2.yaml - - 
"CONFIG_FILE=recipes/b200-fp8/8k1k_mtp_lowlat_2.yaml" decode: num-worker: 6 tp: 8 @@ -7111,14 +6455,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: # MTP max-throughput: resolved from 8k1k.yaml zip_override_mtp_maxtpt - spec-decoding: "mtp" conc-list: [288] + recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_0.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_mtp_maxtpt_0.yaml - - "CONFIG_FILE=recipes/b200-fp8/8k1k_mtp_maxtpt_0.yaml" decode: num-worker: 2 tp: 8 @@ -7126,14 +6468,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: dp-attn: true - spec-decoding: "mtp" conc-list: [160, 288] + recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_1.yaml" prefill: num-worker: 1 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_mtp_maxtpt_1.yaml - - "CONFIG_FILE=recipes/b200-fp8/8k1k_mtp_maxtpt_1.yaml" decode: num-worker: 1 tp: 8 @@ -7141,14 +6481,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: dp-attn: true - spec-decoding: "mtp" conc-list: [512] + recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_2.yaml" prefill: num-worker: 2 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_mtp_maxtpt_2.yaml - - "CONFIG_FILE=recipes/b200-fp8/8k1k_mtp_maxtpt_2.yaml" decode: num-worker: 1 tp: 8 @@ -7156,14 +6494,12 @@ dsr1-fp8-b200-dynamo-sglang-mtp: dp-attn: true - spec-decoding: "mtp" conc-list: [1024] + recipe: "dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_3.yaml" prefill: num-worker: 3 tp: 8 ep: 1 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp8/8k1k_mtp_maxtpt_3.yaml - - "CONFIG_FILE=recipes/b200-fp8/8k1k_mtp_maxtpt_3.yaml" decode: num-worker: 1 tp: 8 @@ -7185,14 +6521,12 @@ dsr1-fp4-b200-dynamo-sglang-mtp: 
search-space: - spec-decoding: "mtp" conc-list: [16, 512] + recipe: "dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml:zip_override_mtp_lowlat[0]" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp4/1k1k.yaml - - "CONFIG_FILE=recipes/b200-fp4/1k1k.yaml:zip_override_mtp_lowlat[0]" decode: num-worker: 5 tp: 8 @@ -7200,14 +6534,12 @@ dsr1-fp4-b200-dynamo-sglang-mtp: dp-attn: false - spec-decoding: "mtp" conc-list: [32, 64, 256, 512] + recipe: "dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml:zip_override_mtp_lowlat[1]" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp4/1k1k.yaml - - "CONFIG_FILE=recipes/b200-fp4/1k1k.yaml:zip_override_mtp_lowlat[1]" decode: num-worker: 6 tp: 8 @@ -7215,14 +6547,12 @@ dsr1-fp4-b200-dynamo-sglang-mtp: dp-attn: false - spec-decoding: "mtp" conc-list: [512, 1024] + recipe: "dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml:zip_override_mtp_maxtpt[0]" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp4/1k1k.yaml - - "CONFIG_FILE=recipes/b200-fp4/1k1k.yaml:zip_override_mtp_maxtpt[0]" decode: num-worker: 1 tp: 8 @@ -7230,14 +6560,12 @@ dsr1-fp4-b200-dynamo-sglang-mtp: dp-attn: true - spec-decoding: "mtp" conc-list: [512] + recipe: "dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml:zip_override_mtp_maxtpt[1]" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp4/1k1k.yaml - - "CONFIG_FILE=recipes/b200-fp4/1k1k.yaml:zip_override_mtp_maxtpt[1]" decode: num-worker: 2 tp: 8 @@ -7251,14 +6579,12 @@ dsr1-fp4-b200-dynamo-sglang-mtp: search-space: - spec-decoding: "mtp" conc-list: [64, 128] + recipe: 
"dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml:zip_override_mtp_lowlat[0]" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp4/8k1k.yaml - - "CONFIG_FILE=recipes/b200-fp4/8k1k.yaml:zip_override_mtp_lowlat[0]" decode: num-worker: 1 tp: 8 @@ -7266,14 +6592,12 @@ dsr1-fp4-b200-dynamo-sglang-mtp: dp-attn: false - spec-decoding: "mtp" conc-list: [8] + recipe: "dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml:zip_override_mtp_lowlat[1]" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp4/8k1k.yaml - - "CONFIG_FILE=recipes/b200-fp4/8k1k.yaml:zip_override_mtp_lowlat[1]" decode: num-worker: 5 tp: 8 @@ -7281,14 +6605,12 @@ dsr1-fp4-b200-dynamo-sglang-mtp: dp-attn: false - spec-decoding: "mtp" conc-list: [4, 128] + recipe: "dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml:zip_override_mtp_lowlat[2]" prefill: num-worker: 2 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp4/8k1k.yaml - - "CONFIG_FILE=recipes/b200-fp4/8k1k.yaml:zip_override_mtp_lowlat[2]" decode: num-worker: 5 tp: 8 @@ -7296,14 +6618,12 @@ dsr1-fp4-b200-dynamo-sglang-mtp: dp-attn: false - spec-decoding: "mtp" conc-list: [4, 8, 16, 64] + recipe: "dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml:override_mtp_tp4" prefill: num-worker: 1 tp: 4 ep: 1 dp-attn: false - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/b200-fp4/8k1k.yaml - - "CONFIG_FILE=recipes/b200-fp4/8k1k.yaml:override_mtp_tp4" decode: num-worker: 1 tp: 8 @@ -7325,98 +6645,84 @@ kimik2.5-fp4-gb200-dynamo-trt: search-space: # Non-MTP configurations (default spec_decoding="none") - conc-list: [ 4, 192, 360, 668 ] + recipe: "kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 
4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml" decode: num-worker: 4 tp: 8 ep: 8 dp-attn: false - conc-list: [ 5, 15, 30, 55 ] + recipe: "kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 4 ep: 4 dp-attn: false - conc-list: [ 666 ] + recipe: "kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [ 2253 ] + recipe: "kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml - - 
"CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 32 ep: 32 dp-attn: true - conc-list: [ 4301, 6452 ] + recipe: "kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [ 4301 ] + recipe: "kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [ 4301 ] + recipe: "kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 32 @@ -7428,98 +6734,84 @@ kimik2.5-fp4-gb200-dynamo-trt: search-space: # Non-MTP configurations (default spec_decoding="none") - conc-list: [ 4 ] + recipe: 
"kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml" decode: num-worker: 4 tp: 8 ep: 8 dp-attn: false - conc-list: [ 156 ] + recipe: "kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml" decode: num-worker: 4 tp: 4 ep: 4 dp-attn: false - conc-list: [ 5, 15, 30, 60, 105 ] + recipe: "kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml" decode: num-worker: 5 tp: 4 ep: 4 dp-attn: false - conc-list: [ 333 ] + recipe: "kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml" prefill: num-worker: 2 tp: 4 ep: 4 dp-attn: true - additional-settings: - # 
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [ 615 ] + recipe: "kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml" prefill: num-worker: 3 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [ 2151 ] + recipe: "kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml" prefill: num-worker: 5 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [ 2253 ] + recipe: "kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml" prefill: num-worker: 7 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml - - "CONFIG_FILE=recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml" decode: num-worker: 1 tp: 16 @@ -7540,28 +6832,24 @@ 
kimik2.5-fp4-gb200-dynamo-vllm: osl: 1024 search-space: - conc-list: [256, 512, 1024, 2048, 3072, 4096] + recipe: "kimik2.5/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p1d-dep4-dep16.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p1d-dep4-dep16.yaml - - "CONFIG_FILE=recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p1d-dep4-dep16.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [4, 8, 16, 32, 64, 128] + recipe: "kimik2.5/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p4d-dep4-tep4.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p4d-dep4-tep4.yaml - - "CONFIG_FILE=recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p4d-dep4-tep4.yaml" decode: num-worker: 4 tp: 4 @@ -7571,56 +6859,48 @@ kimik2.5-fp4-gb200-dynamo-vllm: osl: 1024 search-space: - conc-list: [4, 8, 16, 32, 128] + recipe: "kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-1p4d-dep4-tep4.yaml" prefill: num-worker: 1 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-1p4d-dep4-tep4.yaml - - "CONFIG_FILE=recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-1p4d-dep4-tep4.yaml" decode: num-worker: 4 tp: 4 ep: 4 dp-attn: false - conc-list: [512, 1024] + recipe: "kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep4-dep16.yaml" prefill: num-worker: 3 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-3p1d-dep4-dep16.yaml - - "CONFIG_FILE=recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-3p1d-dep4-dep16.yaml" decode: num-worker: 1 tp: 16 ep: 16 dp-attn: true - conc-list: [2048] + recipe: 
"kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-5p1d-dep4-dep8.yaml" prefill: num-worker: 5 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-5p1d-dep4-dep8.yaml - - "CONFIG_FILE=recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-5p1d-dep4-dep8.yaml" decode: num-worker: 1 tp: 8 ep: 8 dp-attn: true - conc-list: [3072, 4096] + recipe: "kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-6p1d-dep4-dep16.yaml" prefill: num-worker: 6 tp: 4 ep: 4 dp-attn: true - additional-settings: - # https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-6p1d-dep4-dep16.yaml - - "CONFIG_FILE=recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-6p1d-dep4-dep16.yaml" decode: num-worker: 1 tp: 16 @@ -7647,13 +6927,12 @@ dsv4-fp4-gb200-dynamo-vllm: # Low latency: 1 prefill (DEP=8) + 1 decode (TP=8). 5 nodes total with # a dedicated NATS/etcd infra node. - conc-list: [1] + recipe: "dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-latency.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-latency.yaml" decode: num-worker: 1 tp: 8 @@ -7663,13 +6942,12 @@ dsv4-fp4-gb200-dynamo-vllm: # Low-middle curve: 1 prefill (DEP=8) + 4 decode (TP=8). 11 nodes total # with a dedicated NATS/etcd infra node. - conc-list: [256, 512] + recipe: "dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-middle-curve.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-middle-curve.yaml" decode: num-worker: 4 tp: 8 @@ -7679,13 +6957,12 @@ dsv4-fp4-gb200-dynamo-vllm: # Mid curve: 1 prefill (DEP=8) + 1 decode (DEP=8). 5 nodes total with # a dedicated NATS/etcd infra node. 
- conc-list: [256] + recipe: "dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-mid-curve.yaml" prefill: num-worker: 1 tp: 8 ep: 8 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/vllm/deepseek-v4/8k1k/disagg-gb200-mid-curve.yaml" decode: num-worker: 1 tp: 8 @@ -7695,13 +6972,12 @@ dsv4-fp4-gb200-dynamo-vllm: # Max throughput: 3 prefill (DEP=8 each) + 1 decode (DEP=8). 9 nodes # total with a dedicated NATS/etcd infra node. - conc-list: [4096] + recipe: "dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/vllm/deepseek-v4/8k1k/disagg-gb200-max-tpt.yaml" decode: num-worker: 1 tp: 8 @@ -7711,13 +6987,12 @@ dsv4-fp4-gb200-dynamo-vllm: # MegaMOE max throughput: same 3 prefill (DEP=8 each) + 1 decode (DEP=8) # shape, but uses deep_gemm_mega_moe on both workers and disables offload. - conc-list: [4096] + recipe: "dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt-megamoe.yaml" prefill: num-worker: 3 tp: 8 ep: 8 dp-attn: true - additional-settings: - - "CONFIG_FILE=recipes/vllm/deepseek-v4/8k1k/disagg-gb200-max-tpt-megamoe.yaml" decode: num-worker: 1 tp: 8 diff --git a/.github/workflows/benchmark-multinode-tmpl.yml b/.github/workflows/benchmark-multinode-tmpl.yml index 75036a986..a8005096b 100644 --- a/.github/workflows/benchmark-multinode-tmpl.yml +++ b/.github/workflows/benchmark-multinode-tmpl.yml @@ -77,6 +77,11 @@ on: required: false type: string default: "[]" + recipe: + description: "Path under benchmarks/multi_node/srt-slurm-recipes/ identifying the srt-slurm recipe to dispatch. May carry an `:override[N]` suffix. Empty for non-srt-slurm multi-node configs." 
+ required: false + type: string + default: "" run-eval: type: boolean required: false @@ -165,6 +170,7 @@ jobs: env: RUNNER_NAME: ${{ runner.name }} RUNNER_TYPE: ${{ inputs.runner }} + RECIPE: ${{ inputs.recipe }} # Hash uniquely on {EXP_NAME}_{PRECISION}_{FRAMEWORK}_prefill-tp{}-ep{}-dp{}-nw{}_decode-tp{}-ep{}-dp{}-nw{}_disagg-{}_spec-{}_conc{}_{runner} RESULT_FILENAME: ${{ env.EXP_NAME }}_${{ env.PRECISION }}_${{ env.FRAMEWORK }}_prefill-tp${{ env.PREFILL_TP }}-ep${{ env.PREFILL_EP }}-dp${{ env.PREFILL_DP_ATTN }}-nw${{ env.PREFILL_NUM_WORKERS }}_decode-tp${{ env.DECODE_TP }}-ep${{ env.DECODE_EP }}-dp${{ env.DECODE_DP_ATTN }}-nw${{ env.DECODE_NUM_WORKERS }}_disagg-${{ env.DISAGG }}_spec-${{ env.SPEC_DECODING }}_conc${{ join(fromJson(inputs.conc-list), 'x') }}_${{ runner.name }} run: | @@ -173,6 +179,15 @@ jobs: echo "RESULT_FILENAME=${RESULT_FILENAME}" >> $GITHUB_ENV export ${{ join(fromJson(inputs.prefill-additional-settings), ' ') }} ${{ join(fromJson(inputs.decode-additional-settings), ' ') }} + # RECIPE = "<path>[:override[N]]" relative to benchmarks/multi_node/srt-slurm-recipes/. + # Copy the file to scratch so the launcher's `sed -i` rewrites don't mutate the + # tracked recipe between concurrent runs; preserve any :override suffix verbatim.
+ if [[ -n "$RECIPE" ]]; then + src="${GITHUB_WORKSPACE}/benchmarks/multi_node/srt-slurm-recipes/${RECIPE%%:*}" + scratch="$(mktemp -d)/$(basename "${RECIPE%%:*}")" + cp "$src" "$scratch" + export CONFIG_FILE="${scratch}${RECIPE#"${RECIPE%%:*}"}" + fi export IS_MULTINODE=true bash ./runners/launch_${RUNNER_NAME%%_*}.sh if [ "${{ inputs.eval-only }}" = "true" ]; then diff --git a/.github/workflows/e2e-tests.yml b/.github/workflows/e2e-tests.yml index 74d4889f3..f8961f7b4 100644 --- a/.github/workflows/e2e-tests.yml +++ b/.github/workflows/e2e-tests.yml @@ -102,6 +102,7 @@ jobs: decode-ep: ${{ matrix.config.decode.ep }} decode-dp-attn: ${{ matrix.config.decode.dp-attn }} decode-additional-settings: ${{ toJson(matrix.config.decode.additional-settings) }} + recipe: ${{ matrix.config.recipe }} run-eval: false ref: ${{ inputs.ref }} @@ -141,6 +142,7 @@ jobs: decode-ep: ${{ matrix.config.decode.ep }} decode-dp-attn: ${{ matrix.config.decode.dp-attn }} decode-additional-settings: ${{ toJson(matrix.config.decode.additional-settings) }} + recipe: ${{ matrix.config.recipe }} run-eval: true eval-only: true eval-conc: ${{ matrix.config.eval-conc }} diff --git a/.github/workflows/run-sweep.yml b/.github/workflows/run-sweep.yml index fd1fa91be..4dea7065a 100644 --- a/.github/workflows/run-sweep.yml +++ b/.github/workflows/run-sweep.yml @@ -138,6 +138,7 @@ jobs: decode-ep: ${{ matrix.config.decode.ep }} decode-dp-attn: ${{ matrix.config.decode.dp-attn }} decode-additional-settings: ${{ toJson(matrix.config.decode.additional-settings) }} + recipe: ${{ matrix.config.recipe }} run-eval: false sweep-multi-node-8k1k: @@ -257,6 +258,7 @@ jobs: decode-ep: ${{ matrix.config.decode.ep }} decode-dp-attn: ${{ matrix.config.decode.dp-attn }} decode-additional-settings: ${{ toJson(matrix.config.decode.additional-settings) }} + recipe: ${{ matrix.config.recipe }} run-eval: true eval-only: true eval-conc: ${{ matrix.config.eval-conc }} diff --git a/benchmarks/benchmark_lib.sh 
b/benchmarks/benchmark_lib.sh index 268745735..e1d94b1a6 100644 --- a/benchmarks/benchmark_lib.sh +++ b/benchmarks/benchmark_lib.sh @@ -206,6 +206,13 @@ run_benchmark_serving() { local dsv4=false local trust_remote_code=false local server_pid="" + # Optional --tokenizer / --endpoint pass-throughs for the multi-node + # srt_bench.sh. --tokenizer points the bench at the /model auto-mount + # (avoids relying on --model being a HF-resolvable id). --endpoint lets + # recipes target /v1/chat/completions when chat-template-only request + # paths are required. + local tokenizer="" + local endpoint="" while [[ $# -gt 0 ]]; do case $1 in @@ -270,6 +277,14 @@ run_benchmark_serving() { server_pid="$2" shift 2 ;; + --tokenizer) + tokenizer="$2" + shift 2 + ;; + --endpoint) + endpoint="$2" + shift 2 + ;; *) echo "Unknown parameter: $1" return 1 @@ -356,7 +371,15 @@ run_benchmark_serving() { --result-dir "$result_dir" --result-filename "$result_filename.json" ) - + + # Optional pass-throughs. + if [[ -n "$tokenizer" ]]; then + benchmark_cmd+=(--tokenizer "$tokenizer") + fi + if [[ -n "$endpoint" ]]; then + benchmark_cmd+=(--endpoint "$endpoint") + fi + # Add --use-chat-template if requested if [[ "$use_chat_template" == true ]]; then benchmark_cmd+=(--use-chat-template) @@ -862,3 +885,72 @@ run_eval() { fi return $eval_rc } + +# -------------------------------- +# Container helpers +# -------------------------------- + +# Sanitize a container image reference (e.g. "lmsysorg/sglang:v0.5.8-cu130") +# into a filename-safe slug by replacing /, :, @, # with the chosen separator. +# Defaults to '_' (most clusters); pass '+' for clusters that adopted that +# convention for their squash-file directory. +sanitize_image_filename() { + local image="$1" + local sep="${2:-_}" + echo "$image" | sed "s|[/:@#]|${sep}|g" +} + +# -------------------------------- +# srt-slurm helpers +# -------------------------------- + +# Clone srt-slurm and install `srtctl` into a uv venv. 
After this returns +# successfully, cwd is the cloned repo and the venv is active. Idempotent on +# uv: skips re-curl if the binary is already present at $UV_INSTALL_DIR. +# +# The srt-slurm commit is pinned (not env-var overridable) so every benchmark +# run uses the exact same srtctl. To bump it, edit the `ref=` line below. +# +# All other inputs are env vars (set before calling); all are optional: +# SRT_REPO_DIR default srt-slurm (relative to current cwd) +# UV_INSTALL_DIR default $HOME/.local/bin (uv's own default) +# UV_VENV_DIR default .venv (inside the cloned repo) +clone_and_install_srtctl() { + local repo_url="https://github.com/NVIDIA/srt-slurm.git" + # Pinned to NVIDIA/srt-slurm@main — currently 1372a10. Includes: + # * #110 nginx-rework-ulimit: gates `ulimit -n 1048576` + worker_rlimit_nofile + # behind opt-in `frontend.nginx_raise_ulimit` (we don't opt in). + # * #111 srun command line log demoted INFO -> DEBUG (5KB fingerprint + # heredoc no longer dominates orchestrator log). + local ref="1372a10c493e3fd757f342d8516a5a91c30fe6ce" + local repo_dir="${SRT_REPO_DIR:-srt-slurm}" + local uv_install_dir="${UV_INSTALL_DIR:-${HOME}/.local/bin}" + local uv_venv_dir="${UV_VENV_DIR:-.venv}" + + echo "Cloning ${repo_url}@${ref} into ${repo_dir}..." + rm -rf "$repo_dir" + git clone "$repo_url" "$repo_dir" + cd "$repo_dir" || return 1 + git checkout "$ref" + + echo "Installing uv + srtctl into venv at ${uv_venv_dir}..." + export UV_INSTALL_DIR="$uv_install_dir" + mkdir -p "$uv_install_dir" + if ! [ -x "$uv_install_dir/uv" ]; then + curl -LsSf https://astral.sh/uv/install.sh | sh + fi + export PATH="$uv_install_dir:$PATH" + # uv's installer drops an `env` script next to the binary; source it so + # PATH/PS1 changes pick up in shells that don't re-read the env. + [ -f "$uv_install_dir/env" ] && source "$uv_install_dir/env" + + uv venv "$uv_venv_dir" + # shellcheck disable=SC1091 + source "$uv_venv_dir/bin/activate" + uv pip install -e . + + if ! 
command -v srtctl &> /dev/null; then + echo "Error: Failed to install srtctl" >&2 + return 1 + fi +} diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml new file mode 100644 index 000000000..b08193bcb --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp4/1k1k/disagg/1k1k.yaml @@ -0,0 +1,259 @@ +# B200-FP4 1k1k — STP and MTP in one file +# +# Two inference modes distinguished by override key names: +# zip_override_stp_* — standard token prediction (no speculative decoding) +# zip_override_mtp_* — multi-token prediction (EAGLE speculative decoding) +# +# Low-latency variants: tep8 decode (DP=1), dep4 prefill (DP=4 TP=4) +# Max-throughput variants: dep8 decode (DP=8), adds SGLANG_MOE_NVFP4_DISPATCH +# +# Note: max-tpt 1d has max-running-requests=1024; max-tpt 2d keeps 512. +# MTP max-tpt 1d additionally uses mem-fraction=0.75 for decode. 
+# +# Usage: +# srtctl apply -f recipes/b200-fp4/1k1k.yaml # all 8 variants +# srtctl apply -f recipes/b200-fp4/1k1k.yaml:*stp* # all STP variants +# srtctl apply -f recipes/b200-fp4/1k1k.yaml:*mtp* # all MTP variants +# srtctl apply -f recipes/b200-fp4/1k1k.yaml:zip_override_stp_lowlat[0] # STP 1p5d only +# srtctl dry-run -f recipes/b200-fp4/1k1k.yaml # preview + +base: + name: "b200-fp4-stp-1k1k" + + model: + path: "dsr1" + container: "dynamo-sglang" + precision: "fp4" + + resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + decode_nodes: 5 + decode_workers: 5 + gpus_per_node: 8 + + backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + DYN_REQUEST_PLANE: nats + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + DYN_REQUEST_PLANE: nats + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + 
trust-remote-code: true + quantization: "modelopt_fp4" + + # Disaggregation mode + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.85 + max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + context-length: 2200 + max-running-requests: 512 + disable-cuda-graph: true + + # Parallelism + tensor-parallel-size: 4 + data-parallel-size: 4 + expert-parallel-size: 4 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # MoE + moe-runner-backend: "flashinfer_trtllm" + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + quantization: "modelopt_fp4" + + # Disaggregation mode + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.85 + max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + context-length: 2200 + max-running-requests: 512 + cuda-graph-max-bs: 512 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 8 + + # Attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # MoE + moe-runner-backend: "flashinfer_trtllm" + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + health_check: + max_attempts: 360 + interval_seconds: 10 + + benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + req_rate: "inf" + + +# STP low-latency: tep8 decode (DP=1), scale sweep 1p5d and 1p6d +zip_override_stp_lowlat: + name: + - "b200-fp4-stp-low-latency-dep4-1p-tep8-5d" + - "b200-fp4-stp-low-latency-dep4-1p-tep8-6d" + resources: + decode_nodes: [5, 6] + decode_workers: [5, 6] + benchmark: + 
concurrencies: ["16x128", "32x64x256"] + + +# MTP low-latency: same scales as STP, adds EAGLE speculative decoding + fp4-gemm-backend +zip_override_mtp_lowlat: + name: + - "b200-fp4-mtp-low-latency-dep4-1p-tep8-5d" + - "b200-fp4-mtp-low-latency-dep4-1p-tep8-6d" + resources: + decode_nodes: [5, 6] + decode_workers: [5, 6] + backend: + prefill_environment: + SGLANG_ENABLE_SPEC_V2: "1" + decode_environment: + SGLANG_ENABLE_SPEC_V2: "1" + sglang_config: + prefill: + fp4-gemm-backend: "flashinfer_trtllm" + decode: + fp4-gemm-backend: "flashinfer_trtllm" + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + benchmark: + concurrencies: ["16x512", "32x64x256x512"] + + +# STP max-throughput: dep8 decode (DP=8), scale sweep 1p1d and 1p2d +# Adds SGLANG_MOE_NVFP4_DISPATCH + SGLANG_FLASHINFER_FP4_GEMM_BACKEND env vars +# 1d: max-running-requests=1024; 2d: keeps 512 +zip_override_stp_maxtpt: + name: + - "b200-fp4-stp-max-tpt-dep4-1p-dep8-1d" + - "b200-fp4-stp-max-tpt-dep4-1p-dep8-2d" + resources: + decode_nodes: [1, 2] + decode_workers: [1, 2] + backend: + decode_environment: + SGLANG_MOE_NVFP4_DISPATCH: "1" + SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass" + sglang_config: + prefill: + max-running-requests: [1024, 512] + decode: + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + max-running-requests: [1024, 512] + cuda-graph-max-bs: [1024, 512] + benchmark: + concurrencies: ["512", "512"] + + +# MTP max-throughput: dep8 decode, scale sweep 1p1d and 1p2d, adds EAGLE speculative decoding +# Adds SGLANG_MOE_NVFP4_DISPATCH + SGLANG_FLASHINFER_FP4_GEMM_BACKEND + fp4-gemm-backend +# 1d: max-running-requests=1024, mem-fraction=0.75 for decode; 2d: keeps 512/0.85 +zip_override_mtp_maxtpt: + name: + - "b200-fp4-mtp-max-tpt-dep4-1p-dep8-1d" + - "b200-fp4-mtp-max-tpt-dep4-1p-dep8-2d" + resources: + decode_nodes: [1, 2] + decode_workers: [1, 2] + backend: + 
prefill_environment: + SGLANG_ENABLE_SPEC_V2: "1" + decode_environment: + SGLANG_MOE_NVFP4_DISPATCH: "1" + SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass" + SGLANG_ENABLE_SPEC_V2: "1" + sglang_config: + prefill: + fp4-gemm-backend: "flashinfer_trtllm" + max-running-requests: [1024, 512] + decode: + fp4-gemm-backend: "flashinfer_trtllm" + mem-fraction-static: [0.75, 0.85] + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + max-running-requests: [1024, 512] + cuda-graph-max-bs: [1024, 512] + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + benchmark: + concurrencies: ["512x1024", "512"] diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml new file mode 100644 index 000000000..f5bfc9641 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp4/8k1k/disagg/8k1k.yaml @@ -0,0 +1,351 @@ +# B200-FP4 8k1k — STP and MTP in one file +# +# Three modes distinguished by override key names: +# override_stp_tp4 / override_mtp_tp4: TP4 prefill (DP=1, EP=1) — low-latency single-node +# zip_override_stp_lowlat / zip_override_mtp_lowlat: dep4 prefill + tep8 decode (DP=1) +# override_stp_maxtpt_7p2d / override_mtp_maxtpt_7p2d: dep4 prefill + dep8 decode, 7p2d +# override_mtp_maxtpt_4p1d: MTP-only 4p1d, no frontends, env-var FP4 backend +# +# Usage: +# srtctl apply -f recipes/b200-fp4/8k1k.yaml # all 11 variants +# srtctl apply -f recipes/b200-fp4/8k1k.yaml:*stp* # all STP variants +# srtctl apply -f recipes/b200-fp4/8k1k.yaml:*mtp* # all MTP variants +# srtctl apply -f recipes/b200-fp4/8k1k.yaml:override_stp_tp4 # STP tp4 only +# srtctl apply -f recipes/b200-fp4/8k1k.yaml:zip_override_stp_lowlat[0] # STP 1p1d only +# srtctl dry-run -f recipes/b200-fp4/8k1k.yaml # preview + +base: + name: "b200-fp4-stp-8k1k" + + 
dynamo: + version: 0.8.1 + + model: + path: "dsr1" + container: "dynamo-sglang" + precision: "fp4" + + frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 4 + + resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + decode_nodes: 1 + decode_workers: 1 + gpus_per_node: 8 + + backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + DYN_REQUEST_PLANE: nats + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + DYN_REQUEST_PLANE: nats + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + quantization: "modelopt_fp4" + + # Disaggregation mode + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.85 + max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + context-length: 9600 + 
max-running-requests: 512 + disable-cuda-graph: true + + # Parallelism + tensor-parallel-size: 4 + data-parallel-size: 4 + expert-parallel-size: 4 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # MoE + moe-runner-backend: "flashinfer_trtllm" + moe-dense-tp-size: 1 + fp4-gemm-backend: "flashinfer_trtllm" + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + quantization: "modelopt_fp4" + + # Disaggregation mode + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.85 + max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + context-length: 9600 + max-running-requests: 512 + cuda-graph-max-bs: 512 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 8 + + # Attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # MoE + moe-runner-backend: "flashinfer_trtllm" + fp4-gemm-backend: "flashinfer_trtllm" + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + health_check: + max_attempts: 360 + interval_seconds: 10 + + benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + req_rate: "inf" + + +# STP TP4 prefill mode: TP4 (DP=1, EP=1) instead of dep4 — low-latency single-node +override_stp_tp4: + name: "b200-fp4-stp-low-latency-tp4-1p-tp8-1d" + frontend: + num_additional_frontends: 2 + backend: + sglang_config: + prefill: + data-parallel-size: 1 + expert-parallel-size: 1 + enable-dp-attention: null + enable-dp-lm-head: null + decode: + expert-parallel-size: 1 + benchmark: + concurrencies: "4x8x16x64" + + +# MTP TP4 prefill mode: same as STP tp4 but adds EAGLE speculative 
decoding +override_mtp_tp4: + name: "b200-fp4-mtp-low-latency-tp4-1p-tp8-1d" + frontend: + num_additional_frontends: 2 + backend: + prefill_environment: + SGLANG_ENABLE_SPEC_V2: "1" + decode_environment: + SGLANG_ENABLE_SPEC_V2: "1" + sglang_config: + prefill: + data-parallel-size: 1 + expert-parallel-size: 1 + enable-dp-attention: null + enable-dp-lm-head: null + decode: + expert-parallel-size: 1 + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + benchmark: + concurrencies: "4x8x16x64" + + +# STP low-latency: dep4 prefill + tep8 decode (DP=1), scale sweep 1p1d/1p5d/2p5d +zip_override_stp_lowlat: + name: + - "b200-fp4-stp-low-latency-dep4-1p-tep8-1d" + - "b200-fp4-stp-low-latency-dep4-1p-tep8-5d" + - "b200-fp4-stp-low-latency-dep4-2p-tep8-5d" + resources: + prefill_nodes: [1, 1, 2] + prefill_workers: [1, 1, 2] + decode_nodes: [1, 5, 5] + decode_workers: [1, 5, 5] + benchmark: + concurrencies: ["64x128", "8", "4x128"] + + +# MTP low-latency: same scales as STP, adds EAGLE speculative decoding +zip_override_mtp_lowlat: + name: + - "b200-fp4-mtp-low-latency-dep4-1p-tep8-1d" + - "b200-fp4-mtp-low-latency-dep4-1p-tep8-5d" + - "b200-fp4-mtp-low-latency-dep4-2p-tep8-5d" + resources: + prefill_nodes: [1, 1, 2] + prefill_workers: [1, 1, 2] + decode_nodes: [1, 5, 5] + decode_workers: [1, 5, 5] + backend: + prefill_environment: + SGLANG_ENABLE_SPEC_V2: "1" + decode_environment: + SGLANG_ENABLE_SPEC_V2: "1" + sglang_config: + decode: + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + benchmark: + concurrencies: ["64x128", "8", "4x128"] + + +# STP max-throughput 7p2d: dep4 prefill + dep8 decode, flashinfer_cutlass backend +override_stp_maxtpt_7p2d: + name: "b200-fp4-stp-max-tpt-dep4-7p-dep8-2d" + resources: + prefill_nodes: 7 + prefill_workers: 7 + decode_nodes: 2 + decode_workers: 2 + backend: + decode_environment: + 
SGLANG_MOE_NVFP4_DISPATCH: "1" + sglang_config: + prefill: + max-prefill-tokens: 65536 + chunked-prefill-size: 65536 + max-running-requests: 1024 + fp4-gemm-backend: "flashinfer_cutlass" + decode: + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + max-running-requests: 2048 + cuda-graph-max-bs: 1024 + fp4-gemm-backend: "flashinfer_cutlass" + benchmark: + concurrencies: "1024x2048" + + +# MTP max-throughput 7p2d: same as STP but adds EAGLE speculative decoding +override_mtp_maxtpt_7p2d: + name: "b200-fp4-mtp-max-tpt-dep4-7p-dep8-2d" + resources: + prefill_nodes: 7 + prefill_workers: 7 + decode_nodes: 2 + decode_workers: 2 + backend: + prefill_environment: + SGLANG_ENABLE_SPEC_V2: "1" + decode_environment: + SGLANG_MOE_NVFP4_DISPATCH: "1" + SGLANG_ENABLE_SPEC_V2: "1" + sglang_config: + prefill: + max-prefill-tokens: 65536 + chunked-prefill-size: 65536 + max-running-requests: 1024 + fp4-gemm-backend: "flashinfer_cutlass" + decode: + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + max-running-requests: 2048 + cuda-graph-max-bs: 1024 + fp4-gemm-backend: "flashinfer_cutlass" + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + benchmark: + concurrencies: "1024x2048" + + +# MTP-only: 4p1d, no frontends, SGLANG_FLASHINFER_FP4_GEMM_BACKEND env var (fp4-gemm-backend: null +# removes the sglang_config key), mem-fraction=0.75 for decode +override_mtp_maxtpt_4p1d: + name: "b200-fp4-mtp-max-tpt-dep4-4p-dep8-1d" + dynamo: null + frontend: null + resources: + prefill_nodes: 4 + prefill_workers: 4 + decode_nodes: 1 + decode_workers: 1 + backend: + prefill_environment: + SGLANG_ENABLE_SPEC_V2: "1" + decode_environment: + SGLANG_MOE_NVFP4_DISPATCH: "1" + SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass" + SGLANG_ENABLE_SPEC_V2: "1" + sglang_config: + prefill: + max-running-requests: 1024 + fp4-gemm-backend: null + 
decode: + mem-fraction-static: 0.75 + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + max-running-requests: 1024 + cuda-graph-max-bs: 1024 + fp4-gemm-backend: null + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + benchmark: + concurrencies: "1024" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml new file mode 100644 index 000000000..7489586aa --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/1k1k/disagg/1k1k.yaml @@ -0,0 +1,281 @@ +# B200-FP8 1k1k — STP and MTP in one file +# +# Two inference modes distinguished by override key names: +# zip_override_stp_* — standard token prediction (no speculative decoding) +# zip_override_mtp_* — multi-token prediction (EAGLE speculative decoding) +# +# Low-latency variants: tep8 decode (DP=1) +# Max-throughput variants: dep8 decode (DP=8) +# +# Usage: +# srtctl apply -f recipes/b200-fp8/1k1k.yaml # all 10 variants +# srtctl apply -f recipes/b200-fp8/1k1k.yaml:*stp* # all STP variants +# srtctl apply -f recipes/b200-fp8/1k1k.yaml:*mtp* # all MTP variants +# srtctl apply -f recipes/b200-fp8/1k1k.yaml:zip_override_stp_lowlat[0] # STP 1p1d only +# srtctl dry-run -f recipes/b200-fp8/1k1k.yaml # preview + +base: + name: "b200-fp8-stp-1k1k" + + model: + path: "dsr1-fp8" + container: "dynamo-sglang" + precision: "fp8" + + resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 1 + decode_workers: 1 + gpus_per_node: 8 + + backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + 
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + DYN_REQUEST_PLANE: nats + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + DYN_REQUEST_PLANE: nats + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + quantization: "fp8" + + # Disaggregation mode + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.85 + max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + context-length: 2200 + max-running-requests: 512 + disable-cuda-graph: true + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 8 + + # Attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # MoE + moe-runner-backend: "flashinfer_trtllm" + # moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + quantization: "fp8" + + # Disaggregation mode + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.85 + 
max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + context-length: 2200 + max-running-requests: 512 + cuda-graph-max-bs: 512 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 8 + + # Attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # MoE + moe-runner-backend: "flashinfer_trtllm" + # moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + # disable-chunked-prefix-cache: true + + health_check: + max_attempts: 360 + interval_seconds: 10 + + benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + req_rate: "inf" + + +# STP low-latency: tep8 decode (DP=1), scale sweep 1p1d and 1p3d +zip_override_stp_lowlat: + name: + - "b200-fp8-stp-low-latency-tep8-1p-1d" + - "b200-fp8-stp-low-latency-tep8-1p-3d" + resources: + decode_nodes: [1, 3] + decode_workers: [1, 3] + benchmark: + concurrencies: ["4", "16x32x64x128x256"] + + +# MTP low-latency: same scales as STP, adds EAGLE speculative decoding +zip_override_mtp_lowlat: + name: + - "b200-fp8-mtp-low-latency-tep8-1p-1d" + - "b200-fp8-mtp-low-latency-tep8-1p-3d" + resources: + decode_nodes: [1, 3] + decode_workers: [1, 3] + backend: + prefill_environment: + SGLANG_ENABLE_SPEC_V2: "1" + decode_environment: + SGLANG_ENABLE_SPEC_V2: "1" + sglang_config: + prefill: + moe-dense-tp-size: 1 + decode: + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + benchmark: + concurrencies: ["4x64", "4x8x16x32x128"] + + +# STP max-throughput: dep8 decode (DP=8), scale sweep 1p5d and 2p5d +zip_override_stp_maxtpt: + name: + - "b200-fp8-stp-max-tpt-dep8-1p-5d" + - "b200-fp8-stp-max-tpt-dep8-2p-5d" + resources: + prefill_nodes: [1, 2] + prefill_workers: [1, 2] + decode_nodes: [5, 5] + decode_workers: [5, 5] + backend: + sglang_config: + prefill: + data-parallel-size: 8 + enable-dp-attention: true + 
enable-dp-lm-head: true + moe-dense-tp-size: 1 + max-running-requests: 1024 + decode: + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + max-running-requests: 1024 + cuda-graph-max-bs: 1024 + benchmark: + concurrencies: ["1024", "2048"] + + +# MTP max-throughput: dep8 decode, scale sweep 1p1d/1p5d/2p5d, adds EAGLE speculative decoding +# Note: max-running-requests stays at 512 for MTP (unlike STP which raises to 1024) +zip_override_mtp_maxtpt: + name: + - "b200-fp8-mtp-max-tpt-dep8-1p-1d" + - "b200-fp8-mtp-max-tpt-dep8-1p-5d" + - "b200-fp8-mtp-max-tpt-dep8-2p-5d" + resources: + prefill_nodes: [1, 1, 2] + prefill_workers: [1, 1, 2] + decode_nodes: [1, 5, 5] + decode_workers: [1, 5, 5] + backend: + prefill_environment: + SGLANG_ENABLE_SPEC_V2: "1" + decode_environment: + SGLANG_ENABLE_SPEC_V2: "1" + sglang_config: + prefill: + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + decode: + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + benchmark: + concurrencies: ["512x1024x2048x4096", "512x4096", "1024x2048x4096"] + + +# MTP special case: 1p2d uses speculative-num-steps=1 and draft-tokens=2 (vs 2/3 for all others) +override_mtp_maxtpt_1p2d: + name: "b200-fp8-mtp-max-tpt-dep8-1p-2d" + resources: + decode_nodes: 2 + decode_workers: 2 + backend: + prefill_environment: + SGLANG_ENABLE_SPEC_V2: "1" + decode_environment: + SGLANG_ENABLE_SPEC_V2: "1" + sglang_config: + prefill: + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + decode: + data-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + speculative-algorithm: "EAGLE" + speculative-num-steps: 1 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 
2 + benchmark: + concurrencies: "512x1024x2048" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_0.yaml new file mode 100644 index 000000000..36b78e975 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_0.yaml @@ -0,0 +1,148 @@ +name: b200-fp8-mtp-low-latency-tep8-1p-3d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 3 + decode_workers: 3 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' +
SGLANG_ENABLE_SPEC_V2: '1' + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 32 + cuda-graph-max-bs: 32 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 1 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + speculative-algorithm: EAGLE + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 +health_check: + max_attempts: 720 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_1.yaml new file mode 100644 index 000000000..0fed3f9a6 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_1.yaml @@ -0,0 +1,148 @@ +name: b200-fp8-mtp-low-latency-tep8-1p-4d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 4 + decode_workers: 4 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + 
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 32 + cuda-graph-max-bs: 32 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 1 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + speculative-algorithm: EAGLE + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 +health_check: + max_attempts: 720 + 
interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_2.yaml new file mode 100644 index 000000000..e39611a4b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_lowlat_2.yaml @@ -0,0 +1,148 @@ +name: b200-fp8-mtp-low-latency-tep8-1p-6d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 6 + decode_workers: 6 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + 
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 22 + cuda-graph-max-bs: 22 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 1 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + 
disable-radix-cache: true + speculative-algorithm: EAGLE + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 +health_check: + max_attempts: 720 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_0.yaml new file mode 100644 index 000000000..78dc57d5a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_0.yaml @@ -0,0 +1,151 @@ +name: b200-fp8-mtp-max-tpt-dep8-1p-2d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 2 + decode_workers: 2 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + 
PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 128 + cuda-graph-max-bs: 16 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 8 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # 
MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + speculative-algorithm: EAGLE + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 +health_check: + max_attempts: 720 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_1.yaml new file mode 100644 index 000000000..202a10631 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_1.yaml @@ -0,0 +1,151 @@ +name: b200-fp8-mtp-max-tpt-dep8-1p-1d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 1 + decode_workers: 1 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + 
NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + 
context-length: 9600 + max-running-requests: 256 + cuda-graph-max-bs: 32 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 8 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + speculative-algorithm: EAGLE + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 +health_check: + max_attempts: 720 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_2.yaml new file mode 100644 index 000000000..e2a619e29 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_2.yaml @@ -0,0 +1,151 @@ +name: b200-fp8-mtp-max-tpt-dep8-2p-1d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 2 + prefill_workers: 2 + decode_nodes: 1 + decode_workers: 1 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + 
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model 
configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 512 + cuda-graph-max-bs: 64 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 8 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + speculative-algorithm: EAGLE + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 +health_check: + max_attempts: 720 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_3.yaml new file mode 100644 index 000000000..5e959ca38 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/8k1k_mtp_maxtpt_3.yaml @@ -0,0 +1,151 @@ +name: b200-fp8-mtp-max-tpt-dep8-3p-1d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 3 + prefill_workers: 3 + decode_nodes: 1 + decode_workers: 1 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + 
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + SGLANG_ENABLE_SPEC_V2: '1' + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 1024 + cuda-graph-max-bs: 128 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 8 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 + speculative-algorithm: EAGLE + speculative-num-steps: 2 + speculative-eagle-topk: 1 
+ speculative-num-draft-tokens: 3 +health_check: + max_attempts: 720 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_0.yaml new file mode 100644 index 000000000..24d37e3ee --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_0.yaml @@ -0,0 +1,146 @@ +name: b200-fp8-stp-low-latency-tp8-1p-3d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 3 + decode_workers: 3 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + 
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 32 + cuda-graph-max-bs: 32 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 1 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + 
enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + # disable-chunked-prefix-cache: true + +health_check: + max_attempts: 360 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_1.yaml new file mode 100644 index 000000000..c97d109d9 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_1.yaml @@ -0,0 +1,146 @@ +name: b200-fp8-stp-low-latency-tp8-1p-4d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 4 + decode_workers: 4 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' 
+ SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 32 + cuda-graph-max-bs: 32 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 1 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + 
watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + # disable-chunked-prefix-cache: true + +health_check: + max_attempts: 360 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_2.yaml new file mode 100644 index 000000000..503f1363b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_2.yaml @@ -0,0 +1,146 @@ +name: b200-fp8-stp-low-latency-tp8-1p-6d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 6 + decode_workers: 6 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + 
DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 22 + cuda-graph-max-bs: 22 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 1 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # 
Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + # disable-chunked-prefix-cache: true + +health_check: + max_attempts: 360 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_0.yaml new file mode 100644 index 000000000..cb8d13717 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_0.yaml @@ -0,0 +1,147 @@ +name: b200-fp8-stp-max-tpt-dep8-1p-2d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 2 + decode_workers: 2 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' 
+ PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 128 + cuda-graph-max-bs: 128 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 8 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + 
moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 +health_check: + max_attempts: 360 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_1.yaml new file mode 100644 index 000000000..875893e72 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_1.yaml @@ -0,0 +1,147 @@ +name: b200-fp8-stp-max-tpt-dep8-1p-1d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 1 + decode_workers: 1 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + 
SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 256 + cuda-graph-max-bs: 256 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + 
expert-parallel-size: 8 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 +health_check: + max_attempts: 360 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_2.yaml new file mode 100644 index 000000000..1402c1202 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_2.yaml @@ -0,0 +1,147 @@ +name: b200-fp8-stp-max-tpt-dep8-2p-1d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 2 + prefill_workers: 2 + decode_nodes: 1 + decode_workers: 1 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + 
NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 512 + cuda-graph-max-bs: 512 + + # 
Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 8 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 +health_check: + max_attempts: 360 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_3.yaml new file mode 100644 index 000000000..a689bf0ac --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/stp/8k1k_stp_maxtpt_3.yaml @@ -0,0 +1,147 @@ +name: b200-fp8-stp-max-tpt-dep8-3p-1d + +dynamo: + version: 0.9.1 + +model: + path: dsr1-fp8 + container: dynamo-sglang + precision: fp8 + +resources: + gpu_type: b200 + prefill_nodes: 3 + prefill_workers: 3 + decode_nodes: 1 + decode_workers: 1 + gpus_per_node: 8 + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + 
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' + PYTHONUNBUFFERED: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: '1000' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' + MC_FORCE_MNNVL: '1' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + DYN_REQUEST_PLANE: nats + CUDA_SCALE_LAUNCH_QUEUES: 4x + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: '1' + + sglang_config: + prefill: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: prefill + disaggregation-transfer-backend: nixl + load-balance-method: round_robin + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 8192 + chunked-prefill-size: 65536 + max-running-requests: 8 + context-length: 9600 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 1 + enable-dp-attention: true + enable-dp-lm-head: true + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + moe-dense-tp-size: 1 + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + + decode: + # Model configuration + served-model-name: deepseek-ai/DeepSeek-R1 + trust-remote-code: true + quantization: fp8 + + # Disaggregation mode + disaggregation-mode: decode + disaggregation-transfer-backend: nixl + + # Memory and token limits + 
mem-fraction-static: 0.75 + context-length: 9600 + max-running-requests: 1024 + cuda-graph-max-bs: 1024 + + # Parallelism + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 8 + + # Attention + attention-backend: trtllm_mla + kv-cache-dtype: fp8_e4m3 + + # MoE + moe-runner-backend: flashinfer_trtllm + + # Other flags + stream-interval: 30 + watchdog-timeout: 1000000 + enable-flashinfer-allreduce-fusion: true + disable-radix-cache: true + enable-dp-attention: true + enable-dp-lm-head: true + moe-dense-tp-size: 1 +health_check: + max_attempts: 360 + interval_seconds: 10 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/1k1k/disagg/stp/low-latency.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/1k1k/disagg/stp/low-latency.yaml new file mode 100644 index 000000000..b280e7176 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/1k1k/disagg/stp/low-latency.yaml @@ -0,0 +1,128 @@ +name: "gb200-fp4-1k1k-low-latency" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 3 + nginx_container: nginx-sqsh + +model: + path: "dsr1-fp4" + container: "dynamo-sglang" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + decode_nodes: 2 + prefill_workers: 1 + decode_workers: 2 + gpus_per_node: 4 + +backend: + + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + 
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + + sglang_config: + prefill: + disaggregation-mode: "prefill" + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + disable-radix-cache: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_trtllm" + stream-interval: 10 + watchdog-timeout: 1000000 + context-length: 2200 + mem-fraction-static: 0.95 + max-total-tokens: 8192 + chunked-prefill-size: 8192 + cuda-graph-max-bs: 256 + max-running-requests: 512 + scheduler-recv-interval: 10 + enable-symm-mem: true + load-balance-method: "round_robin" + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + fp4-gemm-backend: "flashinfer_trtllm" + data-parallel-size: 1 + tensor-parallel-size: 4 + expert-parallel-size: 1 + + decode: + disaggregation-mode: "decode" + served-model-name: "deepseek-ai/DeepSeek-R1" + prefill-round-robin-balance: true + trust-remote-code: true + disable-radix-cache: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: 
"trtllm_mla" + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_trtllm" + disaggregation-bootstrap-port: 30001 + stream-interval: 10 + watchdog-timeout: 1000000 + context-length: 2200 + mem-fraction-static: 0.95 + chunked-prefill-size: 8192 + cuda-graph-max-bs: 256 + scheduler-recv-interval: 10 + enable-symm-mem: true + disaggregation-transfer-backend: nixl + fp4-gemm-backend: "flashinfer_trtllm" + tensor-parallel-size: 4 + expert-parallel-size: 1 + +# InferenceX bench-serving wrapper, invoked via srt-slurm `benchmark.type: custom`. +# Most env (MODEL, ISL, OSL, CONC_LIST, DISAGG) is exported by +# benchmark-multinode-tmpl.yml and propagated through srtctl → srun → pyxis, +# so the recipe only carries per-recipe knobs that have no workflow source. +# See benchmarks/multi_node/srt_bench.sh for the full env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + # Override $MODEL because this sglang recipe advertises a different + # served-model-name from what master-yaml's `model:` field is set to. 
+ MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "12" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/1k1k/disagg/stp/max-tpt.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/1k1k/disagg/stp/max-tpt.yaml new file mode 100644 index 000000000..eb499618e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/1k1k/disagg/stp/max-tpt.yaml @@ -0,0 +1,190 @@ +name: "gb200-fp4-1k1k-max-tpt" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 9 + nginx_container: nginx-sqsh + +model: + path: "dsr1-fp4" + container: "dynamo-sglang" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 4 + decode_nodes: 12 + prefill_workers: 4 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + 
NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024" + SGLANG_MOE_NVFP4_DISPATCH: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_cutlass" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + watchdog-timeout: 1000000 + context-length: 2176 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.84 + max-total-tokens: 131072 + max-prefill-tokens: 32768 + chunked-prefill-size: 65536 + enable-single-batch-overlap: true + + # Request handling + max-running-requests: 30000 + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + enable-dp-attention: true + fp4-gemm-backend: "flashinfer_cutlass" + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 4 + dp-size: 4 + ep-size: 4 + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_cutedsl" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + watchdog-timeout: 1000000 + context-length: 2176 + disable-shared-experts-fusion: true 
+ eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.83 + max-total-tokens: 3122380 + chunked-prefill-size: 786432 + + # Request handling + max-running-requests: 67584 + enable-single-batch-overlap: true + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + ep-num-redundant-experts: 32 + + # CUDA graphs (extensive batch size list) + cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 1024] + num-reserved-decode-tokens: 112 + + # Additional decode optimizations + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + enable-dp-attention: true + disaggregation-transfer-backend: nixl + fp4-gemm-backend: "flashinfer_cutlass" + + # Parallelism + tp-size: 48 + dp-size: 48 + ep-size: 48 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "48"
+    TOTAL_GPUS: "64"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/1k1k/disagg/stp/mid-curve.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/1k1k/disagg/stp/mid-curve.yaml
new file mode 100644
index 000000000..fdfce3821
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/1k1k/disagg/stp/mid-curve.yaml
@@ -0,0 +1,189 @@
+name: "gb200-fp4-1k1k-mid-curve"
+
+dynamo:
+  version: 0.8.1
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 9
+  nginx_container: nginx-sqsh
+
+model:
+  path: "dsr1-fp4"
+  container: "dynamo-sglang"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 4
+  decode_nodes: 8
+  prefill_workers: 4
+  decode_workers: 1
+  gpus_per_node: 4
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
+    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+
+  # Decode-specific environment variables
+  decode_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
+    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024"
+    SGLANG_MOE_NVFP4_DISPATCH: "1"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      trust-remote-code: true
+
+      # KV cache and attention
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+
+      # Quantization
+      quantization: "modelopt_fp4"
+      moe-runner-backend: "flashinfer_cutlass"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+      disable-chunked-prefix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      decode-log-interval: 1000
+      watchdog-timeout: 1000000
+      context-length: 2176
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Prefill-specific mode
+      disaggregation-mode: "prefill"
+
+      # Memory and token limits
+      mem-fraction-static: 0.84
+      max-total-tokens: 131072
+      max-prefill-tokens: 32768
+      chunked-prefill-size: 65536
+      enable-single-batch-overlap: true
+
+      # Request handling
+      max-running-requests: 30000
+      load-balance-method: "round_robin"
+
+      # Performance optimizations
+      disable-cuda-graph: true
+      enable-dp-attention: true
+      fp4-gemm-backend: "flashinfer_cutlass"
+      disaggregation-transfer-backend: nixl
+
+      # Parallelism
+      tp-size: 4
+      dp-size: 4
+      ep-size: 4
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      trust-remote-code: true
+
+      # KV cache and attention
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+
+      # Quantization
+      quantization: "modelopt_fp4"
+      moe-runner-backend: "flashinfer_cutedsl"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+      disable-chunked-prefix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      decode-log-interval: 1000
+      watchdog-timeout: 1000000
+      context-length: 2176
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Decode-specific mode
+      disaggregation-mode: "decode"
+
+      # Memory and token limits
+      mem-fraction-static: 0.83
+      max-total-tokens: 3122380
+      chunked-prefill-size: 786432
+
+      # Request handling
+      max-running-requests: 67584
+
+      # DeepEP configuration
+      moe-a2a-backend: "deepep"
+      deepep-mode: "low_latency"
+      ep-dispatch-algorithm: "static"
+      ep-num-redundant-experts: 32
+
+      # CUDA graphs (extensive batch size list)
+      cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 1024]
+      num-reserved-decode-tokens: 112
+
+      # Additional decode optimizations
+      moe-dense-tp-size: 1
+      enable-dp-lm-head: true
+      prefill-round-robin-balance: true
+      enable-dp-attention: true
+      disaggregation-transfer-backend: nixl
+      fp4-gemm-backend: "flashinfer_cutlass"
+
+      # Parallelism
+      tp-size: 32
+      dp-size: 32
+      ep-size: 32
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "32"
+    TOTAL_GPUS: "48"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/8k1k/disagg/stp/low-latency.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/8k1k/disagg/stp/low-latency.yaml
new file mode 100644
index 000000000..48b044bd3
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/8k1k/disagg/stp/low-latency.yaml
@@ -0,0 +1,125 @@
+name: "gb200-fp4-8k1k-low-latency"
+
+dynamo:
+  version: 0.8.1
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 4
+  nginx_container: nginx-sqsh
+
+model:
+  path: "dsr1-fp4"
+  container: "dynamo-sglang"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 1
+  decode_nodes: 4
+  prefill_workers: 1
+  decode_workers: 4
+  gpus_per_node: 4
+
+backend:
+
+  prefill_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
+
+  decode_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
+
+  sglang_config:
+    prefill:
+      disaggregation-mode: "prefill"
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      trust-remote-code: true
+      disable-radix-cache: true
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+      quantization: "modelopt_fp4"
+      moe-runner-backend: "flashinfer_trtllm"
+      stream-interval: 50
+      watchdog-timeout: 1000000
+      context-length: 9600
+      mem-fraction-static: 0.95
+      max-total-tokens: 32768
+      chunked-prefill-size: 24576
+      cuda-graph-max-bs: 256
+      max-running-requests: 512
+      scheduler-recv-interval: 10
+      enable-symm-mem: true
+      load-balance-method: "round_robin"
+      disaggregation-bootstrap-port: 30001
+      data-parallel-size: 1
+      disaggregation-transfer-backend: nixl
+      fp4-gemm-backend: "flashinfer_trtllm"
+      tensor-parallel-size: 4
+      expert-parallel-size: 1
+      enable-dp-attention: false
+
+    decode:
+      disaggregation-mode: "decode"
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      prefill-round-robin-balance: true
+      trust-remote-code: true
+      disable-radix-cache: true
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+      quantization: "modelopt_fp4"
+      moe-runner-backend: "flashinfer_trtllm"
+      disaggregation-bootstrap-port: 30001
+      stream-interval: 50
+      watchdog-timeout: 1000000
+      context-length: 9600
+      mem-fraction-static: 0.95
+      chunked-prefill-size: 8192
+      cuda-graph-max-bs: 256
+      scheduler-recv-interval: 10
+      enable-symm-mem: true
+      disaggregation-transfer-backend: nixl
+      fp4-gemm-backend: "flashinfer_trtllm"
+      tensor-parallel-size: 4
+      expert-parallel-size: 1
+      enable-dp-attention: false
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "4"
+    TOTAL_GPUS: "20"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/8k1k/disagg/stp/max-tpt.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/8k1k/disagg/stp/max-tpt.yaml
new file mode 100644
index 000000000..cbf43343b
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/8k1k/disagg/stp/max-tpt.yaml
@@ -0,0 +1,186 @@
+name: "gb200-fp4-8k1k-max-tpt"
+
+dynamo:
+  version: 0.8.1
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 9
+  nginx_container: nginx-sqsh
+
+model:
+  path: "dsr1-fp4"
+  container: "dynamo-sglang"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 10
+  decode_nodes: 8
+  prefill_workers: 10
+  decode_workers: 1
+  gpus_per_node: 4
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
+    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+
+  # Decode-specific environment variables
+  decode_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
+    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "512"
+    SGLANG_MOE_NVFP4_DISPATCH: "1"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      trust-remote-code: true
+
+      # KV cache and attention
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+
+      # Quantization
+      quantization: "modelopt_fp4"
+      moe-runner-backend: "flashinfer_trtllm"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+      disable-chunked-prefix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      decode-log-interval: 1000
+      watchdog-timeout: 1000000
+      context-length: 9600
+      disable-shared-experts-fusion: true
+      disaggregation-bootstrap-port: 30001
+
+      # Prefill-specific mode
+      disaggregation-mode: "prefill"
+
+      # Memory and token limits
+      mem-fraction-static: 0.95
+      max-total-tokens: 131072
+      max-prefill-tokens: 524288
+      chunked-prefill-size: 131072
+
+      # Request handling
+      max-running-requests: 30000
+      load-balance-method: "round_robin"
+
+      # Performance optimizations
+      disable-cuda-graph: true
+      enable-dp-attention: false
+      fp4-gemm-backend: "flashinfer_cutlass"
+      disaggregation-transfer-backend: nixl
+
+      # Parallelism
+      tp-size: 4
+      dp-size: 1
+      ep-size: 1
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      trust-remote-code: true
+
+      # KV cache and attention
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+
+      # Quantization
+      quantization: "modelopt_fp4"
+      moe-runner-backend: "flashinfer_cutedsl"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+      disable-chunked-prefix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      decode-log-interval: 1000
+      watchdog-timeout: 1000000
+      context-length: 9600
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Decode-specific mode
+      disaggregation-mode: "decode"
+
+      # Memory and token limits
+      mem-fraction-static: 0.83
+      max-total-tokens: 524288
+      chunked-prefill-size: 24576
+
+      # Request handling
+      max-running-requests: 16384
+
+      # DeepEP configuration
+      moe-a2a-backend: "deepep"
+      deepep-mode: "low_latency"
+      ep-dispatch-algorithm: "static"
+      ep-num-redundant-experts: 32
+
+      cuda-graph-max-bs: 512
+      num-reserved-decode-tokens: 112
+
+      # Additional decode optimizations
+      moe-dense-tp-size: 1
+      enable-dp-lm-head: true
+      prefill-round-robin-balance: true
+      enable-dp-attention: true
+      fp4-gemm-backend: "flashinfer_cutlass"
+      disaggregation-transfer-backend: nixl
+
+      # Parallelism
+      tp-size: 32
+      dp-size: 32
+      ep-size: 32
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "32"
+    TOTAL_GPUS: "72"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/8k1k/disagg/stp/mid-curve.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/8k1k/disagg/stp/mid-curve.yaml
new file mode 100644
index 000000000..39f9ab7c8
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp4/8k1k/disagg/stp/mid-curve.yaml
@@ -0,0 +1,186 @@
+name: "gb200-fp4-8k1k-mid-curve"
+
+dynamo:
+  version: 0.8.1
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 9
+  nginx_container: nginx-sqsh
+
+model:
+  path: "dsr1-fp4"
+  container: "dynamo-sglang"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 6
+  decode_nodes: 12
+  prefill_workers: 6
+  decode_workers: 1
+  gpus_per_node: 4
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
+    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+
+  # Decode-specific environment variables
+  decode_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
+    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "512"
+    SGLANG_MOE_NVFP4_DISPATCH: "1"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      trust-remote-code: true
+
+      # KV cache and attention
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+
+      # Quantization
+      quantization: "modelopt_fp4"
+      moe-runner-backend: "flashinfer_trtllm"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+      disable-chunked-prefix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      decode-log-interval: 1000
+      watchdog-timeout: 1000000
+      context-length: 9600
+      disable-shared-experts-fusion: true
+      disaggregation-bootstrap-port: 30001
+
+      # Prefill-specific mode
+      disaggregation-mode: "prefill"
+
+      # Memory and token limits
+      mem-fraction-static: 0.95
+      max-total-tokens: 131072
+      max-prefill-tokens: 524288
+      chunked-prefill-size: 131072
+
+      # Request handling
+      max-running-requests: 30000
+      load-balance-method: "round_robin"
+
+      # Performance optimizations
+      disable-cuda-graph: true
+      enable-dp-attention: false
+      fp4-gemm-backend: "flashinfer_cutlass"
+      disaggregation-transfer-backend: nixl
+
+      # Parallelism
+      tp-size: 4
+      dp-size: 1
+      ep-size: 1
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      trust-remote-code: true
+
+      # KV cache and attention
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+
+      # Quantization
+      quantization: "modelopt_fp4"
+      moe-runner-backend: "flashinfer_cutedsl"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+      disable-chunked-prefix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      decode-log-interval: 1000
+      watchdog-timeout: 1000000
+      context-length: 9600
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Decode-specific mode
+      disaggregation-mode: "decode"
+
+      # Memory and token limits
+      mem-fraction-static: 0.83
+      max-total-tokens: 524288
+      chunked-prefill-size: 24576
+
+      # Request handling
+      max-running-requests: 16384
+
+      # DeepEP configuration
+      moe-a2a-backend: "deepep"
+      deepep-mode: "low_latency"
+      ep-dispatch-algorithm: "static"
+      ep-num-redundant-experts: 32
+
+      cuda-graph-max-bs: 512
+      num-reserved-decode-tokens: 112
+
+      # Additional decode optimizations
+      moe-dense-tp-size: 1
+      enable-dp-lm-head: true
+      prefill-round-robin-balance: true
+      enable-dp-attention: true
+      fp4-gemm-backend: "flashinfer_cutlass"
+      disaggregation-transfer-backend: nixl
+
+      # Parallelism
+      tp-size: 48
+      dp-size: 48
+      ep-size: 48
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "48"
+    TOTAL_GPUS: "72"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/low-latency.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/low-latency.yaml
new file mode 100644
index 000000000..5dc0c0c73
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/low-latency.yaml
@@ -0,0 +1,128 @@
+name: "gb200-fp8-1k1k-low-latency"
+
+dynamo:
+  version: 0.8.1
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 2
+  nginx_container: nginx
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-sglang"
+  precision: "fp8"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 1
+  decode_nodes: 1
+  prefill_workers: 1
+  decode_workers: 1
+  gpus_per_node: 4
+
+backend:
+  prefill_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+
+  decode_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
+    SGLANG_ENABLE_FLASHINFER_GEMM: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+
+  sglang_config:
+    prefill:
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      trust-remote-code: true
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+      quantization: "fp8"
+      moe-runner-backend: "flashinfer_trtllm"
+      disable-radix-cache: true
+      stream-interval: 10
+      watchdog-timeout: 1000000
+      context-length: 2200
+      disaggregation-mode: "prefill"
+      mem-fraction-static: 0.95
+      max-total-tokens: 8192
+      chunked-prefill-size: 8192
+      cuda-graph-max-bs: 128
+      max-running-requests: 512
+      load-balance-method: "round_robin"
+      scheduler-recv-interval: 10
+      fp8-gemm-backend: "flashinfer_trtllm"
+      enable-symm-mem: true
+      tensor-parallel-size: 4
+      data-parallel-size: 1
+      expert-parallel-size: 1
+      disaggregation-bootstrap-port: 30001
+      disaggregation-transfer-backend: nixl
+
+    decode:
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      trust-remote-code: true
+      kv-cache-dtype: "fp8_e4m3"
+      attention-backend: "trtllm_mla"
+      quantization: "fp8"
+      moe-runner-backend: "flashinfer_trtllm"
+      disable-radix-cache: true
+      stream-interval: 10
+      watchdog-timeout: 1000000
+      context-length: 2200
+      disaggregation-mode: "decode"
+      mem-fraction-static: 0.95
+      chunked-prefill-size: 8192
+      cuda-graph-max-bs: 128
+      max-running-requests: 128
+      scheduler-recv-interval: 10
+      enable-symm-mem: true
+      prefill-round-robin-balance: true
+      tensor-parallel-size: 4
+      data-parallel-size: 1
+      expert-parallel-size: 1
+      fp8-gemm-backend: "flashinfer_trtllm"
+      disaggregation-bootstrap-port: 30001
+      disaggregation-transfer-backend: nixl
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "4"
+    TOTAL_GPUS: "8"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/max-tpt.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/max-tpt.yaml
new file mode 100644
index 000000000..c7a9e0923
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/max-tpt.yaml
@@ -0,0 +1,182 @@
+name: "gb200-fp8-1k1k-max-tpt"
+
+dynamo:
+  version: 0.8.1
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 9
+  nginx_container: nginx
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-sglang"
+  precision: "fp8"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 4
+  prefill_workers: 2
+  decode_nodes: 8
+  decode_workers: 1
+  gpus_per_node: 4
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    MC_TE_METRIC: "true"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    PYTHONUNBUFFERED: "1"
+
+  # Decode-specific environment variables
+  decode_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "768"
+    MC_TE_METRIC: "true"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    PYTHONUNBUFFERED: "1"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 8
+      ep-size: 8
+      enable-dp-attention: true
+
+      # KV cache and attention
+      attention-backend: "trtllm_mla"
+      kv-cache-dtype: "fp8_e4m3"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      max-running-requests: 30000
+      context-length: 2200
+      watchdog-timeout: 1000000
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Prefill-specific mode
+      disaggregation-mode: "prefill"
+
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-total-tokens: 524288
+      chunked-prefill-size: 131072
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+      # Performance optimizations
+      disable-cuda-graph: true
+
+      # DeepEP configuration
+      moe-a2a-backend: "deepep"
+      deepep-mode: "normal"
+      ep-dispatch-algorithm: "dynamic"
+      moe-dense-tp-size: 1
+      enable-dp-lm-head: true
+      ep-num-redundant-experts: 32
+      deepep-config: "/configs/deepep_config.json"
+
+      disaggregation-transfer-backend: nixl
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 32
+      dp-size: 32
+      ep-size: 32
+      enable-dp-attention: true
+
+      # KV cache and attention
+      attention-backend: "trtllm_mla"
+      kv-cache-dtype: "fp8_e4m3"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      decode-log-interval: 1000
+      max-running-requests: 45000
+      context-length: 2200
+      watchdog-timeout: 1000000
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Decode-specific mode
+      disaggregation-mode: "decode"
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      chunked-prefill-size: 36864
+
+      # DeepEP configuration
+      moe-a2a-backend: "deepep"
+      deepep-mode: "low_latency"
+      ep-dispatch-algorithm: "static"
+      moe-dense-tp-size: 1
+      enable-dp-lm-head: true
+      prefill-round-robin-balance: true
+      ep-num-redundant-experts: 32
+      deepep-config: "/configs/deepep_config.json"
+
+      # CUDA graphs
+      cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768]
+      cuda-graph-max-bs: 768
+
+      disaggregation-transfer-backend: nixl
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "32"
+    TOTAL_GPUS: "48"
+
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/mid-curve.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/mid-curve.yaml
new file mode 100644
index 000000000..0de49d6d7
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/mid-curve.yaml
@@ -0,0 +1,181 @@
+name: "gb200-fp8-1k1k-mid-curve"
+
+dynamo:
+  version: 0.8.1
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 9
+  nginx_container: nginx
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-sglang"
+  precision: "fp8"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 6
+  prefill_workers: 3
+  decode_nodes: 12
+  decode_workers: 1
+  gpus_per_node: 4
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    MC_TE_METRIC: "true"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    PYTHONUNBUFFERED: "1"
+
+  # Decode-specific environment variables
+  decode_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "768"
+    MC_TE_METRIC: "true"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    PYTHONUNBUFFERED: "1"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 8
+      ep-size: 8
+      enable-dp-attention: true
+
+      # KV cache and attention
+      attention-backend: "trtllm_mla"
+      kv-cache-dtype: "fp8_e4m3"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      max-running-requests: 30000
+      context-length: 2200
+      watchdog-timeout: 1000000
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Prefill-specific mode
+      disaggregation-mode: "prefill"
+
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-total-tokens: 524288
+      chunked-prefill-size: 131072
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+      # Performance optimizations
+      disable-cuda-graph: true
+
+      # DeepEP configuration
+      moe-a2a-backend: "deepep"
+      deepep-mode: "normal"
+      ep-dispatch-algorithm: "dynamic"
+      moe-dense-tp-size: 1
+      enable-dp-lm-head: true
+      ep-num-redundant-experts: 32
+      deepep-config: "/configs/deepep_config.json"
+      disaggregation-transfer-backend: nixl
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 48
+      dp-size: 48
+      ep-size: 48
+      enable-dp-attention: true
+
+      # KV cache and attention
+      attention-backend: "trtllm_mla"
+      kv-cache-dtype: "fp8_e4m3"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      decode-log-interval: 1000
+      max-running-requests: 45000
+      context-length: 2200
+      watchdog-timeout: 1000000
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Decode-specific mode
+      disaggregation-mode: "decode"
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      chunked-prefill-size: 36864
+
+      # DeepEP configuration
+      moe-a2a-backend: "deepep"
+      deepep-mode: "low_latency"
+      ep-dispatch-algorithm: "static"
+      moe-dense-tp-size: 1
+      enable-dp-lm-head: true
+      prefill-round-robin-balance: true
+      ep-num-redundant-experts: 32
+      deepep-config: "/configs/deepep_config.json"
+
+      # CUDA graphs
+      cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768]
+      cuda-graph-max-bs: 768
+      disaggregation-transfer-backend: nixl
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "48"
+    TOTAL_GPUS: "72"
+
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/ultra-tpt.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/ultra-tpt.yaml
new file mode 100644
index 000000000..f335aa042
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/1k1k/disagg/stp/ultra-tpt.yaml
@@ -0,0 +1,183 @@
+name: "gb200-fp8-1k1k-ultra-tpt"
+
+dynamo:
+  version: 0.8.1
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 3
+  nginx_container: nginx
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-sglang"
+  precision: "fp8"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 2
+  prefill_workers: 1
+  decode_nodes: 2
+  decode_workers: 1
+  gpus_per_node: 4
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    MC_TE_METRIC: "true"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    PYTHONUNBUFFERED: "1"
+
+  # Decode-specific environment variables
+  decode_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "640"
+    MC_TE_METRIC: "true"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
+    SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    PYTHONUNBUFFERED: "1"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 8
+      ep-size: 8
+      enable-dp-attention: true
+
+      # KV cache and attention
+      attention-backend: "trtllm_mla"
+      kv-cache-dtype: "fp8_e4m3"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      max-running-requests: 8192
+      context-length: 2200
+      watchdog-timeout: 1000000
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Prefill-specific mode
+      disaggregation-mode: "prefill"
+
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-total-tokens: 524288
+      chunked-prefill-size: 131072
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+      # Performance optimizations
+      disable-cuda-graph: true
+
+      # DeepEP configuration
+      moe-a2a-backend: "deepep"
+      deepep-mode: "normal"
+      ep-dispatch-algorithm: "dynamic"
+      moe-dense-tp-size: 1
+      enable-dp-lm-head: true
+      ep-num-redundant-experts: 32
+      deepep-config: "/configs/deepep_config.json"
+
+      disaggregation-transfer-backend: nixl
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 8
+      ep-size: 8
+      enable-dp-attention: true
+
+      # KV cache and attention
+      attention-backend: "trtllm_mla"
+      kv-cache-dtype: "fp8_e4m3"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      stream-interval: 50
+      decode-log-interval: 1000
+      max-running-requests: 5120
+      context-length: 2200
+      watchdog-timeout: 1000000
+      disable-shared-experts-fusion: true
+      eplb-algorithm: "deepseek"
+      disaggregation-bootstrap-port: 30001
+
+      # Decode-specific mode
+      disaggregation-mode: "decode"
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      chunked-prefill-size: 36864
+
+      # DeepEP configuration
+      moe-a2a-backend: "deepep"
+      deepep-mode: "low_latency"
+      ep-dispatch-algorithm: "static"
+      moe-dense-tp-size: 1
+      enable-dp-lm-head: true
+      prefill-round-robin-balance: true
+      ep-num-redundant-experts: 32
+      deepep-config: "/configs/deepep_config.json"
+
+      # CUDA graphs
+      cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640]
+      cuda-graph-max-bs: 640
+
+      disaggregation-transfer-backend: nixl
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "16"
+
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/8k1k/disagg/stp/low-latency.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/8k1k/disagg/stp/low-latency.yaml
new file mode 100644
index 000000000..94ee5ed1f
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/8k1k/disagg/stp/low-latency.yaml
@@ -0,0 +1,124 @@
+name: "gb200-fp8-8k1k-low-latency"
+
+dynamo:
+  version: 0.8.1
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 2
+  nginx_container: nginx
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-sglang"
+  precision: "fp8"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 2
+  decode_nodes: 2
+  prefill_workers: 1
+  decode_workers: 1
+  gpus_per_node: 4
+
+backend:
+  prefill_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+
+  decode_environment:
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + + sglang_config: + prefill: + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "fp8" + moe-runner-backend: "flashinfer_trtllm" + disable-radix-cache: true + watchdog-timeout: 1000000 + context-length: 9600 + disaggregation-mode: "prefill" + mem-fraction-static: 0.8 + max-total-tokens: 32768 + chunked-prefill-size: 24576 + cuda-graph-max-bs: 512 + max-running-requests: 512 + load-balance-method: "round_robin" + scheduler-recv-interval: 10 + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 1 + fp8-gemm-backend: "flashinfer_trtllm" + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + + decode: + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "fp8" + moe-runner-backend: "flashinfer_trtllm" + disable-radix-cache: true + watchdog-timeout: 1000000 + context-length: 9600 + disaggregation-mode: "decode" + mem-fraction-static: 0.8 + chunked-prefill-size: 8192 + cuda-graph-max-bs: 512 + max-running-requests: 512 + scheduler-recv-interval: 10 + enable-symm-mem: true + prefill-round-robin-balance: true + tensor-parallel-size: 8 + data-parallel-size: 1 + expert-parallel-size: 1 + fp8-gemm-backend: "flashinfer_trtllm" + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/8k1k/disagg/stp/max_tpt.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/8k1k/disagg/stp/max_tpt.yaml new file mode 100644 index 000000000..2865f2e52 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/8k1k/disagg/stp/max_tpt.yaml @@ -0,0 +1,178 @@ +name: "gb200-8k1k-fp8-max-tpt" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 9 + nginx_container: nginx + +model: + path: "dsr1-fp8" + container: "dynamo-sglang" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 12 + prefill_workers: 6 + decode_nodes: 6 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "512" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + 
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 8 + ep-size: 8 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + max-running-requests: 30000 + context-length: 9300 + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.80 + max-total-tokens: 524288 + chunked-prefill-size: 131072 + + # Request handling + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "normal" + ep-dispatch-algorithm: "dynamic" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 24 + dp-size: 24 + ep-size: 24 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 
1000 + max-running-requests: 8192 + context-length: 9300 + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.82 + chunked-prefill-size: 36864 + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + # CUDA graphs + cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512] + cuda-graph-max-bs: 512 + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "24" + TOTAL_GPUS: "72" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/8k1k/disagg/stp/mid-curve.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/8k1k/disagg/stp/mid-curve.yaml new file mode 100644 index 000000000..a1559e71d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb200-fp8/8k1k/disagg/stp/mid-curve.yaml @@ -0,0 +1,177 @@ +name: "gb200-8k1k-fp8-mid-tpt" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 9 + nginx_container: nginx + +model: + path: "dsr1-fp8" + container: "dynamo-sglang" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 10 + prefill_workers: 5 + decode_nodes: 8 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "256" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + 
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 8 + ep-size: 8 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + max-running-requests: 30000 + context-length: 9300 + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.80 + max-total-tokens: 524288 + chunked-prefill-size: 131072 + + # Request handling + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "normal" + ep-dispatch-algorithm: "dynamic" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 32 + dp-size: 32 + ep-size: 32 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 
1000 + max-running-requests: 8192 + context-length: 9300 + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.82 + chunked-prefill-size: 36864 + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + # CUDA graphs + cuda-graph-max-bs: 256 + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "32" + TOTAL_GPUS: "72" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/1k1k/disagg/stp/low_latency.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/1k1k/disagg/stp/low_latency.yaml new file mode 100644 index 000000000..c531f8446 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/1k1k/disagg/stp/low_latency.yaml @@ -0,0 +1,123 @@ +name: "gb300-fp4-low-latency-1k1k" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 4 + nginx_container: nginx-sqsh + +model: + path: "dsr1" + container: "dynamo-sglang" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + decode_nodes: 2 + prefill_workers: 1 + decode_workers: 2 + gpus_per_node: 4 + +backend: + + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + 
PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + + sglang_config: + prefill: + disaggregation-mode: "prefill" + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + disable-radix-cache: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_trtllm" + stream-interval: 10 + watchdog-timeout: 1000000 + context-length: 2200 + mem-fraction-static: 0.95 + max-total-tokens: 8192 + chunked-prefill-size: 8192 + cuda-graph-max-bs: 256 + max-running-requests: 512 + scheduler-recv-interval: 10 + enable-symm-mem: true + load-balance-method: "round_robin" + disaggregation-bootstrap-port: 30001 + data-parallel-size: 1 + tensor-parallel-size: 4 + expert-parallel-size: 1 + fp4-gemm-backend: "flashinfer_trtllm" + disaggregation-transfer-backend: nixl + + decode: + disaggregation-mode: "decode" + served-model-name: "deepseek-ai/DeepSeek-R1" + prefill-round-robin-balance: true 
+ trust-remote-code: true + disable-radix-cache: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_trtllm" + disaggregation-bootstrap-port: 30001 + stream-interval: 10 + watchdog-timeout: 1000000 + context-length: 2200 + mem-fraction-static: 0.95 + chunked-prefill-size: 8192 + cuda-graph-max-bs: 256 + scheduler-recv-interval: 10 + enable-symm-mem: true + tensor-parallel-size: 4 + expert-parallel-size: 1 + fp4-gemm-backend: "flashinfer_trtllm" + disaggregation-transfer-backend: nixl + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "12" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/1k1k/disagg/stp/max_tpt.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/1k1k/disagg/stp/max_tpt.yaml new file mode 100644 index 000000000..c4a3d6524 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/1k1k/disagg/stp/max_tpt.yaml @@ -0,0 +1,191 @@ +name: "gb300-fp4-max-tpt-1k1k" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 9 + nginx_container: nginx-sqsh + +model: + path: "dsr1" + container: "dynamo-sglang" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 4 + decode_nodes: 12 + prefill_workers: 4 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + 
SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024" + SGLANG_MOE_NVFP4_DISPATCH: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_cutlass" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + watchdog-timeout: 1000000 + context-length: 2176 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.84 + max-total-tokens: 131072 + max-prefill-tokens: 32768 + chunked-prefill-size: 65536 + 
enable-single-batch-overlap: true + + # Request handling + max-running-requests: 30000 + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + enable-dp-attention: true + disaggregation-transfer-backend: nixl + fp4-gemm-backend: "flashinfer_cutlass" + + # Parallelism + tp-size: 4 + dp-size: 4 + ep-size: 4 + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_cutedsl" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + watchdog-timeout: 1000000 + context-length: 2176 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.83 + max-total-tokens: 3122380 + chunked-prefill-size: 786432 + + # Request handling + max-running-requests: 67584 + enable-single-batch-overlap: true + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + ep-num-redundant-experts: 32 + + # CUDA graphs (extensive batch size list) + cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 1024] + num-reserved-decode-tokens: 112 + + # Additional decode optimizations + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + enable-dp-attention: true + fp4-gemm-backend: "flashinfer_cutlass" + 
disaggregation-transfer-backend: nixl + + + # Parallelism + tp-size: 48 + dp-size: 48 + ep-size: 48 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "4" + DECODE_GPUS: "48" + TOTAL_GPUS: "64" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/1k1k/disagg/stp/mid_curve.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/1k1k/disagg/stp/mid_curve.yaml new file mode 100644 index 000000000..e6d388906 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/1k1k/disagg/stp/mid_curve.yaml @@ -0,0 +1,189 @@ +name: "gb300-fp4-mid-curve-1k1k" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 9 + nginx_container: nginx-sqsh + +model: + path: "dsr1" + container: "dynamo-sglang" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 4 + decode_nodes: 8 + prefill_workers: 4 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + + # Decode-specific environment variables + 
decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024" + SGLANG_MOE_NVFP4_DISPATCH: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_cutlass" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + watchdog-timeout: 1000000 + context-length: 2176 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Prefill-specific mode + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.84 + max-total-tokens: 131072 + max-prefill-tokens: 32768 + chunked-prefill-size: 65536 + enable-single-batch-overlap: true + + # Request handling + max-running-requests: 30000 + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + enable-dp-attention: true + fp4-gemm-backend: "flashinfer_cutlass" + + # Parallelism + tp-size: 4 + dp-size: 4 + ep-size: 4 + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache 
and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_cutedsl" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + watchdog-timeout: 1000000 + context-length: 2176 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.83 + max-total-tokens: 3122380 + chunked-prefill-size: 786432 + + # Request handling + max-running-requests: 67584 + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + ep-num-redundant-experts: 32 + + # CUDA graphs (extensive batch size list) + cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 1024] + num-reserved-decode-tokens: 112 + + # Additional decode optimizations + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + enable-dp-attention: true + fp4-gemm-backend: "flashinfer_cutlass" + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 32 + dp-size: 32 + ep-size: 32 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "48" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/8k1k/disagg/stp/low_latency.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/8k1k/disagg/stp/low_latency.yaml new file mode 100644 index 000000000..5c95e1ffa --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/8k1k/disagg/stp/low_latency.yaml @@ -0,0 +1,126 @@ +name: "gb300-8k1k-fp4-low-latency-8k1k" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 3 + nginx_container: nginx-sqsh + +model: + path: "dsr1" + container: "dynamo-sglang" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + decode_nodes: 4 + prefill_workers: 1 + decode_workers: 4 + gpus_per_node: 4 + +backend: + + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + 
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + + sglang_config: + prefill: + disaggregation-mode: "prefill" + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + disable-radix-cache: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_trtllm" + stream-interval: 50 + watchdog-timeout: 1000000 + context-length: 9600 + mem-fraction-static: 0.95 + max-total-tokens: 32768 + chunked-prefill-size: 24576 + cuda-graph-max-bs: 256 + max-running-requests: 512 + scheduler-recv-interval: 10 + enable-symm-mem: true + load-balance-method: "round_robin" + disaggregation-bootstrap-port: 30001 + data-parallel-size: 1 + tensor-parallel-size: 4 + expert-parallel-size: 1 + enable-dp-attention: false + fp4-gemm-backend: "flashinfer_trtllm" + disaggregation-transfer-backend: nixl + + + decode: + disaggregation-mode: "decode" + served-model-name: "deepseek-ai/DeepSeek-R1" + prefill-round-robin-balance: true + trust-remote-code: true + disable-radix-cache: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_trtllm" + disaggregation-bootstrap-port: 30001 + stream-interval: 50 + watchdog-timeout: 1000000 + context-length: 9600 + mem-fraction-static: 0.95 + chunked-prefill-size: 8192 + cuda-graph-max-bs: 128 + scheduler-recv-interval: 10 + enable-symm-mem: true + tensor-parallel-size: 4 + expert-parallel-size: 1 + enable-dp-attention: false + fp4-gemm-backend: "flashinfer_trtllm" + disaggregation-transfer-backend: nixl + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "20" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/8k1k/disagg/stp/max_tpt.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/8k1k/disagg/stp/max_tpt.yaml new file mode 100644 index 000000000..29a619a6f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/8k1k/disagg/stp/max_tpt.yaml @@ -0,0 +1,186 @@ +name: "gb300-fp4-8k1k-max-tpt" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 9 + nginx_container: nginx-sqsh + +model: + path: "dsr1" + container: "dynamo-sglang" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 10 + decode_nodes: 8 + prefill_workers: 10 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: 
"100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "512" + SGLANG_MOE_NVFP4_DISPATCH: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_trtllm" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + watchdog-timeout: 1000000 + context-length: 9600 + disable-shared-experts-fusion: true + disaggregation-bootstrap-port: 30001 + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.95 + max-total-tokens: 131072 + max-prefill-tokens: 524288 + chunked-prefill-size: 131072 + + # Request handling + max-running-requests: 30000 + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + enable-dp-attention: false + fp4-gemm-backend: "flashinfer_cutlass" + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 4 + dp-size: 1 + ep-size: 1 + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_cutedsl" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 
1000 + watchdog-timeout: 1000000 + context-length: 9600 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.83 + max-total-tokens: 524288 + chunked-prefill-size: 24576 + + # Request handling + max-running-requests: 16384 + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + ep-num-redundant-experts: 32 + + cuda-graph-max-bs: 512 + num-reserved-decode-tokens: 112 + + # Additional decode optimizations + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + enable-dp-attention: true + fp4-gemm-backend: "flashinfer_cutlass" + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 32 + dp-size: 32 + ep-size: 32 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "72" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/8k1k/disagg/stp/mid_curve.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/8k1k/disagg/stp/mid_curve.yaml new file mode 100644 index 000000000..b4de76bb9 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp4/8k1k/disagg/stp/mid_curve.yaml @@ -0,0 +1,186 @@ +name: "gb300-fp4-8k1k-mid-curve" + +dynamo: + version: 0.8.1 + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 9 + nginx_container: nginx-sqsh + +model: + path: "dsr1" + container: "dynamo-sglang" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 6 + decode_nodes: 12 + prefill_workers: 6 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1" + SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1" + 
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "512" + SGLANG_MOE_NVFP4_DISPATCH: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_trtllm" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + watchdog-timeout: 1000000 + context-length: 9600 + disable-shared-experts-fusion: true + disaggregation-bootstrap-port: 30001 + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.95 + max-total-tokens: 131072 + max-prefill-tokens: 524288 + chunked-prefill-size: 131072 + + # Request handling + max-running-requests: 30000 + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + enable-dp-attention: false + fp4-gemm-backend: "flashinfer_cutlass" + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 4 + dp-size: 1 + ep-size: 1 + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + + # KV cache and attention + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + + # Quantization + quantization: "modelopt_fp4" + moe-runner-backend: "flashinfer_cutedsl" + + # Radix cache disabled + disable-radix-cache: true + disable-chunked-prefix-cache: true + + # Other flags 
+ stream-interval: 50 + decode-log-interval: 1000 + watchdog-timeout: 1000000 + context-length: 9600 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.83 + max-total-tokens: 524288 + chunked-prefill-size: 24576 + + # Request handling + max-running-requests: 16384 + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + ep-num-redundant-experts: 32 + + cuda-graph-max-bs: 512 + num-reserved-decode-tokens: 112 + + # Additional decode optimizations + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + enable-dp-attention: true + fp4-gemm-backend: "flashinfer_cutlass" + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 48 + dp-size: 48 + ep-size: 48 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "4" + DECODE_GPUS: "48" + TOTAL_GPUS: "72" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/1k1k/disagg/stp/low-latency.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/1k1k/disagg/stp/low-latency.yaml new file mode 100644 index 000000000..57ea3ff5e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/1k1k/disagg/stp/low-latency.yaml @@ -0,0 +1,129 @@ +name: "gb300-1k1k-fp8-low-latency" + +model: + path: "dsr1-fp8" + container: "dynamo-sglang" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + decode_nodes: 4 + prefill_workers: 1 + decode_workers: 4 + gpus_per_node: 4 + +slurm: + time_limit: "02:00:00" + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + # SGLANG_ENABLE_FLASHINFER_GEMM: "1" # deprecated in 0.5.7, --fp8-gemm-backend=flashinfer_trtllm + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + # SGLANG_ENABLE_FLASHINFER_GEMM: "1" # deprecated in 0.5.7, --fp8-gemm-backend=flashinfer_trtllm + 
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + + sglang_config: + prefill: + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "fp8" + moe-runner-backend: "flashinfer_trtllm" + fp8-gemm-backend: "flashinfer_trtllm" + disable-radix-cache: true + stream-interval: 10 + watchdog-timeout: 1000000 + context-length: 2200 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + mem-fraction-static: 0.95 + max-total-tokens: 8192 + chunked-prefill-size: 8192 + max-prefill-tokens: 8192 + cuda-graph-max-bs: 128 + max-running-requests: 128 + load-balance-method: "round_robin" + scheduler-recv-interval: 10 + enable-flashinfer-allreduce-fusion: false # to save mem + enable-symm-mem: false # to save mem + tensor-parallel-size: 4 + data-parallel-size: 1 + expert-parallel-size: 1 + + decode: + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "fp8" + moe-runner-backend: "flashinfer_trtllm" + fp8-gemm-backend: "flashinfer_trtllm" + disable-radix-cache: true + stream-interval: 10 + watchdog-timeout: 1000000 + context-length: 2200 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + mem-fraction-static: 0.85 + chunked-prefill-size: -1 # save mem + cuda-graph-max-bs: 128 + max-running-requests: 128 + scheduler-recv-interval: 1 # save mem + enable-flashinfer-allreduce-fusion: false # to save mem + enable-symm-mem: false # to save mem + 
prefill-round-robin-balance: true + tensor-parallel-size: 4 + data-parallel-size: 1 + expert-parallel-size: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "20" + diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/1k1k/disagg/stp/max.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/1k1k/disagg/stp/max.yaml new file mode 100644 index 000000000..d27830a5f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/1k1k/disagg/stp/max.yaml @@ -0,0 +1,178 @@ +# GB300 FP8 Max Throughput Configuration + +name: "gb300-1k1k-fp8-max" + +model: + path: "dsr1-fp8" + container: "dynamo-sglang" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "gb300" + prefill_nodes: 2 + prefill_workers: 1 + decode_nodes: 2 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + 
DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 8 + ep-size: 8 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + max-running-requests: 30000 + context-length: 2200 + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.75 + max-total-tokens: 524288 + chunked-prefill-size: 131072 + + # Request handling + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "normal" + ep-dispatch-algorithm: "dynamic" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 8 + dp-size: 8 + 
ep-size: 8 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + max-running-requests: 45000 + context-length: 2200 + + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.82 + chunked-prefill-size: 36864 + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + + # CUDA graphs + cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 1024] + cuda-graph-max-bs: 1024 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/1k1k/disagg/stp/mid.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/1k1k/disagg/stp/mid.yaml new file mode 100644 index 000000000..507f5607a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/1k1k/disagg/stp/mid.yaml @@ -0,0 +1,177 @@ +# GB300 FP8 Mid Throughput Configuration +name: "gb300-1k1k-fp8-mid" + +model: + path: "dsr1-fp8" + container: "dynamo-sglang" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "gb300" + prefill_nodes: 4 + prefill_workers: 2 + decode_nodes: 8 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "768" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" 
+ SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 8 + ep-size: 8 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + max-running-requests: 30000 + context-length: 2200 + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.75 + max-total-tokens: 524288 + chunked-prefill-size: 131072 + + # Request handling + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "normal" + ep-dispatch-algorithm: "dynamic" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 32 + dp-size: 32 + ep-size: 32 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + 
max-running-requests: 45000 + context-length: 2200 + + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.82 + chunked-prefill-size: 36864 + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + + # CUDA graphs + cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768] + cuda-graph-max-bs: 768 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "32" + TOTAL_GPUS: "48" + diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/8k1k/disagg/stp/low-latency.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/8k1k/disagg/stp/low-latency.yaml new file mode 100644 index 000000000..766ecc632 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/8k1k/disagg/stp/low-latency.yaml @@ -0,0 +1,128 @@ +name: "gb300-8k1k-fp8-low-latency" + +model: + path: "dsr1-fp8" + container: "dynamo-sglang" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + decode_nodes: 1 + prefill_workers: 1 + decode_workers: 1 + gpus_per_node: 4 + +slurm: + time_limit: "02:00:00" + +backend: + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + # SGLANG_ENABLE_FLASHINFER_GEMM: "1" # deprecated in 0.5.7, --fp8-gemm-backend=flashinfer_trtllm + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + PYTHONUNBUFFERED: "1" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + SGLANG_ENABLE_JIT_DEEPGEMM: "false" + # SGLANG_ENABLE_FLASHINFER_GEMM: "1" # deprecated in 0.5.7, --fp8-gemm-backend=flashinfer_trtllm 
+ SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + MC_TE_METRIC: "true" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + + sglang_config: + prefill: + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "fp8" + moe-runner-backend: "flashinfer_trtllm" + fp8-gemm-backend: "flashinfer_trtllm" + disable-radix-cache: true + stream-interval: 10 + watchdog-timeout: 1000000 + context-length: 9300 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + mem-fraction-static: 0.95 + max-total-tokens: 32768 + chunked-prefill-size: 32768 + max-prefill-tokens: 32768 + cuda-graph-max-bs: 128 + max-running-requests: 128 + load-balance-method: "round_robin" + scheduler-recv-interval: 10 + enable-flashinfer-allreduce-fusion: false # to save mem + enable-symm-mem: false # to save mem + tensor-parallel-size: 4 + data-parallel-size: 1 + expert-parallel-size: 1 + + decode: + served-model-name: "deepseek-ai/DeepSeek-R1" + trust-remote-code: true + kv-cache-dtype: "fp8_e4m3" + attention-backend: "trtllm_mla" + quantization: "fp8" + moe-runner-backend: "flashinfer_trtllm" + fp8-gemm-backend: "flashinfer_trtllm" + disable-radix-cache: true + stream-interval: 10 + watchdog-timeout: 1000000 + context-length: 9300 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + mem-fraction-static: 0.85 + chunked-prefill-size: -1 # save mem + cuda-graph-max-bs: 128 + max-running-requests: 128 + scheduler-recv-interval: 1 # save mem + enable-flashinfer-allreduce-fusion: false # to save mem + enable-symm-mem: false # to save mem + 
prefill-round-robin-balance: true + tensor-parallel-size: 4 + data-parallel-size: 1 + expert-parallel-size: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "8" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/8k1k/disagg/stp/max.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/8k1k/disagg/stp/max.yaml new file mode 100644 index 000000000..a7da42825 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/8k1k/disagg/stp/max.yaml @@ -0,0 +1,178 @@ +# GB300 FP8 Max Throughput Configuration + +name: "gb300-8k1k-fp8-max" + +model: + path: "dsr1-fp8" + container: "dynamo-sglang" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "gb300" + prefill_nodes: 12 + prefill_workers: 6 + decode_nodes: 6 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + 
DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "768" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 8 + ep-size: 8 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + max-running-requests: 30000 + context-length: 9300 + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.75 + max-total-tokens: 524288 + chunked-prefill-size: 131072 + + # Request handling + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "normal" + ep-dispatch-algorithm: "dynamic" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 24 + dp-size: 24 + 
ep-size: 24 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + max-running-requests: 45000 + context-length: 9300 + + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.82 + chunked-prefill-size: 36864 + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + + # CUDA graphs + cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768] + cuda-graph-max-bs: 768 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "24" + TOTAL_GPUS: "72" + diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/8k1k/disagg/stp/mid.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/8k1k/disagg/stp/mid.yaml new file mode 100644 index 000000000..6c367ebf3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/gb300-fp8/8k1k/disagg/stp/mid.yaml @@ -0,0 +1,178 @@ +# GB300 FP8 Mid Throughput Configuration + +name: "gb300-8k1k-fp8-mid" + +model: + path: "dsr1-fp8" + container: "dynamo-sglang" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "gb300" + prefill_nodes: 10 + prefill_workers: 5 + decode_nodes: 8 + decode_workers: 1 + gpus_per_node: 4 + +backend: + + # Prefill-specific environment variables + prefill_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + # Decode-specific environment variables + decode_environment: + TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" + SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" + DYN_SKIP_SGLANG_LOG_FORMATTING: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "768" + MC_TE_METRIC: "true" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: 
"100000" + SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000" + SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" + MC_FORCE_MNNVL: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0" + SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" + PYTHONUNBUFFERED: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 8 + ep-size: 8 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + max-running-requests: 30000 + context-length: 9300 + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + disaggregation-transfer-backend: nixl + + # Prefill-specific mode + disaggregation-mode: "prefill" + + # Memory and token limits + mem-fraction-static: 0.75 + max-total-tokens: 524288 + chunked-prefill-size: 131072 + + # Request handling + load-balance-method: "round_robin" + + # Performance optimizations + disable-cuda-graph: true + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "normal" + ep-dispatch-algorithm: "dynamic" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + skip-tokenizer-init: true + trust-remote-code: true + disaggregation-transfer-backend: nixl + + # Parallelism + tp-size: 32 + dp-size: 32 + ep-size: 32 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "trtllm_mla" + kv-cache-dtype: "fp8_e4m3" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + stream-interval: 50 + decode-log-interval: 1000 + 
max-running-requests: 45000 + context-length: 9300 + + watchdog-timeout: 1000000 + disable-shared-experts-fusion: true + eplb-algorithm: "deepseek" + disaggregation-bootstrap-port: 30001 + + # Decode-specific mode + disaggregation-mode: "decode" + + # Memory and token limits + mem-fraction-static: 0.82 + chunked-prefill-size: 36864 + + # DeepEP configuration + moe-a2a-backend: "deepep" + deepep-mode: "low_latency" + ep-dispatch-algorithm: "static" + moe-dense-tp-size: 1 + enable-dp-lm-head: true + prefill-round-robin-balance: true + ep-num-redundant-experts: 32 + deepep-config: "/configs/deepep_config.json" + + # CUDA graphs + cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768] + cuda-graph-max-bs: 768 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "32" + TOTAL_GPUS: "72" + diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/mtp/h100-fp8-1p1d-max-dep-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/mtp/h100-fp8-1p1d-max-dep-mtp.yaml new file mode 100644 index 000000000..76f03d343 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/mtp/h100-fp8-1p1d-max-dep-mtp.yaml @@ -0,0 +1,121 @@ +name: "h100-fp8-1p1d-max-dep-mtp" + +model: + path: "dsr1-fp8" + container: "lmsysorg/sglang:v0.5.8-cu130" + precision: "fp8" + +frontend: + nginx_container: nginx-sqsh + +resources: + gpu_type: "h100" + prefill_nodes: 2 + prefill_workers: 1 + decode_nodes: 2 + decode_workers: 1 + gpus_per_node: 8 + +backend: + + # Prefill-specific environment variables + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + # Decode-specific environment variables + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Prefill capacity + max-running-requests: 4 + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 2048 + chunked-prefill-size: 2048 + + # Request handling + load-balance-method: "round_robin" + + 
# MTP (Multi-Token Prediction) + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 16 + ep-size: 16 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 1 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.85 + max-running-requests: 64 + cuda-graph-max-bs: 64 + + # MTP + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "32" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/mtp/h100-fp8-1p2d-max-tp-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/mtp/h100-fp8-1p2d-max-tp-mtp.yaml new file mode 100644 index 000000000..3c6647c24 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/mtp/h100-fp8-1p2d-max-tp-mtp.yaml @@ -0,0 +1,123 @@ +name: "h100-fp8-1p2d-max-tp-mtp" + +model: + path: "dsr1-fp8" + container: "lmsysorg/sglang:v0.5.8-cu130" + precision: "fp8" + +frontend: + nginx_container: nginx-sqsh + +resources: + gpu_type: "h100" + prefill_nodes: 2 + prefill_workers: 1 + decode_nodes: 4 + decode_workers: 2 + gpus_per_node: 8 + +backend: + + # Prefill-specific environment variables + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_ENABLE_SPEC_V2: "1" + + # Decode-specific environment variables + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_ENABLE_SPEC_V2: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + max-running-requests: 2 + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 2048 + chunked-prefill-size: 2048 + + # Request 
handling + load-balance-method: "round_robin" + + # MTP (Multi-Token Prediction) + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 1 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.9 + max-running-requests: 128 + cuda-graph-max-bs: 128 + + # MTP + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "48" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/stp/h100-fp8-1p1d-max-dep.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/stp/h100-fp8-1p1d-max-dep.yaml new file mode 100644 index 000000000..dc186726c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/stp/h100-fp8-1p1d-max-dep.yaml @@ -0,0 +1,109 @@ +name: "h100-fp8-1p1d-max-dep" + +model: + path: "dsr1-fp8" + container: "lmsysorg/sglang:v0.5.8-cu130" + precision: "fp8" + +resources: + gpu_type: "h100" + prefill_nodes: 2 + prefill_workers: 1 + decode_nodes: 2 + decode_workers: 1 + gpus_per_node: 8 + +frontend: + nginx_container: nginx-sqsh + +backend: + + # Prefill-specific environment variables + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + # Decode-specific environment variables + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Prefill capacity + max-running-requests: 4 + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 2048 + chunked-prefill-size: 2048 + + # Request handling + load-balance-method: "round_robin" + + decode: + # Model 
configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 16 + ep-size: 16 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 1 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.9 + max-running-requests: 64 + cuda-graph-max-bs: 64 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "32" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/stp/h100-fp8-1p2d-max-tp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/stp/h100-fp8-1p2d-max-tp.yaml new file mode 100644 index 000000000..1e4b20c13 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/1k1k/disagg/stp/h100-fp8-1p2d-max-tp.yaml @@ -0,0 +1,109 @@ +name: "h100-fp8-1p2d-max-tp" + +model: + path: "dsr1-fp8" + container: "lmsysorg/sglang:v0.5.8-cu130" + precision: "fp8" + +resources: + gpu_type: "h100" + prefill_nodes: 2 + prefill_workers: 1 + decode_nodes: 4 + decode_workers: 2 + gpus_per_node: 8 + +frontend: + nginx_container: nginx-sqsh + +backend: + + # Prefill-specific environment variables + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + # Decode-specific environment variables + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + sglang_config: + 
prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + max-running-requests: 2 + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 2048 + chunked-prefill-size: 2048 + + # Request handling + load-balance-method: "round_robin" + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 1 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.9 + max-running-requests: 128 + cuda-graph-max-bs: 128 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "48" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/mtp/h100-fp8-1p1d-max-dep-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/mtp/h100-fp8-1p1d-max-dep-mtp.yaml new file mode 100644 index 000000000..17b87aba7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/mtp/h100-fp8-1p1d-max-dep-mtp.yaml @@ -0,0 +1,123 @@ +name: "h100-fp8-1p1d-max-dep-mtp" + +model: + path: "dsr1-fp8" + container: "lmsysorg/sglang:v0.5.8-cu130" + precision: "fp8" + +resources: + gpu_type: "h100" + prefill_nodes: 2 + prefill_workers: 1 + decode_nodes: 2 + decode_workers: 1 + gpus_per_node: 8 + +frontend: + nginx_container: nginx-sqsh + +backend: + + # Prefill-specific environment variables + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_ENABLE_SPEC_V2: "1" + + # Decode-specific environment variables + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_ENABLE_SPEC_V2: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Prefill capacity + max-running-requests: 4 + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 2048 + chunked-prefill-size: 2048 + + # 
Request handling + load-balance-method: "round_robin" + + # MTP (Multi-Token Prediction) + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 16 + ep-size: 16 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 1 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.85 + max-running-requests: 64 + cuda-graph-max-bs: 64 + + # MTP + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "32" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/mtp/h100-fp8-1p1d-max-tp-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/mtp/h100-fp8-1p1d-max-tp-mtp.yaml new file mode 100644 index 000000000..4dbe673c6 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/mtp/h100-fp8-1p1d-max-tp-mtp.yaml @@ -0,0 +1,123 @@ +name: "h100-fp8-1p1d-max-tp-mtp" + +model: + path: "dsr1-fp8" + container: "lmsysorg/sglang:v0.5.8-cu130" + precision: "fp8" + +resources: + gpu_type: "h100" + prefill_nodes: 2 + prefill_workers: 1 + decode_nodes: 2 + decode_workers: 1 + gpus_per_node: 8 + +frontend: + nginx_container: nginx-sqsh + +backend: + + # Prefill-specific environment variables + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_ENABLE_SPEC_V2: "1" + + # Decode-specific environment variables + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_ENABLE_SPEC_V2: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Prefill capacity + max-running-requests: 2 + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 2048 + chunked-prefill-size: 2048 + + # 
Request handling + load-balance-method: "round_robin" + + # MTP (Multi-Token Prediction) + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 1 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.9 + max-running-requests: 128 + cuda-graph-max-bs: 128 + + # MTP (Multi-Token Prediction) + speculative-algorithm: "EAGLE" + speculative-num-steps: 2 + speculative-eagle-topk: 1 + speculative-num-draft-tokens: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "32" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/stp/h100-fp8-1p1d-max-dep.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/stp/h100-fp8-1p1d-max-dep.yaml new file mode 100644 index 000000000..dc186726c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/stp/h100-fp8-1p1d-max-dep.yaml @@ -0,0 +1,109 @@ +name: "h100-fp8-1p1d-max-dep" + +model: + path: "dsr1-fp8" + container: "lmsysorg/sglang:v0.5.8-cu130" + precision: "fp8" + +resources: + gpu_type: "h100" + prefill_nodes: 2 + prefill_workers: 1 + decode_nodes: 2 + decode_workers: 1 + gpus_per_node: 8 + +frontend: + nginx_container: nginx-sqsh + +backend: + + # Prefill-specific environment variables + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + # Decode-specific environment variables + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Prefill capacity + max-running-requests: 4 + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 2048 + chunked-prefill-size: 2048 + + # Request handling + load-balance-method: "round_robin" + + decode: + # Model 
configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 16 + ep-size: 16 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 1 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.9 + max-running-requests: 64 + cuda-graph-max-bs: 64 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "32" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/stp/h100-fp8-1p1d-max-tp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/stp/h100-fp8-1p1d-max-tp.yaml new file mode 100644 index 000000000..120b9270c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h100-fp8/8k1k/disagg/stp/h100-fp8-1p1d-max-tp.yaml @@ -0,0 +1,109 @@ +name: "h100-fp8-1p1d-max-tp" + +model: + path: "dsr1-fp8" + container: "lmsysorg/sglang:v0.5.8-cu130" + precision: "fp8" + +resources: + gpu_type: "h100" + prefill_nodes: 2 + prefill_workers: 1 + decode_nodes: 2 + decode_workers: 1 + gpus_per_node: 8 + +frontend: + nginx_container: nginx-sqsh + +backend: + + # Prefill-specific environment variables + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + # Decode-specific environment variables + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + + sglang_config: + 
prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Prefill capacity + max-running-requests: 2 + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.6 + max-prefill-tokens: 2048 + chunked-prefill-size: 2048 + + # Request handling + load-balance-method: "round_robin" + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + watchdog-timeout: 1000000 + + # Parallelism + tp-size: 16 + dp-size: 1 + ep-size: 1 + enable-dp-attention: false + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 1 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.9 + max-running-requests: 128 + cuda-graph-max-bs: 128 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "16"
+    DECODE_GPUS: "16"
+    TOTAL_GPUS: "32"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/mtp/bs256-1p6d-dep-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/mtp/bs256-1p6d-dep-mtp.yaml
new file mode 100644
index 000000000..d9177b2e1
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/mtp/bs256-1p6d-dep-mtp.yaml
@@ -0,0 +1,128 @@
+name: "bs256-1p6d-h200-fp8-mtp"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 6
+  decode_workers: 6
+  gpus_per_node: 8
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  # Decode-specific environment variables
+  decode_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 8
+      ep-size: 8
+      enable-dp-attention: true
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      # used to be 512
+      max-running-requests: 64
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      # used to be 0.75
+      mem-fraction-static: 0.82
+      max-prefill-tokens: 65536
+      # used to be 262144
+      chunked-prefill-size: 65536
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 8
+      ep-size: 8
+      enable-dp-attention: true
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-running-requests: 128
+      cuda-graph-max-bs: 128
+
+      # MTP settings
+      speculative-algorithm: "EAGLE"
+      speculative-num-steps: 2
+      speculative-eagle-topk: 1
+      speculative-num-draft-tokens: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "56"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/mtp/bs256-1p6d-tp-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/mtp/bs256-1p6d-tp-mtp.yaml
new file mode 100644
index 000000000..bbdea98a4
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/mtp/bs256-1p6d-tp-mtp.yaml
@@ -0,0 +1,124 @@
+name: "bs256-1p6d-h200-fp8-mtp"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 6
+  decode_workers: 6
+  gpus_per_node: 8
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  # Decode-specific environment variables
+  decode_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      max-running-requests: 512
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.7
+      max-prefill-tokens: 163840
+      chunked-prefill-size: 163840
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-running-requests: 128
+      cuda-graph-max-bs: 128
+
+      # MTP settings
+      speculative-algorithm: "EAGLE"
+      speculative-num-steps: 2
+      speculative-eagle-topk: 1
+      speculative-num-draft-tokens: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "56"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/mtp/low-latency-1p9d-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/mtp/low-latency-1p9d-mtp.yaml
new file mode 100644
index 000000000..2569666c2
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/mtp/low-latency-1p9d-mtp.yaml
@@ -0,0 +1,123 @@
+name: "low-latency-1p9d-h200-fp8-mtp"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 9
+  decode_workers: 9
+  gpus_per_node: 8
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  # Decode-specific environment variables
+  decode_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      max-running-requests: 256
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-prefill-tokens: 163840
+      chunked-prefill-size: 163840
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-running-requests: 64
+      cuda-graph-max-bs: 64
+
+      # MTP settings
+      speculative-algorithm: "EAGLE"
+      speculative-num-steps: 2
+      speculative-eagle-topk: 1
+      speculative-num-draft-tokens: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "80"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/stp/bs256-1p6d-dep.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/stp/bs256-1p6d-dep.yaml
new file mode 100644
index 000000000..0d098c736
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/stp/bs256-1p6d-dep.yaml
@@ -0,0 +1,116 @@
+name: "bs256-1p6d-h200-fp8"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 6
+  decode_workers: 6
+  gpus_per_node: 8
+
+backend:
+
+  prefill_environment:
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  decode_environment:
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 8
+      ep-size: 8
+      enable-dp-attention: true
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      watchdog-timeout: 1000000
+      max-running-requests: 512
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-prefill-tokens: 65536
+      chunked-prefill-size: 262144
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 8
+      ep-size: 8
+      enable-dp-attention: true
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+      watchdog-timeout: 1000000
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-running-requests: 512
+      cuda-graph-max-bs: 512
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "56"
+
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/stp/bs256-1p6d-tp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/stp/bs256-1p6d-tp.yaml
new file mode 100644
index 000000000..af5aded2c
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/stp/bs256-1p6d-tp.yaml
@@ -0,0 +1,115 @@
+name: "bs256-1p6d-h200-fp8"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 6
+  decode_workers: 6
+  gpus_per_node: 8
+
+backend:
+
+  prefill_environment:
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  decode_environment:
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      watchdog-timeout: 1000000
+      max-running-requests: 512
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.7
+      max-prefill-tokens: 163840
+      chunked-prefill-size: 163840
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+      watchdog-timeout: 1000000
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-running-requests: 512
+      cuda-graph-max-bs: 512
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "56"
+
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/stp/low-latency-1p9d.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/stp/low-latency-1p9d.yaml
new file mode 100644
index 000000000..9cfc153f2
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/1k1k/disagg/stp/low-latency-1p9d.yaml
@@ -0,0 +1,113 @@
+name: "low-latency-1p9d-h200-fp8"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 9
+  decode_workers: 9
+  gpus_per_node: 8
+
+backend:
+
+  prefill_environment:
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  decode_environment:
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      watchdog-timeout: 1000000
+      max-running-requests: 256
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-prefill-tokens: 163840
+      chunked-prefill-size: 163840
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+      watchdog-timeout: 1000000
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-running-requests: 256
+      cuda-graph-max-bs: 256
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "80"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs128-1p1d-dep-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs128-1p1d-dep-mtp.yaml
new file mode 100644
index 000000000..292289a7e
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs128-1p1d-dep-mtp.yaml
@@ -0,0 +1,125 @@
+name: "bs128-1p1d-dep-h200-fp8-mtp"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 1
+  decode_workers: 1
+  gpus_per_node: 8
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  # Decode-specific environment variables
+  decode_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      max-running-requests: 16
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-prefill-tokens: 163840
+      chunked-prefill-size: 163840
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 8
+      ep-size: 8
+      enable-dp-attention: true
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.85
+      max-running-requests: 192
+      cuda-graph-max-bs: 192
+
+      # MTP settings
+      speculative-algorithm: "EAGLE"
+      speculative-num-steps: 2
+      speculative-eagle-topk: 1
+      speculative-num-draft-tokens: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "16"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs16-1p3d-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs16-1p3d-mtp.yaml
new file mode 100644
index 000000000..76d9f6b1f
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs16-1p3d-mtp.yaml
@@ -0,0 +1,123 @@
+name: "bs16-1p3d-h200-fp8-mtp"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 3
+  decode_workers: 3
+  gpus_per_node: 8
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  # Decode-specific environment variables
+  decode_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      max-running-requests: 16
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-prefill-tokens: 32768
+      chunked-prefill-size: 32768
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-running-requests: 32
+      cuda-graph-max-bs: 32
+
+      # MTP settings
+      speculative-algorithm: "EAGLE"
+      speculative-num-steps: 2
+      speculative-eagle-topk: 1
+      speculative-num-draft-tokens: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "32"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs4-1p7d-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs4-1p7d-mtp.yaml
new file mode 100644
index 000000000..01a278260
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs4-1p7d-mtp.yaml
@@ -0,0 +1,123 @@
+name: "bs4-1p7d-h200-fp8-mtp"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 7
+  decode_workers: 7
+  gpus_per_node: 8
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  # Decode-specific environment variables
+  decode_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      max-running-requests: 16
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-prefill-tokens: 32768
+      chunked-prefill-size: 32768
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-running-requests: 2
+      cuda-graph-max-bs: 2
+
+      # MTP settings
+      speculative-algorithm: "EAGLE"
+      speculative-num-steps: 2
+      speculative-eagle-topk: 1
+      speculative-num-draft-tokens: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "64"
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs64-2p3d-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs64-2p3d-mtp.yaml
new file mode 100644
index 000000000..e426c78ba
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs64-2p3d-mtp.yaml
@@ -0,0 +1,132 @@
+name: "bs64-2p3d-h200-fp8-mtp"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 2
+  prefill_workers: 2
+  decode_nodes: 3
+  decode_workers: 3
+  gpus_per_node: 8
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  # Decode-specific environment variables
+  decode_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      max-running-requests: 16
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-prefill-tokens: 32768
+      chunked-prefill-size: 32768
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      context-length: 72000
+      max-total-tokens: 128000
+      # Memory and token limits
+      mem-fraction-static: 0.75
+      max-running-requests: 16
+      cuda-graph-max-bs: 16
+
+      # MTP settings
+      speculative-algorithm: "EAGLE"
+      speculative-num-steps: 2
+      speculative-eagle-topk: 1
+      speculative-num-draft-tokens: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    MODEL_NAME: "deepseek-ai/DeepSeek-R1"
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "40"
+
+# benchmark:
+#   type: "gpqa"
+#   num_examples: 198
+#   repeat: 4
+#   num_threads: 32
+#   max_tokens: 64000
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs8-1p6d-mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs8-1p6d-mtp.yaml
new file mode 100644
index 000000000..2922ba1df
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/mtp/bs8-1p6d-mtp.yaml
@@ -0,0 +1,124 @@
+name: "bs8-1p6d-h200-fp8-mtp"
+
+model:
+  path: "dsr1"
+  container: "lmsysorg/sglang:v0.5.8.post1-cu130"
+  precision: "fp8"
+
+frontend:
+  nginx_container: nginx
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 6
+  decode_workers: 6
+  gpus_per_node: 8
+
+backend:
+
+  # Prefill-specific environment variables
+  prefill_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  # Decode-specific environment variables
+  decode_environment:
+    SGLANG_ENABLE_SPEC_V2: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+
+  sglang_config:
+    prefill:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Radix cache disabled
+      disable-radix-cache: true
+
+      # Other flags
+      # stream-interval: 50
+      max-running-requests: 16
+
+
+      # Prefill-specific mode
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-prefill-tokens: 32768
+      chunked-prefill-size: 32768
+
+      # Request handling
+      load-balance-method: "round_robin"
+
+
+    decode:
+      # Model configuration
+      served-model-name: "deepseek-ai/DeepSeek-R1"
+      model-path: "/model/"
+      skip-tokenizer-init: true
+      trust-remote-code: true
+      watchdog-timeout: 1000000
+
+      # Parallelism
+      tp-size: 8
+      dp-size: 1
+      ep-size: 1
+
+      # KV cache and attention
+      attention-backend: "flashinfer"
+
+      # Other flags
+      disable-radix-cache: true
+      stream-interval: 10
+
+      # Disagg
+      disaggregation-bootstrap-port: 30001
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: nixl
+
+      # Memory and token limits
+      mem-fraction-static: 0.82
+      max-running-requests: 16
+      cuda-graph-max-bs: 16
+
+      # MTP settings
+      speculative-algorithm: "EAGLE"
+      speculative-num-steps: 2
+      speculative-eagle-topk: 1
+      speculative-num-draft-tokens: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs128-1p1d-dep.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs128-1p1d-dep.yaml new file mode 100644 index 000000000..e86438436 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs128-1p1d-dep.yaml @@ -0,0 +1,116 @@ +name: "bs128-1p1d-dep-h200-fp8" + +model: + path: "dsr1" + container: "lmsysorg/sglang:v0.5.8.post1-cu130" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 1 + decode_workers: 1 + gpus_per_node: 8 + +backend: + + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 1 + ep-size: 1 + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + # stream-interval: 50 + watchdog-timeout: 1000000 + max-running-requests: 16 + + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: 
nixl + + # Memory and token limits + mem-fraction-static: 0.75 + max-prefill-tokens: 163840 + chunked-prefill-size: 163840 + + # Request handling + load-balance-method: "round_robin" + + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 8 + ep-size: 8 + enable-dp-attention: true + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 10 + watchdog-timeout: 1000000 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.88 + max-running-requests: 256 + cuda-graph-max-bs: 256 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs16-1p3d.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs16-1p3d.yaml new file mode 100644 index 000000000..75e36493b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs16-1p3d.yaml @@ -0,0 +1,114 @@ +name: "bs16-1p3d-h200-fp8" + +model: + path: "dsr1" + container: "lmsysorg/sglang:v0.5.8.post1-cu130" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 3 + decode_workers: 3 + gpus_per_node: 8 + +backend: + + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + 
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 1 + ep-size: 1 + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + # stream-interval: 50 + watchdog-timeout: 1000000 + max-running-requests: 16 + + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.82 + max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + + # Request handling + load-balance-method: "round_robin" + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 1 + ep-size: 1 + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 10 + watchdog-timeout: 1000000 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.82 + max-running-requests: 32 + cuda-graph-max-bs: 32 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs4-1p7d.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs4-1p7d.yaml new file mode 100644 index 000000000..56aa58d11 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs4-1p7d.yaml @@ -0,0 +1,114 @@ +name: "bs4-1p7d-h200-fp8" + +model: + path: "dsr1" + container: "lmsysorg/sglang:v0.5.8.post1-cu130" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 7 + decode_workers: 7 + gpus_per_node: 8 + +backend: + + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 1 + ep-size: 1 + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + # stream-interval: 50 + watchdog-timeout: 1000000 + max-running-requests: 16 + + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and 
token limits + mem-fraction-static: 0.82 + max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + + # Request handling + load-balance-method: "round_robin" + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 1 + ep-size: 1 + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 10 + watchdog-timeout: 1000000 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.82 + max-running-requests: 8 + cuda-graph-max-bs: 8 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "64" + diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs64-2p3d.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs64-2p3d.yaml new file mode 100644 index 000000000..7c876e3cf --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs64-2p3d.yaml @@ -0,0 +1,122 @@ +name: "bs64-2p3d-h200-fp8" + +model: + path: "dsr1" + container: "lmsysorg/sglang:v0.5.8.post1-cu130" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "h200" + prefill_nodes: 2 + prefill_workers: 2 + decode_nodes: 3 + decode_workers: 3 + gpus_per_node: 8 + +backend: + + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + 
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 1 + ep-size: 1 + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + # stream-interval: 50 + watchdog-timeout: 1000000 + max-running-requests: 16 + + + # Prefill-specific mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.82 + max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + + # Request handling + load-balance-method: "round_robin" + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 1 + ep-size: 1 + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 10 + watchdog-timeout: 1000000 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + #context-length: 72000 + # max-total-tokens: 128000 + # Memory and token limits + mem-fraction-static: 0.82 + max-running-requests: 128 + cuda-graph-max-bs: 128 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +# benchmark: +# type: "gpqa" +# num_examples: 198 +# repeat: 4 +# num_threads: 32 +# max_tokens: 64000 \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs8-1p6d.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs8-1p6d.yaml new file mode 100644 index 000000000..5eeba8f61 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/h200-fp8/8k1k/disagg/stp/bs8-1p6d.yaml @@ -0,0 +1,115 @@ +name: "bs8-1p6d-h200-fp8" + +model: + path: "dsr1" + container: "lmsysorg/sglang:v0.5.8.post1-cu130" + precision: "fp8" + +frontend: + nginx_container: nginx + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + decode_nodes: 6 + decode_workers: 6 + gpus_per_node: 8 + +backend: + + prefill_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + decode_environment: + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" + + sglang_config: + prefill: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 1 + ep-size: 1 + + # KV cache and attention + attention-backend: "flashinfer" + + # Radix cache disabled + disable-radix-cache: true + + # Other flags + # stream-interval: 50 + watchdog-timeout: 1000000 + max-running-requests: 16 + + + # Prefill-specific 
mode + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.82 + max-prefill-tokens: 32768 + chunked-prefill-size: 32768 + + # Request handling + load-balance-method: "round_robin" + + + decode: + # Model configuration + served-model-name: "deepseek-ai/DeepSeek-R1" + model-path: "/model/" + skip-tokenizer-init: true + trust-remote-code: true + + # Parallelism + tp-size: 8 + dp-size: 1 + ep-size: 1 + + # KV cache and attention + attention-backend: "flashinfer" + + # Other flags + disable-radix-cache: true + stream-interval: 10 + watchdog-timeout: 1000000 + + # Disagg + disaggregation-bootstrap-port: 30001 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Memory and token limits + mem-fraction-static: 0.82 + max-running-requests: 16 + cuda-graph-max-bs: 16 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-R1" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" + diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch64_eplb0_mtp2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch64_eplb0_mtp2.yaml new file mode 100644 index 000000000..7e59b1617 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch64_eplb0_mtp2.yaml @@ -0,0 +1,128 @@ +name: "ctx1_gen2_dep8_batch64_eplb0_mtp2" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 192 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 58 + - 60 + - 62 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + +# InferenceX bench-serving wrapper, invoked via srt-slurm `benchmark.type: custom`. +# Most env (MODEL, ISL, OSL, CONC_LIST, DISAGG) is exported by +# benchmark-multinode-tmpl.yml and propagated through srtctl → srun → pyxis, +# so the recipe only carries per-recipe knobs that have no workflow source. +# See benchmarks/multi_node/srt_bench.sh for the full env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" # per prefill worker + DECODE_GPUS: "8" # per decode worker + TOTAL_GPUS: "20" # sum across all workers + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp3.yaml new file mode 100644 index 000000000..6b34b2fb7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp3.yaml @@ -0,0 +1,123 @@ +name: "ctx1_gen5_dep8_batch16_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + 
backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 13 + - 14 + - 15 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml new file mode 100644 index 000000000..4445c953b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml @@ -0,0 +1,118 @@ +name: "ctx1_gen5_tep8_batch1_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml new file mode 100644 index 000000000..b7d1c9260 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml @@ -0,0 +1,132 @@ +name: "ctx1_gen5_tep8_batch32_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 5 + - 8 + - 9 + - 10 + - 16 + - 17 + - 18 + - 29 + - 30 + - 31 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx3_gen4_dep8_batch128_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx3_gen4_dep8_batch128_eplb0_mtp1.yaml new file mode 100644 index 000000000..d5def7a35 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx3_gen4_dep8_batch128_eplb0_mtp1.yaml @@ -0,0 +1,126 @@ +name: "ctx3_gen4_dep8_batch128_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 4 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false 
+ free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 128 + max_num_tokens: 256 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 122 + - 124 + - 126 + - 128 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx3_gen5_dep4_batch512_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx3_gen5_dep4_batch512_eplb0_mtp1.yaml new file mode 100644 index 000000000..dde552b51 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/mtp/ctx3_gen5_dep4_batch512_eplb0_mtp1.yaml @@ -0,0 +1,132 @@ +name: "ctx3_gen5_dep4_batch512_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 3 + gpus_per_decode: 4 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 512 + max_num_tokens: 1024 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 192 + - 256 + - 384 + - 448 + - 506 + - 508 + - 510 + - 512 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "32" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml new file mode 100644 index 000000000..275c140a5 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml @@ -0,0 +1,123 @@ +name: "ctx1_gen1_dep8_batch512_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false 
+ free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 512 + max_num_tokens: 512 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + - 384 + - 448 + - 508 + - 510 + - 512 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "12" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen2_dep8_batch128_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen2_dep8_batch128_eplb0_mtp0.yaml new file mode 100644 index 000000000..ae7ba8483 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen2_dep8_batch128_eplb0_mtp0.yaml @@ -0,0 +1,120 @@ +name: "ctx1_gen2_dep8_batch128_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 2 + + 
gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 122 + - 124 + - 126 + - 128 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "20" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_dep8_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_dep8_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..16961a5e0 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_dep8_batch32_eplb0_mtp0.yaml @@ -0,0 +1,118 @@ +name: "ctx1_gen5_dep8_batch32_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 26 + - 28 + - 30 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..ac84ded85 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml @@ -0,0 +1,112 @@ +name: "ctx1_gen5_tep8_batch1_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + 
prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..930f2520f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch32_eplb0_mtp0.yaml @@ -0,0 +1,133 @@ +name: "ctx1_gen5_tep8_batch32_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 5 + - 8 + - 9 + - 10 + - 11 + - 12 + - 13 + - 14 + - 15 + - 16 + - 18 + - 20 + - 22 + - 24 + - 26 + - 28 + - 30 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen6_tep8_batch64_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen6_tep8_batch64_eplb0_mtp0.yaml new file mode 100644 index 000000000..d90c6f3b0 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/1k1k/disagg/stp/ctx1_gen6_tep8_batch64_eplb0_mtp0.yaml @@ -0,0 +1,122 @@ +name: "ctx1_gen6_tep8_batch64_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 6 + decode_nodes: 6 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 56 + - 58 + - 60 + - 62 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "52" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen1_dep8_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen1_dep8_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..1017f8feb --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen1_dep8_batch8_eplb0_mtp3.yaml @@ -0,0 +1,122 @@ +name: "ctx1_gen1_dep8_batch8_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + 
decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 5 + - 6 + - 7 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. 
See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "12" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen3_tep8_batch16_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen3_tep8_batch16_eplb0_mtp3.yaml new file mode 100644 index 000000000..4c919e2e1 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen3_tep8_batch16_eplb0_mtp3.yaml @@ -0,0 +1,129 @@ +name: "ctx1_gen3_tep8_batch16_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 3 + decode_nodes: 3 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: 
true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 9 + - 10 + - 11 + - 12 + - 13 + - 14 + - 15 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml new file mode 100644 index 000000000..dec75f377 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml @@ -0,0 +1,118 @@ +name: "ctx1_gen5_tep8_batch1_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen5_tep8_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen5_tep8_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..1c8582c31 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx1_gen5_tep8_batch8_eplb0_mtp3.yaml @@ -0,0 +1,125 @@ +name: "ctx1_gen5_tep8_batch8_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 5 + - 6 + - 7 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3.yaml new file mode 100644 index 000000000..37ab36d1f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3.yaml @@ -0,0 +1,126 @@ +name: "ctx3_gen1_dep8_batch64_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 48 + - 56 + - 60 + - 62 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "20" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx5_gen1_dep8_batch192_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx5_gen1_dep8_batch192_eplb0_mtp1.yaml new file mode 100644 index 000000000..693c2221c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx5_gen1_dep8_batch192_eplb0_mtp1.yaml @@ -0,0 +1,130 @@ +name: "ctx5_gen1_dep8_batch192_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 3 + prefill_workers: 5 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false 
+ free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 192 + max_num_tokens: 384 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 130 + - 132 + - 134 + - 136 + - 138 + - 168 + - 192 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx5_gen2_dep8_batch32_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx5_gen2_dep8_batch32_eplb0_mtp3.yaml new file mode 100644 index 000000000..ffbc9ae61 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/mtp/ctx5_gen2_dep8_batch32_eplb0_mtp3.yaml @@ -0,0 +1,125 @@ +name: "ctx5_gen2_dep8_batch32_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 3 + prefill_workers: 5 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 20 + - 24 + - 28 + - 30 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..b2c967541 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx1_gen5_tep8_batch1_eplb0_mtp0.yaml @@ -0,0 +1,113 @@ +name: "ctx1_gen5_tep8_batch1_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx1_gen5_tep8_batch8_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx1_gen5_tep8_batch8_eplb0_mtp0.yaml new file mode 100644 index 000000000..0f88bb006 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx1_gen5_tep8_batch8_eplb0_mtp0.yaml @@ -0,0 +1,126 @@ +name: "ctx1_gen5_tep8_batch8_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + 
prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 5 + - 6 + - 7 + - 8 + - 9 + - 10 + - 12 + - 13 + - 14 + - 15 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx2_gen5_tep8_batch64_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx2_gen5_tep8_batch64_eplb0_mtp0.yaml new file mode 100644 index 000000000..738dd82ea --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx2_gen5_tep8_batch64_eplb0_mtp0.yaml @@ -0,0 +1,121 @@ +name: "ctx2_gen5_tep8_batch64_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 58 + - 60 + - 62 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "48" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx4_gen1_dep8_batch192_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx4_gen1_dep8_batch192_eplb0_mtp0.yaml new file mode 100644 index 000000000..22681d23a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx4_gen1_dep8_batch192_eplb0_mtp0.yaml @@ -0,0 +1,124 @@ +name: "ctx4_gen1_dep8_batch192_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 2 + prefill_workers: 4 + gpus_per_prefill: 4 + + decode_workers: 1 + 
decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 192 + max_num_tokens: 192 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 152 + - 160 + - 168 + - 176 + - 184 + - 190 + - 192 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx4_gen3_dep8_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx4_gen3_dep8_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..6e233467a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx4_gen3_dep8_batch32_eplb0_mtp0.yaml @@ -0,0 +1,117 @@ +name: "ctx4_gen3_dep8_batch32_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 2 + prefill_workers: 4 + gpus_per_prefill: 4 + + decode_workers: 3 + decode_nodes: 3 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 28 + - 30 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx7_gen2_dep8_batch128_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx7_gen2_dep8_batch128_eplb0_mtp0.yaml new file mode 100644 index 000000000..99f0ea58f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp4/8k1k/disagg/stp/ctx7_gen2_dep8_batch128_eplb0_mtp0.yaml @@ -0,0 +1,120 @@ +name: "ctx7_gen2_dep8_batch128_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b200" + prefill_nodes: 4 + prefill_workers: 7 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + 
prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 116 + - 120 + - 124 + - 128 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + nginx_container: "nginx-sqsh" + type: "dynamo" + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen2_dep8_batch768_eplb0_mtp2_1600.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen2_dep8_batch768_eplb0_mtp2_1600.yaml new file mode 100644 index 000000000..0fbd25b82 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen2_dep8_batch768_eplb0_mtp2_1600.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen2_dep8_batch768_eplb0_mtp2_1600 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 2 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: 
+ dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 768 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 768 + max_num_tokens: 2304 + max_seq_len: 2176 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen3_dep8_batch384_eplb0_mtp3_1184.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen3_dep8_batch384_eplb0_mtp3_1184.yaml new file mode 100644 index 000000000..fe3ab4c6c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen3_dep8_batch384_eplb0_mtp3_1184.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen3_dep8_batch384_eplb0_mtp3_1184 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 3 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 384 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 384 + max_num_tokens: 1536 + max_seq_len: 2176 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen4_dep8_batch256_eplb0_mtp3_1024.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen4_dep8_batch256_eplb0_mtp3_1024.yaml new file mode 100644 index 000000000..ab8b4d1c6 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen4_dep8_batch256_eplb0_mtp3_1024.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen4_dep8_batch256_eplb0_mtp3_1024 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 4 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 256 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 256 + max_num_tokens: 1024 + max_seq_len: 2176 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen7_dep8_batch128_eplb0_mtp3_896.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen7_dep8_batch128_eplb0_mtp3_896.yaml new file mode 100644 index 000000000..a2665a5a4 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen7_dep8_batch128_eplb0_mtp3_896.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen7_dep8_batch128_eplb0_mtp3_896 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 7 + decode_nodes: 7 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 128 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 128 + max_num_tokens: 512 + max_seq_len: 2176 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_8.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_8.yaml new file mode 100644 index 000000000..057fcbd77 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_8.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen8_tp8_batch1_eplb0_mtp3_8 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 8 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: null + 
disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 1 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 2176 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "72" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch32_eplb0_mtp3_256.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch32_eplb0_mtp3_256.yaml new file mode 100644 index 000000000..e42404618 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch32_eplb0_mtp3_256.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen8_tp8_batch32_eplb0_mtp3_256 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 8 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: 
null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 32 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 32 + max_num_tokens: 256 + max_seq_len: 2176 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "72" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch4_eplb0_mtp3_32.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch4_eplb0_mtp3_32.yaml new file mode 100644 index 000000000..042c00923 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch4_eplb0_mtp3_32.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen8_tp8_batch4_eplb0_mtp3_32 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 8 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: null + 
disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 4 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 4 + max_num_tokens: 256 + max_seq_len: 2176 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "72" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch8_eplb0_mtp3_64.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch8_eplb0_mtp3_64.yaml new file mode 100644 index 000000000..9ad27278a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch8_eplb0_mtp3_64.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen8_tp8_batch8_eplb0_mtp3_64 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 8 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: null + 
disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 8 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 8 + max_num_tokens: 256 + max_seq_len: 2176 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "72" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4096.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4096.yaml new file mode 100644 index 000000000..65aeecbfa --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4096.yaml @@ -0,0 +1,121 @@ +name: ctx1_gen1_dep8_batch512_eplb0_mtp0_4096 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 512 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 512 + max_num_tokens: 4096 + max_seq_len: 2176 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 40 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_128.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_128.yaml new file mode 100644 index 000000000..6159a29ad --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_128.yaml @@ -0,0 +1,121 @@ +name: ctx1_gen3_tp8_batch1024_eplb0_mtp0_128 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 3 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 1024 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 1024 + max_num_tokens: 4096 + max_seq_len: 2176 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_32.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_32.yaml new file mode 100644 index 000000000..58d800b6a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_32.yaml @@ -0,0 +1,121 @@ +name: ctx1_gen3_tp8_batch1024_eplb0_mtp0_32 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 3 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 12 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 12 + max_num_tokens: 12 + max_seq_len: 2176 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_4.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_4.yaml new file mode 100644 index 000000000..0ed6396a0 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen3_tp8_batch1024_eplb0_mtp0_4.yaml @@ -0,0 +1,121 @@ +name: ctx1_gen3_tp8_batch1024_eplb0_mtp0_4 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 3 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + cuda_graph_config: 
null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 1 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 2176 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen5_dep8_batch48_eplb0_mtp0_1920.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen5_dep8_batch48_eplb0_mtp0_1920.yaml new file mode 100644 index 000000000..875279c47 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx1_gen5_dep8_batch48_eplb0_mtp0_1920.yaml @@ -0,0 +1,121 @@ +name: ctx1_gen5_dep8_batch48_eplb0_mtp0_1920 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 48 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 48 + max_num_tokens: 4096 + max_seq_len: 2176 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "48" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx2_gen5_dep8_batch128_eplb0_mtp0_5152.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx2_gen5_dep8_batch128_eplb0_mtp0_5152.yaml new file mode 100644 index 000000000..c277966c4 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/1k1k/disagg/stp/ctx2_gen5_dep8_batch128_eplb0_mtp0_5152.yaml @@ -0,0 +1,121 @@ +name: ctx2_gen5_dep8_batch128_eplb0_mtp0_5152 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 8 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1152 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 1152 + cuda_graph_config: + enable_padding: true + max_batch_size: 128 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 128 + max_num_tokens: 4096 + max_seq_len: 2176 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen2_tp8_batch32_eplb0_mtp3_8.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen2_tp8_batch32_eplb0_mtp3_8.yaml new file mode 100644 index 000000000..7f03ae1e3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen2_tp8_batch32_eplb0_mtp3_8.yaml @@ -0,0 +1,129 @@ +name: ctx1_gen2_tp8_batch32_eplb0_mtp3_8 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 2 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: null + 
disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + attention_dp_config: + enable_balance: true + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 4 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch16_eplb0_mtp3_64.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch16_eplb0_mtp3_64.yaml new file mode 100644 index 000000000..712a67416 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch16_eplb0_mtp3_64.yaml @@ -0,0 +1,129 @@ +name: ctx1_gen4_tp8_batch16_eplb0_mtp3_64 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 4 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: 
null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + attention_dp_config: + enable_balance: true + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 16 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_48.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_48.yaml new file mode 100644 index 000000000..4212abd06 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_48.yaml @@ -0,0 +1,129 @@ +name: ctx1_gen6_tp8_batch8_eplb0_mtp3_48 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 6 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: null + 
disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + attention_dp_config: + enable_balance: true + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 8 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_8.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_8.yaml new file mode 100644 index 000000000..f3e356085 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx1_gen6_tp8_batch8_eplb0_mtp3_8.yaml @@ -0,0 +1,129 @@ +name: ctx1_gen6_tp8_batch8_eplb0_mtp3_8 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 6 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: null + 
disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + attention_dp_config: + enable_balance: true + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 1 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx2_gen1_dep8_batch32_eplb0_mtp3_288.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx2_gen1_dep8_batch32_eplb0_mtp3_288.yaml new file mode 100644 index 000000000..cda4cecfd --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx2_gen1_dep8_batch32_eplb0_mtp3_288.yaml @@ -0,0 +1,131 @@ +name: ctx2_gen1_dep8_batch32_eplb0_mtp3_288 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + attention_dp_config: + batching_wait_iters: 0 + enable_balance: true + timeout_iters: 60 + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 32 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 32 + max_num_tokens: 1024 + max_seq_len: 9344 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx2_gen3_dep8_batch8_eplb0_mtp3_224.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx2_gen3_dep8_batch8_eplb0_mtp3_224.yaml new file mode 100644 index 000000000..1cdb3af76 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx2_gen3_dep8_batch8_eplb0_mtp3_224.yaml @@ -0,0 +1,131 @@ +name: ctx2_gen3_dep8_batch8_eplb0_mtp3_224 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 3 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: 
null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + attention_dp_config: + batching_wait_iters: 0 + enable_balance: true + timeout_iters: 60 + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 8 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 8 + max_num_tokens: 256 + max_seq_len: 9344 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx4_gen1_dep8_batch128_eplb0_mtp2_1088.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx4_gen1_dep8_batch128_eplb0_mtp2_1088.yaml new file mode 100644 index 000000000..359073927 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/mtp/ctx4_gen1_dep8_batch128_eplb0_mtp2_1088.yaml @@ -0,0 +1,131 @@ +name: ctx4_gen1_dep8_batch128_eplb0_mtp2_1088 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 4 + prefill_workers: 4 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + attention_dp_config: + batching_wait_iters: 0 + enable_balance: true + timeout_iters: 60 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 128 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 128 + max_num_tokens: 3072 + max_seq_len: 9344 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_dep8_batch128_eplb0_mtp0_128.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_dep8_batch128_eplb0_mtp0_128.yaml new file mode 100644 index 000000000..7a9a20391 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_dep8_batch128_eplb0_mtp0_128.yaml @@ -0,0 +1,121 @@ +name: ctx1_gen1_dep8_batch128_eplb0_mtp0_128 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + 
enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 128 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 128 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_256.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_256.yaml new file mode 100644 index 000000000..3f93f9140 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_256.yaml @@ -0,0 +1,121 @@ +name: ctx1_gen1_dep8_batch256_eplb0_mtp0_256 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + 
enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 256 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_tp8_batch1_eplb0_mtp0_1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_tp8_batch1_eplb0_mtp0_1.yaml new file mode 100644 index 000000000..ca1c1d60f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen1_tp8_batch1_eplb0_mtp0_1.yaml @@ -0,0 +1,123 @@ +name: ctx1_gen1_tp8_batch1_eplb0_mtp0_1 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + 
enable_padding: true + max_batch_size: 64 + disable_overlap_scheduler: true + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 64 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 1 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0_128.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0_128.yaml new file mode 100644 index 000000000..6b03210e3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0_128.yaml @@ -0,0 +1,121 @@ +name: ctx1_gen2_dep8_batch64_eplb0_mtp0_128 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 2 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: 
false + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 64 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 64 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_128.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_128.yaml new file mode 100644 index 000000000..38ed548da --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_128.yaml @@ -0,0 +1,122 @@ +name: ctx1_gen4_tp8_batch32_eplb0_mtp0_128 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 4 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: 
+ enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 32 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 32 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_32.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_32.yaml new file mode 100644 index 000000000..f086c23c0 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch32_eplb0_mtp0_32.yaml @@ -0,0 +1,122 @@ +name: ctx1_gen4_tp8_batch32_eplb0_mtp0_32 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 4 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + 
enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 8 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 8 + max_num_tokens: 8 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen6_tp8_batch16_eplb0_mtp0_96.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen6_tp8_batch16_eplb0_mtp0_96.yaml new file mode 100644 index 000000000..39f1bffd8 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx1_gen6_tp8_batch16_eplb0_mtp0_96.yaml @@ -0,0 +1,122 @@ +name: ctx1_gen6_tp8_batch16_eplb0_mtp0_96 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 6 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + 
enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 16 + disable_overlap_scheduler: false + enable_attention_dp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 16 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx2_gen1_dep8_batch640_eplb0_mtp0_640.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx2_gen1_dep8_batch640_eplb0_mtp0_640.yaml new file mode 100644 index 000000000..2b787d7f4 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b200-fp8/8k1k/disagg/stp/ctx2_gen1_dep8_batch640_eplb0_mtp0_640.yaml @@ -0,0 +1,121 @@ +name: ctx2_gen1_dep8_batch640_eplb0_mtp0_640 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b200" + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_RNDV_SCHEME: "put_zcopy" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + 
enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.2 + max_batch_size: 1 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: DEFAULT + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 640 + disable_overlap_scheduler: false + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 640 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep8_batch64_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep8_batch64_eplb0_mtp3.yaml new file mode 100644 index 000000000..554db4ec4 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep8_batch64_eplb0_mtp3.yaml @@ -0,0 +1,133 @@ +name: "ctx1_gen1_dep8_batch64_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + 
UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 256 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "10" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch16_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch16_eplb0_mtp3.yaml new file mode 100644 index 000000000..497739ac7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen2_dep8_batch16_eplb0_mtp3.yaml @@ -0,0 +1,131 @@ +name: "ctx1_gen2_dep8_batch16_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "18" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml new file mode 100644 index 000000000..0fbaeb745 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch1_eplb0_mtp3.yaml @@ -0,0 +1,129 @@ +name: "ctx1_gen5_tep8_batch1_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: 
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "42" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml new file mode 100644 index 000000000..2d9df253b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx1_gen5_tep8_batch32_eplb0_mtp3.yaml @@ -0,0 +1,145 @@ +name: "ctx1_gen5_tep8_batch32_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: 
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 5 + - 8 + - 10 + - 11 + - 12 + - 16 + - 18 + - 20 + - 22 + - 23 + - 24 + - 28 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "42" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx2_gen1_dep8_batch256_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx2_gen1_dep8_batch256_eplb0_mtp1.yaml new file mode 100644 index 000000000..c356b1b19 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx2_gen1_dep8_batch256_eplb0_mtp1.yaml @@ -0,0 +1,135 @@ +name: "ctx2_gen1_dep8_batch256_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 2 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "12" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx5_gen2_dep8_batch512_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx5_gen2_dep8_batch512_eplb0_mtp1.yaml new file mode 100644 index 000000000..5735ea337 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx5_gen2_dep8_batch512_eplb0_mtp1.yaml @@ -0,0 +1,136 @@ +name: "ctx5_gen2_dep8_batch512_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 2 + prefill_workers: 5 + gpus_per_prefill: 2 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 512 + max_num_tokens: 1024 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + - 512 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "26" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx5_gen2_dep8_batch768_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx5_gen2_dep8_batch768_eplb0_mtp1.yaml new file mode 100644 index 000000000..1eed2b318 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/mtp/ctx5_gen2_dep8_batch768_eplb0_mtp1.yaml @@ -0,0 +1,137 @@ +name: "ctx5_gen2_dep8_batch768_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 2 + prefill_workers: 5 + gpus_per_prefill: 2 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 768 + max_num_tokens: 1536 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + - 512 + - 768 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "26" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0.yaml new file mode 100644 index 000000000..7d11fb152 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen2_dep8_batch64_eplb0_mtp0.yaml @@ -0,0 +1,127 @@ +name: "ctx1_gen2_dep8_batch64_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "18" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..458ce824d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml @@ -0,0 +1,123 @@ +name: "ctx1_gen4_tep8_batch1_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 4 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: 
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "34" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml new file mode 100644 index 000000000..3e493c98e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml @@ -0,0 +1,127 @@ +name: "ctx1_gen5_tep4_batch4_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 5 + decode_nodes: 3 + gpus_per_decode: 4 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: 
"put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 4 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "4" + TOTAL_GPUS: "22" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch64_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch64_eplb0_mtp0.yaml new file mode 100644 index 000000000..adb4a8b79 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx1_gen5_tep8_batch64_eplb0_mtp0.yaml @@ -0,0 +1,142 @@ +name: "ctx1_gen5_tep8_batch64_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: 
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 5 + - 8 + - 10 + - 11 + - 12 + - 16 + - 18 + - 20 + - 22 + - 27 + - 32 + - 35 + - 39 + - 48 + - 56 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "42" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx2_gen1_dep8_batch512_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx2_gen1_dep8_batch512_eplb0_mtp0.yaml new file mode 100644 index 000000000..8bd76075a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx2_gen1_dep8_batch512_eplb0_mtp0.yaml @@ -0,0 +1,130 @@ +name: "ctx2_gen1_dep8_batch512_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 2 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 512 + max_num_tokens: 512 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + - 512 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "12" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx3_gen1_dep8_batch1024_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx3_gen1_dep8_batch1024_eplb0_mtp0.yaml new file mode 100644 index 000000000..76d4cd780 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx3_gen1_dep8_batch1024_eplb0_mtp0.yaml @@ -0,0 +1,135 @@ +name: "ctx3_gen1_dep8_batch1024_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 3 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1024 + max_num_tokens: 1024 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + - 512 + - 768 + - 832 + - 896 + - 960 + - 1024 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "14" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx3_gen2_dep8_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx3_gen2_dep8_batch256_eplb0_mtp0.yaml new file mode 100644 index 000000000..3c0692530 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/1k1k/disagg/stp/ctx3_gen2_dep8_batch256_eplb0_mtp0.yaml @@ -0,0 +1,129 @@ +name: "ctx3_gen2_dep8_batch256_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 3 + gpus_per_prefill: 2 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 8 + max_num_tokens: 10240 + max_seq_len: 1044 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2068 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "22" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml new file mode 100644 index 000000000..5f522818a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml @@ -0,0 +1,135 @@ +name: "ctx10_gen1_dep8_batch256_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 3 + prefill_workers: 10 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep4_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep4_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..41f443c22 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep4_batch8_eplb0_mtp3.yaml @@ -0,0 +1,133 @@ +name: "ctx1_gen4_tep4_batch8_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: 
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + gpus_per_node: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "4" + TOTAL_GPUS: "18" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml new file mode 100644 index 000000000..ff3bca726 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml @@ -0,0 +1,129 @@ +name: "ctx1_gen4_tep8_batch1_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 4 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: 
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "34" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml new file mode 100644 index 000000000..87c3c57b6 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml @@ -0,0 +1,132 @@ +name: "ctx1_gen4_tep8_batch4_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 4 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: 
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "34" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx3_gen1_dep8_batch16_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx3_gen1_dep8_batch16_eplb0_mtp3.yaml new file mode 100644 index 000000000..3f40345ca --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx3_gen1_dep8_batch16_eplb0_mtp3.yaml @@ -0,0 +1,131 @@ +name: "ctx3_gen1_dep8_batch16_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 3 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "14" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx9_gen1_dep8_batch128_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx9_gen1_dep8_batch128_eplb0_mtp1.yaml new file mode 100644 index 000000000..a52be413d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/mtp/ctx9_gen1_dep8_batch128_eplb0_mtp1.yaml @@ -0,0 +1,134 @@ +name: "ctx9_gen1_dep8_batch128_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 3 + prefill_workers: 9 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 128 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "26" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep4_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep4_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..f515e9aba --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep4_batch32_eplb0_mtp0.yaml @@ -0,0 +1,129 @@ +name: "ctx1_gen3_tep4_batch32_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 3 + decode_nodes: 2 + gpus_per_decode: 4 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: 
"put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "4" + TOTAL_GPUS: "14" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..7a167eb80 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml @@ -0,0 +1,127 @@ +name: "ctx1_gen3_tep8_batch16_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 3 + decode_nodes: 3 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: 
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "26" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..36a6268eb --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0.yaml @@ -0,0 +1,135 @@ +name: "ctx1_gen3_tep8_batch1_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 3 + decode_nodes: 3 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl:
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + - 512 + - 768 + - 1024 + - 2048 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "26" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen4_tep4_batch2_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen4_tep4_batch2_eplb0_mtp0.yaml new file mode 100644 index 000000000..d184a95d5 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx1_gen4_tep4_batch2_eplb0_mtp0.yaml @@ -0,0 +1,124 @@ +name: "ctx1_gen4_tep4_batch2_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: 
"tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 2 + max_num_tokens: 2 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "4" + TOTAL_GPUS: "18" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx5_gen2_dep8_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx5_gen2_dep8_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..bacd57645 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx5_gen2_dep8_batch32_eplb0_mtp0.yaml @@ -0,0 +1,126 @@ +name: "ctx5_gen2_dep8_batch32_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 2 + prefill_workers: 5 + gpus_per_prefill: 2 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + 
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "26" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx6_gen1_dep8_batch128_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx6_gen1_dep8_batch128_eplb0_mtp0.yaml new file mode 100644 index 000000000..923b32c05 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx6_gen1_dep8_batch128_eplb0_mtp0.yaml @@ -0,0 +1,133 @@ +name: "ctx6_gen1_dep8_batch128_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 2 + prefill_workers: 6 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" +
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + - 512 + - 768 + - 1024 + - 2048 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "20" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx8_gen1_dep8_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx8_gen1_dep8_batch256_eplb0_mtp0.yaml new file mode 100644 index 000000000..1173417cc --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp4/8k1k/disagg/stp/ctx8_gen1_dep8_batch256_eplb0_mtp0.yaml @@ -0,0 +1,133 @@ +name: "ctx8_gen1_dep8_batch256_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "b300" + prefill_nodes: 2 + prefill_workers: 8 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" +
OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16896 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 32 + - 64 + - 128 + - 256 + - 512 + - 768 + - 1024 + - 2048 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8448 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml new file mode 100644 index 000000000..9e1da3cf3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml @@ -0,0 +1,139 @@ +name: ctx1_gen1_dp8_batch256_eplb0_mtp1_3072 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME:
"put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 256 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 256 + max_num_tokens: 2100 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "12" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen2_dep8_batch128_eplb0_mtp1_2560.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen2_dep8_batch128_eplb0_mtp1_2560.yaml new file mode 100644 index 000000000..d1ccc8b44 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen2_dep8_batch128_eplb0_mtp1_2560.yaml @@ -0,0 +1,139 @@ +name: ctx1_gen2_dep8_batch128_eplb0_mtp1_2560 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" +
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 128 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 128 + max_num_tokens: 1100 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "20" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp2_720.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp2_720.yaml new file mode 100644 index 000000000..74802bbc7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp2_720.yaml @@ -0,0 +1,139 @@ +name: ctx1_gen5_dep8_batch16_eplb0_mtp2_720 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" +
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 16 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 16 + max_num_tokens: 180 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "44" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch16_eplb0_mtp3_160.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch16_eplb0_mtp3_160.yaml new file mode 100644 index 000000000..4a09efd68 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch16_eplb0_mtp3_160.yaml @@ -0,0 +1,140 @@ +name: ctx1_gen8_tp8_batch16_eplb0_mtp3_160 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 8 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" +
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 16 + disable_overlap_scheduler: false + enable_attention_dp: false + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 16 + max_num_tokens: 384 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "68" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_10.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_10.yaml new file mode 100644 index 000000000..a6cbb9b66 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_10.yaml @@ -0,0 +1,140 @@ +name: ctx1_gen8_tp8_batch1_eplb0_mtp3_10 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 8 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" +
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 1 + disable_overlap_scheduler: false + enable_attention_dp: false + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "68" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx3_gen2_dp8_batch512_eplb0_mtp1_11264.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx3_gen2_dp8_batch512_eplb0_mtp1_11264.yaml new file mode 100644 index 000000000..7ccdfa4af --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/mtp/ctx3_gen2_dp8_batch512_eplb0_mtp1_11264.yaml @@ -0,0 +1,139 @@ +name: ctx3_gen2_dp8_batch512_eplb0_mtp1_11264 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" +
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 512 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 512 + max_num_tokens: 4200 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_2112.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_2112.yaml new file mode 100644 index 000000000..fa0675ade --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch256_eplb0_mtp0_2112.yaml @@ -0,0 +1,133 @@ +name: ctx1_gen1_dep8_batch256_eplb0_mtp0_2112 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + 
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 256 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 256 + max_num_tokens: 2048 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "12" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen2_dp8_batch128_eplb0_mtp0_3072.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen2_dp8_batch128_eplb0_mtp0_3072.yaml new file mode 100644 index 000000000..121844730 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen2_dp8_batch128_eplb0_mtp0_3072.yaml @@ -0,0 +1,133 @@ +name: ctx1_gen2_dp8_batch128_eplb0_mtp0_3072 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + 
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 128 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 128 + max_num_tokens: 1024 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "20" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen3_dp8_batch48_eplb0_mtp0_1280.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen3_dp8_batch48_eplb0_mtp0_1280.yaml new file mode 100644 index 000000000..7a7b2e1fe --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen3_dp8_batch48_eplb0_mtp0_1280.yaml @@ -0,0 +1,133 @@ +name: ctx1_gen3_dp8_batch48_eplb0_mtp0_1280 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 3 + decode_nodes: 3 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + 
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 48 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 48 + max_num_tokens: 384 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_12.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_12.yaml new file mode 100644 index 000000000..0e75f3747 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_12.yaml @@ -0,0 +1,134 @@ +name: ctx1_gen8_tp8_batch64_eplb0_mtp0_12 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 8 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + 
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 1 + disable_overlap_scheduler: false + enable_attention_dp: false + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "68" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_128.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_128.yaml new file mode 100644 index 000000000..384ef6e0c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_128.yaml @@ -0,0 +1,134 @@ +name: ctx1_gen8_tp8_batch64_eplb0_mtp0_128 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 8 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + 
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 64 + disable_overlap_scheduler: false + enable_attention_dp: false + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "68" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_384.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_384.yaml new file mode 100644 index 000000000..5fb7781d4 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx1_gen8_tp8_batch64_eplb0_mtp0_384.yaml @@ -0,0 +1,134 @@ +name: ctx1_gen8_tp8_batch64_eplb0_mtp0_384 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 8 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + 
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 1280 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 1280 + cuda_graph_config: + enable_padding: true + max_batch_size: 64 + disable_overlap_scheduler: false + enable_attention_dp: false + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2400 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "68"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx2_gen1_dp8_batch1024_eplb0_mtp0_16384.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx2_gen1_dp8_batch1024_eplb0_mtp0_16384.yaml
new file mode 100644
index 000000000..364b538d6
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/1k1k/disagg/stp/ctx2_gen1_dp8_batch1024_eplb0_mtp0_16384.yaml
@@ -0,0 +1,133 @@
+name: ctx2_gen1_dp8_batch1024_eplb0_mtp0_16384
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 1
+  prefill_workers: 2
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 1
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 1280
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 1280
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 1280
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 1024
+      disable_overlap_scheduler: false
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 1024
+      max_num_tokens: 8192
+      max_seq_len: 2400
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "16"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen1_dp8_batch8_eplb0_mtp3_72.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen1_dp8_batch8_eplb0_mtp3_72.yaml
new file mode 100644
index 000000000..1039c9e2c
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen1_dp8_batch8_eplb0_mtp3_72.yaml
@@ -0,0 +1,139 @@
+name: ctx1_gen1_dp8_batch8_eplb0_mtp3_72
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 1
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 8320
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 8
+      disable_overlap_scheduler: false
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 8
+      max_num_tokens: 90
+      max_seq_len: 9344
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "12"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen2_tp8_batch16_eplb0_mtp3_40.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen2_tp8_batch16_eplb0_mtp3_40.yaml
new file mode 100644
index 000000000..89a1abdd3
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen2_tp8_batch16_eplb0_mtp3_40.yaml
@@ -0,0 +1,140 @@
+name: ctx1_gen2_tp8_batch16_eplb0_mtp3_40
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 2
+  decode_nodes: 2
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 8320
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 16
+      disable_overlap_scheduler: false
+      enable_attention_dp: false
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 16
+      max_num_tokens: 80
+      max_seq_len: 9344
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "20"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch1_eplb0_mtp3_8.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch1_eplb0_mtp3_8.yaml
new file mode 100644
index 000000000..87ad50002
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch1_eplb0_mtp3_8.yaml
@@ -0,0 +1,140 @@
+name: ctx1_gen4_tp8_batch1_eplb0_mtp3_8
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 4
+  decode_nodes: 4
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 8320
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 1
+      disable_overlap_scheduler: false
+      enable_attention_dp: false
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 1
+      max_num_tokens: 4
+      max_seq_len: 9344
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "36"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch4_eplb0_mtp3_20.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch4_eplb0_mtp3_20.yaml
new file mode 100644
index 000000000..4edbcf88d
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch4_eplb0_mtp3_20.yaml
@@ -0,0 +1,140 @@
+name: ctx1_gen4_tp8_batch4_eplb0_mtp3_20
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 4
+  decode_nodes: 4
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 8320
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 4
+      disable_overlap_scheduler: false
+      enable_attention_dp: false
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 4
+      max_num_tokens: 20
+      max_seq_len: 9344
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "36"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx2_gen1_dp8_batch16_eplb0_mtp3_144.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx2_gen1_dp8_batch16_eplb0_mtp3_144.yaml
new file mode 100644
index 000000000..7eba0cdd6
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx2_gen1_dp8_batch16_eplb0_mtp3_144.yaml
@@ -0,0 +1,139 @@
+name: ctx2_gen1_dp8_batch16_eplb0_mtp3_144
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 1
+  prefill_workers: 2
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 1
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 8320
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 16
+      disable_overlap_scheduler: false
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 16
+      max_num_tokens: 180
+      max_seq_len: 9344
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "16"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx4_gen1_dp8_batch64_eplb0_mtp2_512.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx4_gen1_dp8_batch64_eplb0_mtp2_512.yaml
new file mode 100644
index 000000000..555ec7688
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/mtp/ctx4_gen1_dp8_batch64_eplb0_mtp2_512.yaml
@@ -0,0 +1,139 @@
+name: ctx4_gen1_dp8_batch64_eplb0_mtp2_512
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 2
+  prefill_workers: 4
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 1
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 8320
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 64
+      disable_overlap_scheduler: false
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 64
+      max_num_tokens: 650
+      max_seq_len: 9344
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "24"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch16_eplb0_mtp0_64.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch16_eplb0_mtp0_64.yaml
new file mode 100644
index 000000000..8c9160c66
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx1_gen4_tp8_batch16_eplb0_mtp0_64.yaml
@@ -0,0 +1,134 @@
+name: ctx1_gen4_tp8_batch16_eplb0_mtp0_64
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 4
+  decode_nodes: 4
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 8320
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 16
+      disable_overlap_scheduler: false
+      enable_attention_dp: false
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 16
+      max_num_tokens: 512
+      max_seq_len: 9344
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "36"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx1_gen8_tp8_batch2_eplb0_mtp0_16.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx1_gen8_tp8_batch2_eplb0_mtp0_16.yaml
new file mode 100644
index 000000000..54de6c71f
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx1_gen8_tp8_batch2_eplb0_mtp0_16.yaml
@@ -0,0 +1,134 @@
+name: ctx1_gen8_tp8_batch2_eplb0_mtp0_16
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 8
+  decode_nodes: 8
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 8320
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 1
+      disable_overlap_scheduler: false
+      enable_attention_dp: false
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 1
+      max_num_tokens: 1
+      max_seq_len: 9344
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "4"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "68"
+
+frontend:
+  type: "dynamo"
+
+  enable_multiple_frontends: false
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
\ No newline at end of file
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx2_gen1_dp8_batch32_eplb0_mtp0_256.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx2_gen1_dp8_batch32_eplb0_mtp0_256.yaml
new file mode 100644
index 000000000..4e7808183
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx2_gen1_dp8_batch32_eplb0_mtp0_256.yaml
@@ -0,0 +1,133 @@
+name: ctx2_gen1_dp8_batch32_eplb0_mtp0_256
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "b300"
+  prefill_nodes: 1
+  prefill_workers: 2
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 1
+  gpus_per_decode: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  decode_environment:
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    OMPI_MCA_coll_ucc_enable: "0"
+    TLLM_ALL_RANK_LOG: "1"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    UCX_CUDA_IPC_ENABLE_MNNVL: "n"
+    UCX_MAX_RMA_RAILS: "1"
+    UCX_MAX_RNDV_RAILS: "1"
+    UCX_RNDV_SCHEME: "put_zcopy"
+    OMPI_MCA_btl: "tcp,self"
+    OMPI_MCA_pml: "ob1"
+    TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1"
+
+  trtllm_config:
+    prefill:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: false
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      max_batch_size: 8
+      max_num_tokens: 8320
+      max_seq_len: 8320
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      tensor_parallel_size: 4
+
+
+    decode:
+      allreduce_strategy: AUTO
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8320
+      cuda_graph_config:
+        enable_padding: true
+        max_batch_size: 32
+      disable_overlap_scheduler: false
+      enable_attention_dp: true
+      enable_iter_perf_stats: false
+      enable_iter_req_stats: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 32
+      max_num_tokens: 512
+      max_seq_len: 9344
+      moe_config:
+        backend: TRTLLM
+      moe_expert_parallel_size: 1
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      stream_interval: 20
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx3_gen1_dp8_batch64_eplb0_mtp0_512.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx3_gen1_dp8_batch64_eplb0_mtp0_512.yaml new file mode 100644 index 000000000..6d6573b24 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx3_gen1_dp8_batch64_eplb0_mtp0_512.yaml @@ -0,0 +1,133 @@ +name: ctx3_gen1_dp8_batch64_eplb0_mtp0_512 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: 
"n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 64 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 64 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "20" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx3_gen5_tp8_batch64_eplb0_mtp0_256.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx3_gen5_tp8_batch64_eplb0_mtp0_256.yaml new file mode 100644 index 000000000..dd915b01d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx3_gen5_tp8_batch64_eplb0_mtp0_256.yaml @@ -0,0 +1,134 @@ +name: ctx3_gen5_tp8_batch64_eplb0_mtp0_256 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + 
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 64 + disable_overlap_scheduler: false + enable_attention_dp: false + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 64 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "52" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx5_gen1_dp8_batch128_eplb0_mtp0_1075.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx5_gen1_dp8_batch128_eplb0_mtp0_1075.yaml new file mode 100644 index 000000000..1e0375787 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx5_gen1_dp8_batch128_eplb0_mtp0_1075.yaml @@ -0,0 +1,133 @@ +name: ctx5_gen1_dp8_batch128_eplb0_mtp0_1075 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 3 + prefill_workers: 5 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + 
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 128 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 128 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx7_gen1_dep8_batch384_eplb0_mtp0_3072.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx7_gen1_dep8_batch384_eplb0_mtp0_3072.yaml new file mode 100644 index 000000000..eb6170f6a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/8k1k/disagg/stp/ctx7_gen1_dep8_batch384_eplb0_mtp0_3072.yaml @@ -0,0 +1,133 @@ +name: ctx7_gen1_dep8_batch384_eplb0_mtp0_3072 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "b300" + prefill_nodes: 4 + prefill_workers: 7 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 1 + gpus_per_decode: 8 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + decode_environment: + NCCL_GRAPH_MIXING_SUPPORT: "0" + OMPI_MCA_coll_ucc_enable: "0" + TLLM_ALL_RANK_LOG: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + 
UCX_CUDA_IPC_ENABLE_MNNVL: "n" + UCX_MAX_RMA_RAILS: "1" + UCX_MAX_RNDV_RAILS: "1" + UCX_RNDV_SCHEME: "put_zcopy" + OMPI_MCA_btl: "tcp,self" + OMPI_MCA_pml: "ob1" + TRTLLM_UCX_INTERFACE: "mlx5_0:1,mlx5_1:1,mlx5_10:1,mlx5_11:1,mlx5_16:1,mlx5_17:1,mlx5_20:1,mlx5_21:1,mlx5_22:1,mlx5_23:1,mlx5_4:1,mlx5_5:1,mlx5_8:1,mlx5_9:1,mlx5_2:1,mlx5_3:1" + + trtllm_config: + prefill: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: false + disable_overlap_scheduler: true + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + max_batch_size: 8 + max_num_tokens: 8320 + max_seq_len: 8320 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 1 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: AUTO + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8320 + cuda_graph_config: + enable_padding: true + max_batch_size: 384 + disable_overlap_scheduler: false + enable_attention_dp: true + enable_iter_perf_stats: false + enable_iter_req_stats: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 384 + max_num_tokens: 512 + max_seq_len: 9344 + moe_config: + backend: TRTLLM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 20 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false \ No newline at end of file diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3.yaml new file mode 100644 index 000000000..f6cb09bbc --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3.yaml @@ -0,0 +1,123 @@ +name: "ctx1_gen1_dep32_batch4_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 
16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..aa711f76c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml @@ -0,0 +1,127 @@ +name: "ctx1_gen4_tep8_batch8_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: 
MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 5 + - 6 + - 7 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx2_gen1_dep16_batch256_eplb256_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx2_gen1_dep16_batch256_eplb256_mtp1.yaml new file mode 100644 index 000000000..50a8aa6c4 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx2_gen1_dep16_batch256_eplb256_mtp1.yaml @@ -0,0 +1,158 @@ +name: "ctx2_gen1_dep16_batch256_eplb256_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 2 + + decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + num_slots: 256 + layer_updates_per_iter: 1 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx3_gen1_dep32_batch64_eplb288_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx3_gen1_dep32_batch64_eplb288_mtp1.yaml new file mode 100644 index 000000000..53fae254f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx3_gen1_dep32_batch64_eplb288_mtp1.yaml @@ -0,0 +1,134 @@ +name: "ctx3_gen1_dep32_batch64_eplb288_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 3 + prefill_workers: 3 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: 
+ decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 128 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + num_slots: 288 + layer_updates_per_iter: 1 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "44" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx3_gen5_dep4_batch768_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx3_gen5_dep4_batch768_eplb0_mtp1.yaml new file mode 100644 index 000000000..507a15f85 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/mtp/ctx3_gen5_dep4_batch768_eplb0_mtp1.yaml @@ -0,0 +1,219 @@ +name: "ctx3_gen5_dep4_batch768_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 3 + prefill_workers: 3 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + 
decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 768 + max_num_tokens: 1536 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + - 520 + - 528 + - 536 + - 544 + - 552 + - 560 + - 568 + - 576 + - 584 + - 592 + - 600 + - 608 + - 616 + - 624 + - 632 + - 640 + - 648 + - 656 + - 664 + - 672 + - 680 + - 688 + - 696 + - 704 + - 712 + - 720 + - 728 + - 736 + - 744 + - 752 + - 760 + - 768 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen1_dep32_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen1_dep32_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..24294befe --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen1_dep32_batch16_eplb0_mtp0.yaml @@ -0,0 +1,119 @@ +name: "ctx1_gen1_dep32_batch16_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + 
tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml new file mode 100644 index 000000000..67fd9d9a4 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0.yaml @@ -0,0 +1,181 @@ +name: "ctx1_gen1_dep8_batch512_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 1 + decode_nodes: 2 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + 
TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 512 + max_num_tokens: 512 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + 
stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "12" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen2_dep4_batch768_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen2_dep4_batch768_eplb0_mtp0.yaml new file mode 100644 index 000000000..57be7c35e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen2_dep4_batch768_eplb0_mtp0.yaml @@ -0,0 +1,213 @@ +name: "ctx1_gen2_dep4_batch768_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 2 + decode_nodes: 2 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - 
cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 768 + max_num_tokens: 768 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + - 520 + - 528 + - 536 + - 544 + - 552 + - 560 + - 568 + - 576 + - 584 + - 592 + - 600 + - 608 + - 616 + - 624 + - 632 + - 640 + - 648 + - 656 + - 664 + - 672 + - 680 + - 688 + - 696 + - 704 + - 712 + - 720 + - 728 + - 736 + - 744 + - 752 + - 760 + - 768 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "12" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..e8794eae8 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml @@ -0,0 +1,116 @@ +name: "ctx1_gen4_tep8_batch1_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 8 + 
moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..e9d59aaab --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml @@ -0,0 +1,131 @@ +name: "ctx1_gen4_tep8_batch32_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + 
TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 5 + - 6 + - 8 + - 9 + - 10 + - 11 + - 12 + - 16 + - 22 + - 23 + - 24 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2_gen1_dep16_batch256_eplb256_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2_gen1_dep16_batch256_eplb256_mtp0.yaml new file mode 100644 index 000000000..c752a5600 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2_gen1_dep16_batch256_eplb256_mtp0.yaml @@ -0,0 +1,152 @@ +name: "ctx2_gen1_dep16_batch256_eplb256_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 2 + + decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + 
tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + num_slots: 256 + layer_updates_per_iter: 1 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2_gen1_dep32_batch64_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2_gen1_dep32_batch64_eplb0_mtp0.yaml new file mode 100644 index 000000000..118580aa9 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2_gen1_dep32_batch64_eplb0_mtp0.yaml @@ -0,0 +1,125 @@ +name: "ctx2_gen1_dep32_batch64_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 2 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + 
tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx11_gen1_dep16_batch256_eplb256_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx11_gen1_dep16_batch256_eplb256_mtp1.yaml new file mode 100644 index 000000000..0ccf95443 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx11_gen1_dep16_batch256_eplb256_mtp1.yaml @@ -0,0 +1,158 @@ +name: "ctx11_gen1_dep16_batch256_eplb256_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 11 + prefill_workers: 11 + + decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: 
+ type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + num_slots: 256 + layer_updates_per_iter: 1 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + 
decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "60" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..2854854f2 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml @@ -0,0 +1,129 @@ +name: "ctx1_gen4_tep8_batch8_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - 
cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 5 + - 6 + - 7 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx3_gen1_dep32_batch4_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx3_gen1_dep32_batch4_eplb0_mtp3.yaml new file mode 100644 index 000000000..bddcf060e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx3_gen1_dep32_batch4_eplb0_mtp3.yaml @@ -0,0 +1,123 @@ +name: "ctx3_gen1_dep32_batch4_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 3 + prefill_workers: 3 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + 
decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "44" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx7_gen1_dep16_batch64_eplb256_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx7_gen1_dep16_batch64_eplb256_mtp1.yaml new file mode 100644 index 000000000..eb101a191 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx7_gen1_dep16_batch64_eplb256_mtp1.yaml @@ -0,0 +1,134 @@ +name: "ctx7_gen1_dep16_batch64_eplb256_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 7 + prefill_workers: 7 + 
+ decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + num_slots: 256 + layer_updates_per_iter: 1 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# 
/infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "44" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx8_gen1_dep32_batch16_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx8_gen1_dep32_batch16_eplb0_mtp3.yaml new file mode 100644 index 000000000..3bf47d0a8 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/mtp/ctx8_gen1_dep32_batch16_eplb0_mtp3.yaml @@ -0,0 +1,125 @@ +name: "ctx8_gen1_dep32_batch16_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 8 + prefill_workers: 8 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx10_gen1_dep16_batch256_eplb256_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx10_gen1_dep16_batch256_eplb256_mtp0.yaml new file mode 100644 index 000000000..7cfee6b2e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx10_gen1_dep16_batch256_eplb256_mtp0.yaml @@ -0,0 +1,152 @@ +name: "ctx10_gen1_dep16_batch256_eplb256_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 10 + prefill_workers: 10 + + decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + 
tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + num_slots: 256 + layer_updates_per_iter: 1 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..a7e491533 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0.yaml @@ -0,0 +1,127 @@ +name: "ctx1_gen4_tep8_batch16_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 
8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + - 6 + - 8 + - 9 + - 10 + - 11 + - 14 + - 15 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..fa6483998 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml @@ -0,0 +1,118 @@ +name: "ctx1_gen4_tep8_batch1_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + 
prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0.yaml new file mode 100644 index 000000000..c0d6dc3f3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0.yaml @@ -0,0 +1,118 @@ +name: "ctx2_gen1_dep32_batch8_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 2 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 
32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 8 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx7_gen1_dep32_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx7_gen1_dep32_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..b78f93a10 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx7_gen1_dep32_batch32_eplb0_mtp0.yaml @@ -0,0 +1,121 @@ +name: "ctx7_gen1_dep32_batch32_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 7 + prefill_workers: 7 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + 
TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "60" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx8_gen1_dep16_batch128_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx8_gen1_dep16_batch128_eplb0_mtp0.yaml new file mode 100644 index 000000000..080186d0f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp4/8k1k/disagg/stp/ctx8_gen1_dep16_batch128_eplb0_mtp0.yaml @@ -0,0 +1,133 @@ +name: "ctx8_gen1_dep16_batch128_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb200" + prefill_nodes: 8 + prefill_workers: 8 + + decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + 
tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "48" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1_1229.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1_1229.yaml new file mode 100644 index 000000000..6ea81b176 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1_1229.yaml @@ -0,0 +1,133 @@ +name: ctx1_gen1_dep16_batch64_eplb0_mtp1_1229 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + 
gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 64 + max_num_tokens: 128 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# 
/infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "16" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep32_batch16_eplb0_mtp3_615.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep32_batch16_eplb0_mtp3_615.yaml new file mode 100644 index 000000000..8e5f86356 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep32_batch16_eplb0_mtp3_615.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen1_dep32_batch16_eplb0_mtp3_615 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "32" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep8_batch256_eplb0_mtp1_2151.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep8_batch256_eplb0_mtp1_2151.yaml new file mode 100644 index 000000000..a96a862ef --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep8_batch256_eplb0_mtp1_2151.yaml @@ -0,0 +1,157 @@ +name: ctx1_gen1_dep8_batch256_eplb0_mtp1_2151 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 
+ max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep8_batch512_eplb0_mtp1_4301.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep8_batch512_eplb0_mtp1_4301.yaml new file mode 100644 index 000000000..449ca1d85 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen1_dep8_batch512_eplb0_mtp1_4301.yaml @@ -0,0 +1,189 @@ +name: ctx1_gen1_dep8_batch512_eplb0_mtp1_4301 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + 
max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 512 + max_num_tokens: 1024 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_9.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_9.yaml new file mode 100644 index 000000000..e6f72bd07 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_9.yaml @@ -0,0 +1,126 @@ +name: ctx1_gen3_tep8_batch2_eplb0_mtp3_9 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + max_num_tokens: 
16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 2 + max_num_tokens: 8 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_18.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_18.yaml new file mode 100644 index 000000000..519f5da0c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_18.yaml @@ -0,0 +1,126 @@ +name: ctx1_gen3_tep8_batch4_eplb0_mtp3_18 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + max_num_tokens: 
16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3_36.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3_36.yaml new file mode 100644 index 000000000..23c1180d5 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3_36.yaml @@ -0,0 +1,127 @@ +name: ctx1_gen3_tep8_batch8_eplb0_mtp3_36 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + max_num_tokens: 
16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep16_batch128_eplb0_mtp0_2151.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep16_batch128_eplb0_mtp0_2151.yaml new file mode 100644 index 000000000..868c65032 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep16_batch128_eplb0_mtp0_2151.yaml @@ -0,0 +1,135 @@ +name: ctx1_gen1_dep16_batch128_eplb0_mtp0_2151 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 
16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "16" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep32_batch32_eplb0_mtp0_1127.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep32_batch32_eplb0_mtp0_1127.yaml new file mode 100644 index 000000000..64f1004f5 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep32_batch32_eplb0_mtp0_1127.yaml @@ -0,0 +1,123 @@ +name: ctx1_gen1_dep32_batch32_eplb0_mtp0_1127 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 
+ max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "32" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep32_batch8_eplb0_mtp0_256.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep32_batch8_eplb0_mtp0_256.yaml new file mode 100644 index 000000000..05f3d0763 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep32_batch8_eplb0_mtp0_256.yaml @@ -0,0 +1,120 @@ +name: ctx1_gen1_dep32_batch8_eplb0_mtp0_256 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + 
+resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 8 + max_num_tokens: 8 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "32" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4301.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4301.yaml new file mode 100644 index 000000000..5fcaf989c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch512_eplb0_mtp0_4301.yaml @@ -0,0 +1,183 @@ +name: ctx1_gen1_dep8_batch512_eplb0_mtp0_4301 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 
+ max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 512 + max_num_tokens: 512 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch768_eplb0_mtp0_6144.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch768_eplb0_mtp0_6144.yaml new file mode 100644 index 000000000..5f54ed0f7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen1_dep8_batch768_eplb0_mtp0_6144.yaml @@ -0,0 +1,215 @@ +name: ctx1_gen1_dep8_batch768_eplb0_mtp0_6144 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + 
max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + - 520 + - 528 + - 536 + - 544 + - 552 + - 560 + - 568 + - 576 + - 584 + - 592 + - 600 + - 608 + - 616 + - 624 + - 632 + - 640 + - 648 + - 656 + - 664 + - 672 + - 680 + - 688 + - 696 + - 704 + - 712 + - 720 + - 728 + - 736 + - 744 + - 752 + - 760 + - 768 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 768 + max_num_tokens: 768 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_3.yaml new file mode 100644 index 000000000..801c5214a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_3.yaml @@ -0,0 +1,120 @@ +name: ctx1_gen3_tep8_batch1_eplb0_mtp0_3 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + max_num_tokens: 
16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen3_tep8_batch8_eplb0_mtp0_27.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen3_tep8_batch8_eplb0_mtp0_27.yaml new file mode 100644 index 000000000..9c57a2897 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/1k1k/disagg/stp/ctx1_gen3_tep8_batch8_eplb0_mtp0_27.yaml @@ -0,0 +1,121 @@ +name: ctx1_gen3_tep8_batch8_eplb0_mtp0_27 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + 
prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.5 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 8 + max_num_tokens: 8 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_6.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_6.yaml new file mode 100644 index 000000000..12632ffd1 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx1_gen3_tep8_batch2_eplb0_mtp3_6.yaml @@ -0,0 +1,126 @@ +name: ctx1_gen3_tep8_batch2_eplb0_mtp3_6 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + max_num_tokens: 
16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 2 + max_num_tokens: 8 + max_seq_len: 9256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_15.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_15.yaml new file mode 100644 index 000000000..a80c790f9 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx1_gen3_tep8_batch4_eplb0_mtp3_15.yaml @@ -0,0 +1,126 @@ +name: ctx1_gen3_tep8_batch4_eplb0_mtp3_15 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + max_num_tokens: 
16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 9256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx2_gen1_dep32_batch2_eplb0_mtp3_90.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx2_gen1_dep32_batch2_eplb0_mtp3_90.yaml new file mode 100644 index 000000000..1f108d424 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx2_gen1_dep32_batch2_eplb0_mtp3_90.yaml @@ -0,0 +1,125 @@ +name: ctx2_gen1_dep32_batch2_eplb0_mtp3_90 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 4 + prefill_workers: 2 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + 
max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 2 + max_num_tokens: 8 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "32" + TOTAL_GPUS: "48" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx3_gen1_dep16_batch16_eplb0_mtp3_333.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx3_gen1_dep16_batch16_eplb0_mtp3_333.yaml new file mode 100644 index 000000000..08f63213f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx3_gen1_dep16_batch16_eplb0_mtp3_333.yaml @@ -0,0 +1,127 @@ +name: ctx3_gen1_dep16_batch16_eplb0_mtp3_333 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 6 + prefill_workers: 3 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + 
max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "16" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3_666.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3_666.yaml new file mode 100644 index 000000000..982765ae5 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx3_gen1_dep8_batch64_eplb0_mtp3_666.yaml @@ -0,0 +1,133 @@ +name: ctx3_gen1_dep8_batch64_eplb0_mtp3_666 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 6 + prefill_workers: 3 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + 
max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 64 + max_num_tokens: 256 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx4_gen1_dep32_batch8_eplb0_mtp3_333.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx4_gen1_dep32_batch8_eplb0_mtp3_333.yaml new file mode 100644 index 000000000..6b286ce2e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx4_gen1_dep32_batch8_eplb0_mtp3_333.yaml @@ -0,0 +1,126 @@ +name: ctx4_gen1_dep32_batch8_eplb0_mtp3_333 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 8 + prefill_workers: 4 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + 
max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "32" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx5_gen1_dep16_batch32_eplb0_mtp3_666.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx5_gen1_dep16_batch32_eplb0_mtp3_666.yaml new file mode 100644 index 000000000..9bc424961 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/mtp/ctx5_gen1_dep16_batch32_eplb0_mtp3_666.yaml @@ -0,0 +1,129 @@ +name: ctx5_gen1_dep16_batch32_eplb0_mtp3_666 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 10 + prefill_workers: 5 + gpus_per_prefill: 8 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + 
max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 8 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "16" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0_63.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0_63.yaml new file mode 100644 index 000000000..0430ce4b1 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0_63.yaml @@ -0,0 +1,122 @@ +name: ctx1_gen3_tep8_batch16_eplb0_mtp0_63 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + 
max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_6.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_6.yaml new file mode 100644 index 000000000..d1b526a07 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch1_eplb0_mtp0_6.yaml @@ -0,0 +1,120 @@ +name: ctx1_gen3_tep8_batch1_eplb0_mtp0_6 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + 
+resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 8 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + nginx_container: "nginx-sqsh" + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch4_eplb0_mtp0_18.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch4_eplb0_mtp0_18.yaml new file mode 100644 index 000000000..fdf1e856c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx1_gen3_tep8_batch4_eplb0_mtp0_18.yaml @@ -0,0 +1,120 @@ +name: ctx1_gen3_tep8_batch4_eplb0_mtp0_18 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb200" + prefill_nodes: 2 + prefill_workers: 1 + gpus_per_prefill: 8 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + max_batch_size: 2 + max_num_tokens: 
16384
+      max_seq_len: 8232
+      moe_config:
+        backend: DEEPGEMM
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      tensor_parallel_size: 8
+
+
+    decode:
+      allreduce_strategy: MNNVL
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      cuda_graph_config:
+        batch_sizes:
+        - 1
+        - 2
+        - 4
+        enable_padding: true
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+      max_batch_size: 4
+      max_num_tokens: 4
+      max_seq_len: 9256
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      moe_expert_parallel_size: 8
+      num_postprocess_workers: 4
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      stream_interval: 100
+      tensor_parallel_size: 8
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "32"
+
+frontend:
+  type: "dynamo"
+  nginx_container: "nginx-sqsh"
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0_333.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0_333.yaml
new file mode 100644
index 000000000..2dffe83f1
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx2_gen1_dep32_batch8_eplb0_mtp0_333.yaml
@@ -0,0 +1,120 @@
+name: ctx2_gen1_dep32_batch8_eplb0_mtp0_333
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 4
+  prefill_workers: 2
+  gpus_per_prefill: 8
+
+  decode_workers: 1
+  decode_nodes: 8
+  gpus_per_decode: 32
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    TRTLLM_ENABLE_PDL: "1"
+    ENROOT_ALLOW_DEV: "yes"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    TRTLLM_ENABLE_PDL: "1"
+    ENROOT_ALLOW_DEV: "yes"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED"
+    ENABLE_CONFIGURABLE_MOE: "1"
+
+  trtllm_config:
+    prefill:
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.4
+      max_batch_size: 2
+      max_num_tokens: 16384
+      max_seq_len: 8232
+      moe_config:
+        backend: DEEPGEMM
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      tensor_parallel_size: 8
+
+
+    decode:
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      cuda_graph_config:
+        batch_sizes:
+        - 1
+        - 2
+        - 4
+        - 8
+        enable_padding: true
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.75
+      max_batch_size: 8
+      max_num_tokens: 8
+      max_seq_len: 9256
+      moe_config:
+        backend: DEEPGEMM
+        use_low_precision_moe_combine: true
+      moe_expert_parallel_size: 32
+      num_postprocess_workers: 4
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      stream_interval: 100
+      tensor_parallel_size: 32
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "32"
+    TOTAL_GPUS: "48"
+
+frontend:
+  type: "dynamo"
+  nginx_container: "nginx-sqsh"
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx3_gen1_dep16_batch32_eplb0_mtp0_615.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx3_gen1_dep16_batch32_eplb0_mtp0_615.yaml
new file mode 100644
index 000000000..ba7c6142f
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx3_gen1_dep16_batch32_eplb0_mtp0_615.yaml
@@ -0,0 +1,123 @@
+name: ctx3_gen1_dep16_batch32_eplb0_mtp0_615
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 6
+  prefill_workers: 3
+  gpus_per_prefill: 8
+
+  decode_workers: 1
+  decode_nodes: 4
+  gpus_per_decode: 16
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    TRTLLM_ENABLE_PDL: "1"
+    ENROOT_ALLOW_DEV: "yes"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    TRTLLM_ENABLE_PDL: "1"
+    ENROOT_ALLOW_DEV: "yes"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED"
+    ENABLE_CONFIGURABLE_MOE: "1"
+
+  trtllm_config:
+    prefill:
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.4
+      max_batch_size: 2
+      max_num_tokens: 16384
+      max_seq_len: 8232
+      moe_config:
+        backend: DEEPGEMM
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      tensor_parallel_size: 8
+
+
+    decode:
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      cuda_graph_config:
+        batch_sizes:
+        - 1
+        - 2
+        - 4
+        - 8
+        - 16
+        - 24
+        - 32
+        enable_padding: true
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 32
+      max_num_tokens: 32
+      max_seq_len: 9256
+      moe_config:
+        backend: DEEPGEMM
+        use_low_precision_moe_combine: true
+      moe_expert_parallel_size: 16
+      num_postprocess_workers: 4
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      stream_interval: 100
+      tensor_parallel_size: 16
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "16"
+    TOTAL_GPUS: "40"
+
+frontend:
+  type: "dynamo"
+  nginx_container: "nginx-sqsh"
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx4_gen1_dep32_batch16_eplb0_mtp0_666.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx4_gen1_dep32_batch16_eplb0_mtp0_666.yaml
new file mode 100644
index 000000000..8675bf58d
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx4_gen1_dep32_batch16_eplb0_mtp0_666.yaml
@@ -0,0 +1,121 @@
+name: ctx4_gen1_dep32_batch16_eplb0_mtp0_666
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 8
+  prefill_workers: 4
+  gpus_per_prefill: 8
+
+  decode_workers: 1
+  decode_nodes: 8
+  gpus_per_decode: 32
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    TRTLLM_ENABLE_PDL: "1"
+    ENROOT_ALLOW_DEV: "yes"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    TRTLLM_ENABLE_PDL: "1"
+    ENROOT_ALLOW_DEV: "yes"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED"
+    ENABLE_CONFIGURABLE_MOE: "1"
+
+  trtllm_config:
+    prefill:
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.4
+      max_batch_size: 2
+      max_num_tokens: 16384
+      max_seq_len: 8232
+      moe_config:
+        backend: DEEPGEMM
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      tensor_parallel_size: 8
+
+
+    decode:
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      cuda_graph_config:
+        batch_sizes:
+        - 1
+        - 2
+        - 4
+        - 8
+        - 16
+        enable_padding: true
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.75
+      max_batch_size: 16
+      max_num_tokens: 16
+      max_seq_len: 9256
+      moe_config:
+        backend: DEEPGEMM
+        use_low_precision_moe_combine: true
+      moe_expert_parallel_size: 32
+      num_postprocess_workers: 4
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      stream_interval: 100
+      tensor_parallel_size: 32
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "32"
+    TOTAL_GPUS: "64"
+
+frontend:
+  type: "dynamo"
+  nginx_container: "nginx-sqsh"
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx5_gen1_dep16_batch64_eplb0_mtp0_1229.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx5_gen1_dep16_batch64_eplb0_mtp0_1229.yaml
new file mode 100644
index 000000000..ca9b432d0
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb200-fp8/8k1k/disagg/stp/ctx5_gen1_dep16_batch64_eplb0_mtp0_1229.yaml
@@ -0,0 +1,127 @@
+name: ctx5_gen1_dep16_batch64_eplb0_mtp0_1229
+
+model:
+  path: "dsr1-fp8"
+  container: "dynamo-trtllm"
+  precision: "fp8"
+
+resources:
+  gpu_type: "gb200"
+  prefill_nodes: 10
+  prefill_workers: 5
+  gpus_per_prefill: 8
+
+  decode_workers: 1
+  decode_nodes: 4
+  gpus_per_decode: 16
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    TRTLLM_ENABLE_PDL: "1"
+    ENROOT_ALLOW_DEV: "yes"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    TRTLLM_ENABLE_PDL: "1"
+    ENROOT_ALLOW_DEV: "yes"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED"
+    ENABLE_CONFIGURABLE_MOE: "1"
+
+  trtllm_config:
+    prefill:
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      enable_attention_dp: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.4
+      max_batch_size: 2
+      max_num_tokens: 16384
+      max_seq_len: 8232
+      moe_config:
+        backend: DEEPGEMM
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      tensor_parallel_size: 8
+
+
+    decode:
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      cuda_graph_config:
+        batch_sizes:
+        - 1
+        - 2
+        - 4
+        - 8
+        - 16
+        - 24
+        - 32
+        - 40
+        - 48
+        - 56
+        - 64
+        enable_padding: true
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      max_batch_size: 64
+      max_num_tokens: 64
+      max_seq_len: 9256
+      moe_config:
+        backend: DEEPGEMM
+        use_low_precision_moe_combine: true
+      moe_expert_parallel_size: 16
+      num_postprocess_workers: 4
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      stream_interval: 100
+      tensor_parallel_size: 16
+
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "16"
+    TOTAL_GPUS: "56"
+
+frontend:
+  type: "dynamo"
+  nginx_container: "nginx-sqsh"
+
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep32_batch8_eplb0_mtp.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep32_batch8_eplb0_mtp.yaml
new file mode 100644
index 000000000..b3d1dd62a
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep32_batch8_eplb0_mtp.yaml
@@ -0,0 +1,127 @@
+name: "ctx1_gen1_dep32_batch8_eplb0_mtp"
+
+model:
+  path: "dsr1"
+  container: "dynamo-trtllm"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 1
+  decode_nodes: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  decode_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  trtllm_config:
+    prefill:
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 2048
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      enable_attention_dp: true
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      moe_config:
+        backend: TRTLLM
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      tensor_parallel_size: 32
+      moe_expert_parallel_size: 32
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      pipeline_parallel_size: 1
+      max_batch_size: 8
+      max_num_tokens: 32
+      max_seq_len: 2088
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+        - 1
+        - 2
+        - 4
+        - 8
+      print_iter_log: true
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "2"
+    DECODE_GPUS: "32"
+    TOTAL_GPUS: "34"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
+
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep4_batch768_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep4_batch768_eplb0_mtp1.yaml
new file mode 100644
index 000000000..2b9d42408
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen1_dep4_batch768_eplb0_mtp1.yaml
@@ -0,0 +1,222 @@
+name: "ctx1_gen1_dep4_batch768_eplb0_mtp1"
+
+model:
+  path: "dsr1"
+  container: "dynamo-trtllm"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 1
+  decode_nodes: 1
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  decode_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  trtllm_config:
+    prefill:
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 2048
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      enable_attention_dp: true
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      moe_config:
+        backend: TRTLLM
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 1
+
+    decode:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      pipeline_parallel_size: 1
+      max_batch_size: 768
+      max_num_tokens: 1536
+      max_seq_len: 2088
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+        - 1
+        - 2
+        - 4
+        - 8
+        - 16
+        - 24
+        - 32
+        - 40
+        - 48
+        - 56
+        - 64
+        - 72
+        - 80
+        - 88
+        - 96
+        - 104
+        - 112
+        - 120
+        - 128
+        - 136
+        - 144
+        - 152
+        - 160
+        - 168
+        - 176
+        - 184
+        - 192
+        - 200
+        - 208
+        - 216
+        - 224
+        - 232
+        - 240
+        - 248
+        - 256
+        - 264
+        - 272
+        - 280
+        - 288
+        - 296
+        - 304
+        - 312
+        - 320
+        - 328
+        - 336
+        - 344
+        - 352
+        - 360
+        - 368
+        - 376
+        - 384
+        - 392
+        - 400
+        - 408
+        - 416
+        - 424
+        - 432
+        - 440
+        - 448
+        - 456
+        - 464
+        - 472
+        - 480
+        - 488
+        - 496
+        - 504
+        - 512
+        - 520
+        - 528
+        - 536
+        - 544
+        - 552
+        - 560
+        - 568
+        - 576
+        - 584
+        - 592
+        - 600
+        - 608
+        - 616
+        - 624
+        - 632
+        - 640
+        - 648
+        - 656
+        - 664
+        - 672
+        - 680
+        - 688
+        - 696
+        - 704
+        - 712
+        - 720
+        - 728
+        - 736
+        - 744
+        - 752
+        - 760
+        - 768
+      print_iter_log: true
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+        dtype: fp8
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 1
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "2"
+    DECODE_GPUS: "4"
+    TOTAL_GPUS: "6"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
+
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml
new file mode 100644
index 000000000..c2c4c537a
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml
@@ -0,0 +1,125 @@
+name: "ctx1_gen4_tep8_batch1_eplb0_mtp3"
+
+model:
+  path: "dsr1"
+  container: "dynamo-trtllm"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 4
+  decode_nodes: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  decode_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  trtllm_config:
+    prefill:
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 2048
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      enable_attention_dp: true
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      moe_config:
+        backend: TRTLLM
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      pipeline_parallel_size: 1
+      max_batch_size: 1
+      max_num_tokens: 4
+      max_seq_len: 2088
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+        - 1
+      print_iter_log: true
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      stream_interval: 100
+      num_postprocess_workers: 4
+      allreduce_strategy: MNNVL
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "2"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "34"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
+
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml
new file mode 100644
index 000000000..da70d4074
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx1_gen4_tep8_batch8_eplb0_mtp3.yaml
@@ -0,0 +1,130 @@
+name: "ctx1_gen4_tep8_batch8_eplb0_mtp3"
+
+model:
+  path: "dsr1"
+  container: "dynamo-trtllm"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 4
+  decode_nodes: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  decode_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  trtllm_config:
+    prefill:
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 2048
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      enable_attention_dp: true
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      moe_config:
+        backend: TRTLLM
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      pipeline_parallel_size: 1
+      max_batch_size: 8
+      max_num_tokens: 32
+      max_seq_len: 2088
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+        - 1
+        - 2
+        - 3
+        - 4
+        - 6
+        - 8
+      print_iter_log: true
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      stream_interval: 100
+      num_postprocess_workers: 4
+      allreduce_strategy: MNNVL
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "2"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "34"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
+
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx3_gen1_dep16_batch128_eplb256_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx3_gen1_dep16_batch128_eplb256_mtp1.yaml
new file mode 100644
index 000000000..12174174c
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx3_gen1_dep16_batch128_eplb256_mtp1.yaml
@@ -0,0 +1,145 @@
+name: "ctx3_gen1_dep16_batch128_eplb256_mtp1"
+
+model:
+  path: "dsr1"
+  container: "dynamo-trtllm"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+  prefill_nodes: 2
+  prefill_workers: 3
+  gpus_per_prefill: 2
+
+  decode_workers: 1
+  decode_nodes: 4
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  decode_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  trtllm_config:
+    prefill:
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 2048
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      enable_attention_dp: true
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      moe_config:
+        backend: TRTLLM
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 1
+
+    decode:
+      tensor_parallel_size: 16
+      moe_expert_parallel_size: 16
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      pipeline_parallel_size: 1
+      max_batch_size: 128
+      max_num_tokens: 256
+      max_seq_len: 2088
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+        - 1
+        - 2
+        - 4
+        - 8
+        - 16
+        - 24
+        - 32
+        - 40
+        - 48
+        - 56
+        - 64
+        - 72
+        - 80
+        - 88
+        - 96
+        - 104
+        - 112
+        - 120
+        - 128
+      print_iter_log: true
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.7
+        dtype: fp8
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+        load_balancer:
+          num_slots: 256
+          layer_updates_per_iter: 1
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 1
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "2"
+    DECODE_GPUS: "16"
+    TOTAL_GPUS: "22"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
+
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx3_gen1_dep32_batch32_eplb288_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx3_gen1_dep32_batch32_eplb288_mtp3.yaml
new file mode 100644
index 000000000..502ae7cf2
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/mtp/ctx3_gen1_dep32_batch32_eplb288_mtp3.yaml
@@ -0,0 +1,133 @@
+name: "ctx3_gen1_dep32_batch32_eplb288_mtp3"
+
+model:
+  path: "dsr1"
+  container: "dynamo-trtllm"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+  prefill_nodes: 2
+  prefill_workers: 3
+  gpus_per_prefill: 2
+
+  decode_workers: 1
+  decode_nodes: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  decode_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  trtllm_config:
+    prefill:
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 2048
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      enable_attention_dp: true
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      moe_config:
+        backend: TRTLLM
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      tensor_parallel_size: 32
+      moe_expert_parallel_size: 32
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      pipeline_parallel_size: 1
+      max_batch_size: 32
+      max_num_tokens: 128
+      max_seq_len: 2088
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+        - 1
+        - 2
+        - 4
+        - 8
+        - 16
+        - 24
+        - 32
+      print_iter_log: true
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+        load_balancer:
+          num_slots: 288
+          layer_updates_per_iter: 1
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "2"
+    DECODE_GPUS: "32"
+    TOTAL_GPUS: "38"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
+
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml
new file mode 100644
index 000000000..cba8a4f64
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml
@@ -0,0 +1,119 @@
+name: "ctx1_gen4_tep8_batch1_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "dynamo-trtllm"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 4
+  decode_nodes: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  decode_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  trtllm_config:
+    prefill:
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 2048
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      enable_attention_dp: true
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      moe_config:
+        backend: TRTLLM
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+
+    decode:
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      pipeline_parallel_size: 1
+      max_batch_size: 1
+      max_num_tokens: 1
+      max_seq_len: 2088
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+        - 1
+      print_iter_log: true
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      stream_interval: 100
+      num_postprocess_workers: 4
+      allreduce_strategy: MNNVL
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "2"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "34"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
+
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml
new file mode 100644
index 000000000..794556055
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx1_gen4_tep8_batch32_eplb0_mtp0.yaml
@@ -0,0 +1,133 @@
+name: "ctx1_gen4_tep8_batch32_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "dynamo-trtllm"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 4
+  decode_nodes: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  decode_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  trtllm_config:
+    prefill:
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 2048
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      enable_attention_dp: true
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      moe_config:
+        backend: TRTLLM
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+
+    decode:
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      pipeline_parallel_size: 1
+      max_batch_size: 32
+      max_num_tokens: 32
+      max_seq_len: 2088
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+        - 1
+        - 2
+        - 3
+        - 4
+        - 6
+        - 8
+        - 10
+        - 11
+        - 12
+        - 16
+        - 18
+        - 20
+        - 24
+        - 28
+        - 32
+      print_iter_log: true
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      cache_transceiver_config:
+        max_tokens_in_buffer: 16384
+        backend: UCX
+      stream_interval: 100
+      num_postprocess_workers: 4
+      allreduce_strategy: MNNVL
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "2"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "34"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
+
+infra:
+  etcd_nats_dedicated_node: true
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0.yaml
new file mode 100644
index 000000000..8249a5369
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0.yaml
@@ -0,0 +1,123 @@
+name: "ctx2_gen1_dep32_batch32_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "dynamo-trtllm"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+  prefill_nodes: 1
+  prefill_workers: 2
+
+  decode_workers: 1
+  decode_nodes: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  decode_environment:
+    TLLM_OVERRIDE_LAYER_NUM: "61"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TRTLLM_ENABLE_PDL: "1"
+
+  trtllm_config:
+    prefill:
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 2048
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      enable_attention_dp: true
+      pipeline_parallel_size: 1
+      print_iter_log: true
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      moe_config:
+        backend: TRTLLM
+      nvfp4_gemm_config:
+        allowed_backends:
+        - cutlass
+        - cublaslt
+        - cutedsl
+        - cuda_core
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+
max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "32" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx2_gen1_dep8_batch1024_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx2_gen1_dep8_batch1024_eplb0_mtp0.yaml new file mode 100644 index 000000000..5f96315ff --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx2_gen1_dep8_batch1024_eplb0_mtp0.yaml @@ -0,0 +1,247 @@ +name: "ctx2_gen1_dep8_batch1024_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 2 + + decode_workers: 1 + decode_nodes: 2 + + 
gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1024 + max_num_tokens: 1024 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + - 520 + - 528 + - 536 + - 544 + - 552 + - 560 + - 568 + - 576 + - 584 + - 592 + - 600 + - 608 + - 616 + - 624 + - 632 + - 640 + - 648 + - 656 + - 664 + - 672 + - 680 + - 688 + - 696 + - 704 + - 712 + - 720 + - 728 + - 
736 + - 744 + - 752 + - 760 + - 768 + - 776 + - 784 + - 792 + - 800 + - 808 + - 816 + - 824 + - 832 + - 840 + - 848 + - 856 + - 864 + - 872 + - 880 + - 888 + - 896 + - 904 + - 912 + - 920 + - 928 + - 936 + - 944 + - 952 + - 960 + - 968 + - 976 + - 984 + - 992 + - 1000 + - 1008 + - 1016 + - 1024 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "12" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx3_gen1_dep16_batch256_eplb256_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx3_gen1_dep16_batch256_eplb256_mtp0.yaml new file mode 100644 index 000000000..50f4f8f0f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx3_gen1_dep16_batch256_eplb256_mtp0.yaml @@ -0,0 +1,155 @@ +name: "ctx3_gen1_dep16_batch256_eplb256_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + 
prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + num_slots: 256 + layer_updates_per_iter: 1 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the 
bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "16" + TOTAL_GPUS: "22" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx3_gen1_dep32_batch64_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx3_gen1_dep32_batch64_eplb0_mtp0.yaml new file mode 100644 index 000000000..9acddc31e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/1k1k/disagg/stp/ctx3_gen1_dep32_batch64_eplb0_mtp0.yaml @@ -0,0 +1,128 @@ +name: "ctx3_gen1_dep32_batch64_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 2048 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + 
kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "32" + TOTAL_GPUS: "38" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep16_batch32_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep16_batch32_eplb0_mtp3.yaml new file mode 100644 index 000000000..4d258c289 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep16_batch32_eplb0_mtp3.yaml @@ -0,0 +1,129 @@ +name: "ctx10_gen1_dep16_batch32_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 5 + prefill_workers: 10 + + decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + 
max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "16" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml new file mode 100644 index 000000000..c10a8598b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx10_gen1_dep8_batch256_eplb0_mtp1.yaml @@ -0,0 +1,157 @@ +name: "ctx10_gen1_dep8_batch256_eplb0_mtp1" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 5 + prefill_workers: 10 + + decode_workers: 1 + decode_nodes: 2 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + 
max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx13_gen1_dep16_batch64_eplb256_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx13_gen1_dep16_batch64_eplb256_mtp3.yaml new file mode 100644 index 000000000..df0375f0e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx13_gen1_dep16_batch64_eplb256_mtp3.yaml @@ -0,0 +1,137 @@ +name: "ctx13_gen1_dep16_batch64_eplb256_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 7 + prefill_workers: 13 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + num_slots: 256 + layer_updates_per_iter: 1 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "16" + TOTAL_GPUS: "42" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..6ce834ce3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen3_tep8_batch8_eplb0_mtp3.yaml @@ -0,0 +1,128 @@ +name: "ctx1_gen3_tep8_batch8_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 3 + decode_nodes: 6 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "26" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml new file mode 100644 index 000000000..53771a342 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3.yaml @@ -0,0 +1,125 @@ +name: "ctx1_gen4_tep8_batch1_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "34" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml new file mode 100644 index 000000000..b2349f421 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3.yaml @@ -0,0 +1,128 @@ +name: "ctx1_gen4_tep8_batch4_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 3 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "34" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx4_gen1_dep32_batch4_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx4_gen1_dep32_batch4_eplb0_mtp3.yaml new file mode 100644 index 000000000..ddd5641a9 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx4_gen1_dep32_batch4_eplb0_mtp3.yaml @@ -0,0 +1,125 @@ +name: "ctx4_gen1_dep32_batch4_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 2 + prefill_workers: 4 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + 
max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "32" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx8_gen1_dep32_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx8_gen1_dep32_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..aaca79561 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/mtp/ctx8_gen1_dep32_batch8_eplb0_mtp3.yaml @@ -0,0 +1,126 @@ +name: "ctx8_gen1_dep32_batch8_eplb0_mtp3" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 4 + prefill_workers: 8 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + 
max_tokens_in_buffer: 16384 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "32" + TOTAL_GPUS: "48" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx11_gen3_dep4_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx11_gen3_dep4_batch256_eplb0_mtp0.yaml new file mode 100644 index 000000000..f141a5005 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx11_gen3_dep4_batch256_eplb0_mtp0.yaml @@ -0,0 +1,152 @@ +name: "ctx11_gen3_dep4_batch256_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 6 + prefill_workers: 11 + gpus_per_prefill: 2 + + decode_workers: 3 + decode_nodes: 3 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "4" + TOTAL_GPUS: "34" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx14_gen1_dep16_batch128_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx14_gen1_dep16_batch128_eplb0_mtp0.yaml new file mode 100644 index 000000000..882083834 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx14_gen1_dep16_batch128_eplb0_mtp0.yaml @@ -0,0 +1,135 @@ +name: "ctx14_gen1_dep16_batch128_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 7 + prefill_workers: 14 + + decode_workers: 1 + decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + 
max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "16" + TOTAL_GPUS: "44" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..e4568f7e1 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen3_tep8_batch16_eplb0_mtp0.yaml @@ -0,0 +1,123 @@ +name: "ctx1_gen3_tep8_batch16_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 3 + decode_nodes: 6 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "26" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..5a6e21737 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0.yaml @@ -0,0 +1,119 @@ +name: "ctx1_gen4_tep8_batch1_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + 
decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "34" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch2_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch2_eplb0_mtp0.yaml new file mode 100644 index 000000000..4b8ad5a43 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen4_tep8_batch2_eplb0_mtp0.yaml @@ -0,0 +1,120 @@ +name: "ctx1_gen4_tep8_batch2_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 2 + max_num_tokens: 2 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + allreduce_strategy: MNNVL + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "8" + TOTAL_GPUS: "34" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml new file mode 100644 index 000000000..6f6194a84 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx1_gen5_tep4_batch4_eplb0_mtp0.yaml @@ -0,0 +1,120 @@ +name: "ctx1_gen5_tep4_batch4_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 5 + 
decode_nodes: 5 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 4 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "4" + TOTAL_GPUS: "22" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx7_gen1_dep32_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx7_gen1_dep32_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..f68b83534 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx7_gen1_dep32_batch16_eplb0_mtp0.yaml @@ -0,0 +1,122 @@ +name: "ctx7_gen1_dep32_batch16_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 4 + prefill_workers: 7 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + 
cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "32" + TOTAL_GPUS: "46" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx9_gen1_dep16_batch64_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx9_gen1_dep16_batch64_eplb0_mtp0.yaml new file mode 100644 index 000000000..db6ae1b3f --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp4/8k1k/disagg/stp/ctx9_gen1_dep16_batch64_eplb0_mtp0.yaml @@ -0,0 +1,128 @@ +name: "ctx9_gen1_dep16_batch64_eplb0_mtp0" + +model: + path: "dsr1" + container: "dynamo-trtllm" + precision: "fp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 5 + prefill_workers: 9 + gpus_per_prefill: 2 + + decode_workers: 1 
+ decode_nodes: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_ENABLE_PDL: "1" + + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + moe_config: + backend: TRTLLM + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + dtype: fp8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + cache_transceiver_config: + max_tokens_in_buffer: 16384 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "2" + DECODE_GPUS: "16" + TOTAL_GPUS: "34" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false + +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp3_666.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp3_666.yaml new file mode 100644 index 000000000..f03320ce7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp3_666.yaml @@ -0,0 +1,132 @@ +name: ctx1_gen1_dep16_batch32_eplb0_mtp3_666 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 
16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "20" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3_180.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3_180.yaml new file mode 100644 index 000000000..3783dd563 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen1_dep32_batch4_eplb0_mtp3_180.yaml @@ -0,0 +1,128 @@ +name: ctx1_gen1_dep32_batch4_eplb0_mtp3_180 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml new file mode 100644 index 000000000..d4cf77025 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml @@ -0,0 +1,129 @@ +name: ctx1_gen4_tep8_batch1_eplb0_mtp3_8 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml new file mode 100644 index 000000000..e6d895550 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml @@ -0,0 +1,129 @@ +name: ctx1_gen4_tep8_batch4_eplb0_mtp3_24 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx2_gen1_dep16_batch128_eplb0_mtp1_2253.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx2_gen1_dep16_batch128_eplb0_mtp1_2253.yaml new file mode 100644 index 000000000..f178dc30a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx2_gen1_dep16_batch128_eplb0_mtp1_2253.yaml @@ -0,0 +1,144 @@ +name: ctx2_gen1_dep16_batch128_eplb0_mtp1_2253 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 128 + max_num_tokens: 256 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx2_gen1_dep32_batch16_eplb0_mtp3_564.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx2_gen1_dep32_batch16_eplb0_mtp3_564.yaml new file mode 100644 index 000000000..562ada512 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx2_gen1_dep32_batch16_eplb0_mtp3_564.yaml @@ -0,0 +1,130 @@ +name: ctx2_gen1_dep32_batch16_eplb0_mtp3_564 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx3_gen2_dep8_batch512_eplb0_mtp1_8192.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx3_gen2_dep8_batch512_eplb0_mtp1_8192.yaml new file mode 100644 index 000000000..87ba559b2 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/mtp/ctx3_gen2_dep8_batch512_eplb0_mtp1_8192.yaml @@ -0,0 +1,192 @@ +name: ctx3_gen2_dep8_batch512_eplb0_mtp1_8192 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 3 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 512 + max_num_tokens: 1024 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0_84.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0_84.yaml new file mode 100644 index 000000000..57803a156 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch16_eplb0_mtp0_84.yaml @@ -0,0 +1,125 @@ +name: ctx1_gen4_tep8_batch16_eplb0_mtp0_84 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: 
false + free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml new file mode 100644 index 000000000..3f3905468 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml @@ -0,0 +1,123 @@ +name: ctx1_gen4_tep8_batch1_eplb0_mtp0_4 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml new file mode 100644 index 000000000..6e2ba5e8e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml @@ -0,0 +1,123 @@ +name: ctx1_gen4_tep8_batch4_eplb0_mtp0_24 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 4 + max_num_tokens: 4 + max_seq_len: 2088 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx2_gen1_dep16_batch128_eplb0_mtp0_2253.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx2_gen1_dep16_batch128_eplb0_mtp0_2253.yaml new file mode 100644 index 000000000..2580bab99 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx2_gen1_dep16_batch128_eplb0_mtp0_2253.yaml @@ -0,0 +1,138 @@ +name: ctx2_gen1_dep16_batch128_eplb0_mtp0_2253 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0_1229.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0_1229.yaml new file mode 100644 index 000000000..c7dc2dcdd --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx2_gen1_dep32_batch32_eplb0_mtp0_1229.yaml @@ -0,0 +1,126 @@ +name: ctx2_gen1_dep32_batch32_eplb0_mtp0_1229 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx3_gen2_dep8_batch512_eplb0_mtp0_8602.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx3_gen2_dep8_batch512_eplb0_mtp0_8602.yaml new file mode 100644 index 000000000..c4613dbb2 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx3_gen2_dep8_batch512_eplb0_mtp0_8602.yaml @@ -0,0 +1,186 @@ +name: ctx3_gen2_dep8_batch512_eplb0_mtp0_8602 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 3 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 512 + max_num_tokens: 512 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx3_gen2_dep8_batch768_eplb0_mtp0_12288.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx3_gen2_dep8_batch768_eplb0_mtp0_12288.yaml new file mode 100644 index 000000000..bdc07bf9d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/1k1k/disagg/stp/ctx3_gen2_dep8_batch768_eplb0_mtp0_12288.yaml @@ -0,0 +1,218 @@ +name: ctx3_gen2_dep8_batch768_eplb0_mtp0_12288 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 3 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + - 520 + - 528 + - 536 + - 544 + - 552 + - 560 + - 568 + - 576 + - 584 + - 592 + - 600 + - 608 + - 616 + - 624 + - 632 + - 640 + - 648 + - 656 + - 664 + - 672 + - 680 + - 688 + - 696 + - 704 + - 712 + - 720 + - 728 + - 736 + - 744 + - 752 + - 760 + - 768 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 768 + max_num_tokens: 768 + max_seq_len: 2088 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx10_gen1_dep16_batch64_eplb0_mtp1_1229.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx10_gen1_dep16_batch64_eplb0_mtp1_1229.yaml new file mode 100644 index 000000000..95a1bd02e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx10_gen1_dep16_batch64_eplb0_mtp1_1229.yaml @@ -0,0 +1,136 @@ +name: ctx10_gen1_dep16_batch64_eplb0_mtp1_1229 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 10 + prefill_workers: 10 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 64 + max_num_tokens: 128 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml new file mode 100644 index 000000000..644b5a20b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx1_gen4_tep8_batch1_eplb0_mtp3_8.yaml @@ -0,0 +1,129 @@ +name: ctx1_gen4_tep8_batch1_eplb0_mtp3_8 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 9256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml new file mode 100644 index 000000000..5c7a8ed5c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx1_gen4_tep8_batch4_eplb0_mtp3_24.yaml @@ -0,0 +1,129 @@ +name: ctx1_gen4_tep8_batch4_eplb0_mtp3_24 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 9256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx6_gen1_dep32_batch8_eplb0_mtp3_333.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx6_gen1_dep32_batch8_eplb0_mtp3_333.yaml new file mode 100644 index 000000000..c78705873 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx6_gen1_dep32_batch8_eplb0_mtp3_333.yaml @@ -0,0 +1,129 @@ +name: ctx6_gen1_dep32_batch8_eplb0_mtp3_333 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 6 + prefill_workers: 6 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx7_gen1_dep8_batch128_eplb0_mtp1_1229.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx7_gen1_dep8_batch128_eplb0_mtp1_1229.yaml new file mode 100644 index 000000000..e00287de7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx7_gen1_dep8_batch128_eplb0_mtp1_1229.yaml @@ -0,0 +1,144 @@ +name: ctx7_gen1_dep8_batch128_eplb0_mtp1_1229 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 7 + prefill_workers: 7 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 128 + max_num_tokens: 256 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx8_gen1_dep16_batch32_eplb0_mtp3_666.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx8_gen1_dep16_batch32_eplb0_mtp3_666.yaml new file mode 100644 index 000000000..162f003e4 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/mtp/ctx8_gen1_dep16_batch32_eplb0_mtp3_666.yaml @@ -0,0 +1,132 @@ +name: ctx8_gen1_dep16_batch32_eplb0_mtp3_666 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 8 + prefill_workers: 8 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "48" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml new file mode 100644 index 000000000..3a470113e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch1_eplb0_mtp0_4.yaml @@ -0,0 +1,123 @@ +name: ctx1_gen4_tep8_batch1_eplb0_mtp0_4 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml new file mode 100644 index 000000000..8b14ffd93 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch4_eplb0_mtp0_24.yaml @@ -0,0 +1,123 @@ +name: ctx1_gen4_tep8_batch4_eplb0_mtp0_24 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 4 + max_num_tokens: 4 + max_seq_len: 9256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch8_eplb0_mtp0_36.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch8_eplb0_mtp0_36.yaml new file mode 100644 index 000000000..f5994c054 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx1_gen4_tep8_batch8_eplb0_mtp0_36.yaml @@ -0,0 +1,124 @@ +name: ctx1_gen4_tep8_batch8_eplb0_mtp0_36 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + allreduce_strategy: MNNVL + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + enable_padding: true + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + max_batch_size: 8 + max_num_tokens: 8 + max_seq_len: 9256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx4_gen1_dep16_batch32_eplb0_mtp0_666.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx4_gen1_dep16_batch32_eplb0_mtp0_666.yaml new file mode 100644 index 000000000..fcf7292da --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx4_gen1_dep16_batch32_eplb0_mtp0_666.yaml @@ -0,0 +1,126 @@ +name: ctx4_gen1_dep16_batch32_eplb0_mtp0_666 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 4 + prefill_workers: 4 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx6_gen1_dep32_batch16_eplb0_mtp0_512.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx6_gen1_dep32_batch16_eplb0_mtp0_512.yaml new file mode 100644 index 000000000..ac8d6faa6 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx6_gen1_dep32_batch16_eplb0_mtp0_512.yaml @@ -0,0 +1,124 @@ +name: ctx6_gen1_dep32_batch16_eplb0_mtp0_512 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 6 + prefill_workers: 6 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 32 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 32 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx7_gen1_dep16_batch64_eplb0_mtp0_1229.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx7_gen1_dep16_batch64_eplb0_mtp0_1229.yaml new file mode 100644 index 000000000..e585cc065 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx7_gen1_dep16_batch64_eplb0_mtp0_1229.yaml @@ -0,0 +1,130 @@ +name: ctx7_gen1_dep16_batch64_eplb0_mtp0_1229 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 7 + prefill_workers: 7 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 16 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 16 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "44" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx7_gen1_dep8_batch256_eplb0_mtp0_2151.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx7_gen1_dep8_batch256_eplb0_mtp0_2151.yaml new file mode 100644 index 000000000..87272ba14 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/gb300-fp8/8k1k/disagg/stp/ctx7_gen1_dep8_batch256_eplb0_mtp0_2151.yaml @@ -0,0 +1,154 @@ +name: ctx7_gen1_dep8_batch256_eplb0_mtp0_2151 + +model: + path: "dsr1-fp8" + container: "dynamo-trtllm" + precision: "fp8" + +resources: + gpu_type: "gb300" + prefill_nodes: 7 + prefill_workers: 7 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + TLLM_OVERRIDE_LAYER_NUM: "61" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + TRTLLM_ENABLE_PDL: "1" + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TRTLLM_FORCE_COMM_METHOD: "NVLINK_TWO_SIDED" + ENABLE_CONFIGURABLE_MOE: "1" + + trtllm_config: + prefill: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: null + disable_overlap_scheduler: true + enable_attention_dp: true + kv_cache_config: + dtype: fp8 + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.1 + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + moe_config: + backend: DEEPGEMM + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + print_iter_log: true + tensor_parallel_size: 4 + + + decode: + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + enable_padding: true + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 9256 + moe_config: + backend: DEEPGEMM + use_low_precision_moe_combine: true + moe_expert_parallel_size: 8 + num_postprocess_workers: 4 + pipeline_parallel_size: 1 + print_iter_log: true + stream_interval: 100 + tensor_parallel_size: 8 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + + enable_multiple_frontends: false + + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false +infra: + etcd_nats_dedicated_node: true diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp2.yaml new file mode 100644 index 000000000..67da71d3d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch32_eplb0_mtp2.yaml @@ -0,0 +1,111 @@ +name: h100_1k1k_ctx1dep16_gen1dep16_batch32_eplb0_mtp2_chunked_false +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 1 + decode_nodes: 2 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + decode_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + 
moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "32" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1.yaml new file mode 100644 index 000000000..766d7fd79 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen1_dep16_batch64_eplb0_mtp1.yaml @@ -0,0 +1,115 @@ +name: h100_1k1k_ctx1dep16_gen1dep16_batch64_eplb0_mtp1_chunked_false +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 1 + decode_nodes: 2 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + decode_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "32" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_dep16_batch4_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_dep16_batch4_eplb0_mtp3.yaml new file mode 100644 index 000000000..d2e17ac7a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_dep16_batch4_eplb0_mtp3.yaml @@ -0,0 +1,107 @@ +name: h100_1k1k_ctx1dep16_gen3dep16_batch4_eplb0_mtp3_chunked_false +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + decode_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + 
cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch128_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch128_eplb0_mtp3.yaml new file mode 100644 index 000000000..a48f9c94a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch128_eplb0_mtp3.yaml @@ -0,0 +1,120 @@ +name: h100_1k1k_ctx1dep16_gen3tep16_batch128_eplb0_mtp3_chunked_false +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 128 + max_num_tokens: 512 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch16_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch16_eplb0_mtp3.yaml new file mode 100644 index 000000000..c07b82fad --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch16_eplb0_mtp3.yaml @@ -0,0 +1,106 @@ +name: h100_1k1k_ctx1dep16_gen3tep16_batch16_eplb0_mtp3_chunked_false +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml new file mode 100644 index 000000000..d64e9777c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml @@ -0,0 +1,104 @@ +name: h100_1k1k_ctx1dep16_gen3tep16_batch1_eplb0_mtp3_chunked_false +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml new file mode 100644 index 000000000..077357b39 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml @@ -0,0 +1,104 @@ +name: h100_1k1k_ctx1dep16_gen3tep16_batch2_eplb0_mtp3_chunked_false +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 2 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch32_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch32_eplb0_mtp3.yaml new file mode 100644 index 000000000..414388c6b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch32_eplb0_mtp3.yaml @@ -0,0 +1,108 @@ +name: h100_1k1k_ctx1dep16_gen3tep16_batch32_eplb0_mtp3_chunked_false +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..d49f37947 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml @@ -0,0 +1,105 @@ +name: h100_1k1k_ctx1dep16_gen3tep16_batch8_eplb0_mtp3_chunked_false +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..1624bcc3e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch16_eplb0_mtp0.yaml @@ -0,0 +1,103 @@ +name: ctx1dep16_gen3dep16_batch16_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + decode_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null 
+ disable_overlap_scheduler: true + enable_chunked_prefill: true + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..f632508e1 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch32_eplb0_mtp0.yaml @@ -0,0 +1,105 @@ +name: ctx1dep16_gen3dep16_batch32_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + decode_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null 
+ disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch4_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch4_eplb0_mtp0.yaml new file mode 100644 index 000000000..6cd4b7697 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch4_eplb0_mtp0.yaml @@ -0,0 +1,101 @@ +name: ctx1dep16_gen3dep16_batch4_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + decode_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + 
disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch8_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch8_eplb0_mtp0.yaml new file mode 100644 index 000000000..10ab482b3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_dep16_batch8_eplb0_mtp0.yaml @@ -0,0 +1,102 @@ +name: ctx1dep16_gen3dep16_batch8_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + decode_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + 
disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..850acc0da --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch16_eplb0_mtp0.yaml @@ -0,0 +1,100 @@ +name: ctx1dep16_gen3tep16_batch16_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..a1d5c9aac --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml @@ -0,0 +1,98 @@ +name: ctx1dep16_gen3tep16_batch1_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: 
+ UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: true + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml new file mode 100644 index 000000000..c3b1144bd --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml @@ -0,0 +1,98 @@ +name: ctx1dep16_gen3tep16_batch2_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: true + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 2 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml new file mode 100644 index 000000000..2e972e14b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml @@ -0,0 +1,99 @@ +name: ctx1dep16_gen3tep16_batch8_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + 
UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: true + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx2_gen1_dep16_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx2_gen1_dep16_batch256_eplb0_mtp0.yaml new file mode 100644 index 000000000..3dd8f5482 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/1k1k/disagg/stp/ctx2_gen1_dep16_batch256_eplb0_mtp0.yaml @@ -0,0 +1,133 @@ +name: ctx2dep16_gen1dep16_batch256_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 2 + prefill_nodes: 4 + decode_workers: 1 + decode_nodes: 2 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + decode_environment: + UCX_TLS: rc,dc,ud,cuda_copy,cuda_ipc,tcp + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 2 + max_num_tokens: 2048 + max_seq_len: 2048 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: 
null + disable_overlap_scheduler: true + enable_chunked_prefill: true + moe_config: + backend: WIDEEP + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2088 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8192 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "48" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen1_dep16_batch4_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen1_dep16_batch4_eplb0_mtp3.yaml new file mode 100644 index 000000000..007d7e4eb --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen1_dep16_batch4_eplb0_mtp3.yaml @@ -0,0 +1,107 @@ +name: h100_8k1k_ctx1dep16_gen1dep16_batch4_eplb0_mtp3 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 1 + decode_nodes: 2 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + 
enable_chunked_prefill: true + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 4 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "32" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen2_tep16_batch32_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen2_tep16_batch32_eplb0_mtp3.yaml new file mode 100644 index 000000000..ecf82c12b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen2_tep16_batch32_eplb0_mtp3.yaml @@ -0,0 +1,109 @@ +name: h100_8k1k_ctx1dep16_gen2tep16_batch32_eplb0_mtp3 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 2 + decode_nodes: 4 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + 
kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 32 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "48" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml new file mode 100644 index 000000000..221dfc3f7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch1_eplb0_mtp3.yaml @@ -0,0 +1,105 @@ +name: h100_8k1k_ctx1dep16_gen3tep16_batch1_eplb0_mtp3 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: true + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml new file mode 100644 index 000000000..3b6a18fe6 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch2_eplb0_mtp3.yaml @@ -0,0 +1,105 @@ +name: h100_8k1k_ctx1dep16_gen3tep16_batch2_eplb0_mtp3 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: true + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 2 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..baf2c1e0d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx1_gen3_tep16_batch8_eplb0_mtp3.yaml @@ -0,0 +1,106 @@ +name: h100_8k1k_ctx1dep16_gen3tep16_batch8_eplb0_mtp3 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + UCX_CUDA_IPC_ENABLE_MNNVL: n + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + kv_cache_config: + 
enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx2_gen1_dep16_batch8_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx2_gen1_dep16_batch8_eplb0_mtp3.yaml new file mode 100644 index 000000000..8be542e76 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/mtp/ctx2_gen1_dep16_batch8_eplb0_mtp3.yaml @@ -0,0 +1,108 @@ +name: h100_8k1k_ctx2dep16_gen1dep16_batch8_eplb0_mtp3 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 2 + prefill_nodes: 4 + decode_workers: 1 + decode_nodes: 2 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + 
enable_chunked_prefill: true + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "48" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen2_tep16_batch64_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen2_tep16_batch64_eplb0_mtp0.yaml new file mode 100644 index 000000000..0bf877f96 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen2_tep16_batch64_eplb0_mtp0.yaml @@ -0,0 +1,110 @@ + + +name: "h100_8k1k_ctx1dep16_gen2tep16_batch64_eplb0_mtp0" + +model: + path: "DeepSeek-R1-0528" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: "fp8" + +resources: + gpu_type: "h100" + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 2 + decode_nodes: 4 + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: "1" + TRTLLM_FORCE_ALLTOALL_METHOD: "DeepEP" + + decode_environment: + NCCL_NVLS_ENABLE: "0" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + 
max_num_tokens: 16384 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 64 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64] + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "48" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # Multiple frontends collide on port 8080 (and other ports). 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..b68e4f1a5 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch1_eplb0_mtp0.yaml @@ -0,0 +1,100 @@ +name: h100_8k1k_ctx1dep16_gen3tep16_batch1_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + 
enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 1 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml new file mode 100644 index 000000000..06b713a32 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch2_eplb0_mtp0.yaml @@ -0,0 +1,110 @@ + + +name: "h100_8k1k_ctx1dep16_gen3tep16_batch2_eplb0_mtp0" + +model: + path: "DeepSeek-R1-0528" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: "fp8" + +resources: + gpu_type: "h100" + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + 
TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: "1" + TRTLLM_FORCE_ALLTOALL_METHOD: "DeepEP" + + decode_environment: + NCCL_NVLS_ENABLE: "0" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 2 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4] + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # There are errors about colliding on port 8080, and others. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml new file mode 100644 index 000000000..030c98654 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx1_gen3_tep16_batch8_eplb0_mtp0.yaml @@ -0,0 +1,110 @@ + + +name: "h100_8k1k_ctx1dep16_gen3tep16_batch8_eplb0_mtp0" + +model: + path: "DeepSeek-R1-0528" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: "fp8" + +resources: + gpu_type: "h100" + prefill_workers: 1 + prefill_nodes: 2 + decode_workers: 3 + decode_nodes: 6 + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: "1" + TRTLLM_FORCE_ALLTOALL_METHOD: "DeepEP" + + decode_environment: + NCCL_NVLS_ENABLE: "0" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + UCX_CUDA_IPC_ENABLE_MNNVL: "n" + + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: true + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: false + 
enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 8 + max_num_tokens: 256 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8] + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 + + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # There are errors about colliding on port 8080, and others. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx2_gen1_dep16_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx2_gen1_dep16_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..1f882bc75 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h100-fp8/8k1k/disagg/stp/ctx2_gen1_dep16_batch16_eplb0_mtp0.yaml @@ -0,0 +1,103 @@ +name: h100_8k1k_ctx2dep16_gen1dep16_batch16_eplb0_mtp0 +model: + path: DeepSeek-R1-0528 + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3" + precision: fp8 +resources: + gpu_type: h100 + prefill_workers: 2 + prefill_nodes: 4 + decode_workers: 1 + decode_nodes: 2 + gpus_per_node: 8 +backend: + type: trtllm + prefill_environment: + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + decode_environment: + NCCL_NVLS_ENABLE: '0' + UCX_CUDA_IPC_ENABLE_MNNVL: n + TRTLLM_ENABLE_PDL: '1' + TRTLLM_SERVER_DISABLE_GC: '1' + TRTLLM_WORKER_DISABLE_GC: '1' + NCCL_GRAPH_MIXING_SUPPORT: '0' + TLLM_LOG_LEVEL: INFO + TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: '1' + TRTLLM_FORCE_ALLTOALL_METHOD: DeepEP + trtllm_config: + prefill: + max_batch_size: 1 + max_num_tokens: 8224 + max_seq_len: 8232 + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + enable_attention_dp: true + pipeline_parallel_size: 1 + print_iter_log: true + cuda_graph_config: null + disable_overlap_scheduler: true + enable_chunked_prefill: false + moe_config: + backend: WIDEEP + max_num_tokens: 16384 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.3 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8256 + decode: + tensor_parallel_size: 16 + 
moe_expert_parallel_size: 16 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + pipeline_parallel_size: 1 + max_batch_size: 16 + max_num_tokens: 128 + max_seq_len: 9256 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + print_iter_log: true + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + moe_config: + backend: WIDEEP + use_low_precision_moe_combine: true + cache_transceiver_config: + max_tokens_in_buffer: 8256 + backend: UCX + stream_interval: 100 + num_postprocess_workers: 4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "16" + DECODE_GPUS: "16" + TOTAL_GPUS: "48" +frontend: + type: dynamo + enable_multiple_frontends: false +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c128_ctx1_gen7_dep8_batch128_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c128_ctx1_gen7_dep8_batch128_eplb0_mtp3.yaml new file mode 100644 index 000000000..230e3a281 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c128_ctx1_gen7_dep8_batch128_eplb0_mtp3.yaml @@ -0,0 +1,113 @@ +name: "c128_ctx1_gen7_dep8_batch128_eplb0_mtp3" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + decode_workers: 7 + decode_nodes: 7 + gpus_per_node: 8 + +backend: + type: trtllm + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + 
TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + trtllm_config: + prefill: + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 8 + max_num_tokens: 8192 + max_seq_len: 1064 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + enable_chunked_prefill: false + max_batch_size: 128 + max_num_tokens: 512 + max_seq_len: 2088 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1,2,4,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128] + disable_overlap_scheduler: false + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "64"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c16_ctx1_gen9_tep8_batch128_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c16_ctx1_gen9_tep8_batch128_eplb0_mtp3.yaml
new file mode 100644
index 000000000..b66e9d91a
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c16_ctx1_gen9_tep8_batch128_eplb0_mtp3.yaml
@@ -0,0 +1,143 @@
+name: "c16_ctx1_gen9_tep8_batch128_eplb0_mtp3"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+
+  decode_workers: 9
+  decode_nodes: 9
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (MTP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (MTP c=16)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      enable_chunked_prefill: false
+      max_batch_size: 128
+      max_num_tokens: 512
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "80"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false  # For some reason, the H200 cluster doesn't like nginx.
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c1_ctx1_gen11_tep8_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c1_ctx1_gen11_tep8_batch1_eplb0_mtp3.yaml
new file mode 100644
index 000000000..246c12a61
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c1_ctx1_gen11_tep8_batch1_eplb0_mtp3.yaml
@@ -0,0 +1,123 @@
+name: "c1_ctx1_gen11_tep8_batch1_eplb0_mtp3"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+
+  decode_workers: 11
+  decode_nodes: 11
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (MTP mode, aggressive ctx:gen 1:11 for c=1)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (MTP c=1, TEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      enable_chunked_prefill: false
+      max_batch_size: 1
+      max_num_tokens: 4
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "96" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c256_ctx1_gen4_dep8_batch128_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c256_ctx1_gen4_dep8_batch128_eplb0_mtp3.yaml new file mode 100644 index 000000000..84c66f292 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c256_ctx1_gen4_dep8_batch128_eplb0_mtp3.yaml @@ -0,0 +1,113 @@ +name: "c256_ctx1_gen4_dep8_batch128_eplb0_mtp3" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + decode_workers: 4 + decode_nodes: 4 + gpus_per_node: 8 + +backend: + type: trtllm + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + trtllm_config: + prefill: + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 8 + max_num_tokens: 8192 + max_seq_len: 1064 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + 
moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + enable_chunked_prefill: false + max_batch_size: 128 + max_num_tokens: 512 + max_seq_len: 2088 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1,2,4,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128] + disable_overlap_scheduler: false + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c32_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c32_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml new file mode 100644 index 000000000..898b6b248 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c32_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml @@ -0,0 +1,113 @@ +name: "c32_ctx1_gen11_tep8_batch128_eplb0_mtp3" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + decode_workers: 11 + decode_nodes: 11 + gpus_per_node: 8 + +backend: + type: trtllm + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + trtllm_config: + prefill: + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 8 + max_num_tokens: 8192 + max_seq_len: 1064 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + 
moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + enable_chunked_prefill: false + max_batch_size: 128 + max_num_tokens: 512 + max_seq_len: 2088 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1,2,4,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128] + disable_overlap_scheduler: false + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "96" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c4_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c4_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml new file mode 100644 index 000000000..ff64103a1 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c4_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml @@ -0,0 +1,141 @@ +name: "c4_ctx1_gen11_tep8_batch128_eplb0_mtp3" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 11 + decode_nodes: 11 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode, aggressive ctx:gen 1:11 for c=4) + # ISL/OSL: 1k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 8 + max_num_tokens: 8192 + max_seq_len: 1064 + kv_cache_config: + enable_block_reuse: 
false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=4, TEP mode) + # ISL/OSL: 1k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + enable_chunked_prefill: false + max_batch_size: 128 + max_num_tokens: 512 + max_seq_len: 2088 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + disable_overlap_scheduler: false + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "96"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c512_ctx1_gen2_dep8_batch256_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c512_ctx1_gen2_dep8_batch256_eplb0_mtp1.yaml
new file mode 100644
index 000000000..04d320697
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c512_ctx1_gen2_dep8_batch256_eplb0_mtp1.yaml
@@ -0,0 +1,159 @@
+name: "c512_ctx1_gen2_dep8_batch256_eplb0_mtp1"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+
+  decode_workers: 2
+  decode_nodes: 2
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (MTP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 1
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (MTP c=512)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      enable_chunked_prefill: false
+      max_batch_size: 256
+      max_num_tokens: 512
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.85
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+          - 136
+          - 144
+          - 152
+          - 160
+          - 168
+          - 176
+          - 184
+          - 192
+          - 200
+          - 208
+          - 216
+          - 224
+          - 232
+          - 240
+          - 248
+          - 256
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 1
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "24"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false  # For some reason, the H200 cluster doesn't like nginx.
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c64_ctx1_gen8_dep8_batch128_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c64_ctx1_gen8_dep8_batch128_eplb0_mtp3.yaml
new file mode 100644
index 000000000..af18c65d3
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c64_ctx1_gen8_dep8_batch128_eplb0_mtp3.yaml
@@ -0,0 +1,143 @@
+name: "c64_ctx1_gen8_dep8_batch128_eplb0_mtp3"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+
+  decode_workers: 8
+
+  decode_nodes: 8
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (MTP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (MTP c=64)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      enable_chunked_prefill: false
+      max_batch_size: 128
+      max_num_tokens: 512
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.85
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "72"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false  # For some reason, the H200 cluster doesn't like nginx.
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c8_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c8_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml new file mode 100644 index 000000000..f0e0f9a58 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/mtp/c8_ctx1_gen11_tep8_batch128_eplb0_mtp3.yaml @@ -0,0 +1,113 @@ +name: "c8_ctx1_gen11_tep8_batch128_eplb0_mtp3" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + decode_workers: 11 + decode_nodes: 11 + gpus_per_node: 8 + +backend: + type: trtllm + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + trtllm_config: + prefill: + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 8 + max_num_tokens: 8192 + max_seq_len: 1064 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + backend: pytorch + trust_remote_code: true + 
tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + enable_chunked_prefill: false + max_batch_size: 128 + max_num_tokens: 512 + max_seq_len: 2088 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1,2,4,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128] + disable_overlap_scheduler: false + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "96" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c128_ctx1_gen9_dep8_batch512_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c128_ctx1_gen9_dep8_batch512_eplb0_mtp0.yaml new file mode 100644 index 000000000..eaa74f374 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c128_ctx1_gen9_dep8_batch512_eplb0_mtp0.yaml @@ -0,0 +1,188 @@ +name: "c128_ctx1_gen9_dep8_batch512_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + 
+  prefill_workers: 1
+
+  decode_workers: 9
+  decode_nodes: 9
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (DEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      # Matches E2E standalone ctx_config.yaml
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (DEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      # Matches E2E standalone gen_config.yaml (DEP c=128)
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      enable_chunked_prefill: false
+      max_batch_size: 512
+      max_num_tokens: 512
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+          - 136
+          - 144
+          - 152
+          - 160
+          - 168
+          - 176
+          - 184
+          - 192
+          - 200
+          - 208
+          - 216
+          - 224
+          - 232
+          - 240
+          - 248
+          - 256
+          - 264
+          - 272
+          - 280
+          - 288
+          - 296
+          - 304
+          - 312
+          - 320
+          - 328
+          - 336
+          - 344
+          - 352
+          - 360
+          - 368
+          - 376
+          - 384
+          - 392
+          - 400
+          - 408
+          - 416
+          - 424
+          - 432
+          - 440
+          - 448
+          - 456
+          - 464
+          - 472
+          - 480
+          - 488
+          - 496
+          - 504
+          - 512
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "80"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx.
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c16_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c16_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml
new file mode 100644
index 000000000..03de93867
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c16_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml
@@ -0,0 +1,153 @@
+name: "c16_ctx1_gen9_tep8_batch256_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+
+  decode_workers: 9
+  decode_nodes: 9
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (TEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (TEP c=16)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      enable_chunked_prefill: false
+      max_batch_size: 256
+      max_num_tokens: 256
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+          - 136
+          - 144
+          - 152
+          - 160
+          - 168
+          - 176
+          - 184
+          - 192
+          - 200
+          - 208
+          - 216
+          - 224
+          - 232
+          - 240
+          - 248
+          - 256
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "80"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx.
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c1_ctx1_gen9_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c1_ctx1_gen9_tep8_batch1_eplb0_mtp0.yaml
new file mode 100644
index 000000000..0f29aab2f
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c1_ctx1_gen9_tep8_batch1_eplb0_mtp0.yaml
@@ -0,0 +1,119 @@
+name: "c1_ctx1_gen9_tep8_batch1_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+
+  decode_workers: 9
+  decode_nodes: 9
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (TEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (TEP c=1)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      enable_chunked_prefill: false
+      max_batch_size: 1
+      max_num_tokens: 1
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "80"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx.
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c256_ctx1_gen6_dep8_batch512_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c256_ctx1_gen6_dep8_batch512_eplb0_mtp0.yaml
new file mode 100644
index 000000000..4393dacf8
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c256_ctx1_gen6_dep8_batch512_eplb0_mtp0.yaml
@@ -0,0 +1,107 @@
+name: "c256_ctx1_gen6_dep8_batch512_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_workers: 6
+  decode_nodes: 6
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+  trtllm_config:
+    prefill:
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+    decode:
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      enable_chunked_prefill: false
+      max_batch_size: 512
+      max_num_tokens: 512
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes: [1,2,4,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168,176,184,192,200,208,216,224,232,240,248,256,264,272,280,288,296,304,312,320,328,336,344,352,360,368,376,384,392,400,408,416,424,432,440,448,456,464,472,480,488,496,504,512]
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "56"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c32_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c32_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml
new file mode 100644
index 000000000..9b2d8fbf5
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c32_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml
@@ -0,0 +1,153 @@
+name: "c32_ctx1_gen9_tep8_batch256_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+
+  decode_workers: 9
+  decode_nodes: 9
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (TEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (TEP c=32)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      enable_chunked_prefill: false
+      max_batch_size: 256
+      max_num_tokens: 256
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+          - 136
+          - 144
+          - 152
+          - 160
+          - 168
+          - 176
+          - 184
+          - 192
+          - 200
+          - 208
+          - 216
+          - 224
+          - 232
+          - 240
+          - 248
+          - 256
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "80"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx.
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c4_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c4_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml
new file mode 100644
index 000000000..ee3a951cf
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c4_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml
@@ -0,0 +1,153 @@
+name: "c4_ctx1_gen9_tep8_batch256_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+
+  decode_workers: 9
+  decode_nodes: 9
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (TEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (TEP c=4)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      enable_chunked_prefill: false
+      max_batch_size: 256
+      max_num_tokens: 256
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+          - 136
+          - 144
+          - 152
+          - 160
+          - 168
+          - 176
+          - 184
+          - 192
+          - 200
+          - 208
+          - 216
+          - 224
+          - 232
+          - 240
+          - 248
+          - 256
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "80"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx.
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c512_ctx2_gen7_dep8_batch512_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c512_ctx2_gen7_dep8_batch512_eplb0_mtp0.yaml
new file mode 100644
index 000000000..6356363ac
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c512_ctx2_gen7_dep8_batch512_eplb0_mtp0.yaml
@@ -0,0 +1,188 @@
+name: "c512_ctx2_gen7_dep8_batch512_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 2
+  prefill_workers: 2
+
+  decode_workers: 7
+  decode_nodes: 7
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (DEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      # Matches E2E standalone ctx_config.yaml
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
+      max_num_tokens: 8192
+      max_seq_len: 1064
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+      cuda_graph_config: null
+      disable_overlap_scheduler: true
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+
+    decode:
+      # Decode Worker Config for Dynamo DSR1 (DEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      # Matches E2E standalone gen_config.yaml (DEP c=512)
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      enable_chunked_prefill: false
+      max_batch_size: 512
+      max_num_tokens: 512
+      max_seq_len: 2088
+      kv_cache_config:
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+        dtype: fp8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 8192
+      moe_config:
+        backend: CUTLASS
+        use_low_precision_moe_combine: true
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+          - 136
+          - 144
+          - 152
+          - 160
+          - 168
+          - 176
+          - 184
+          - 192
+          - 200
+          - 208
+          - 216
+          - 224
+          - 232
+          - 240
+          - 248
+          - 256
+          - 264
+          - 272
+          - 280
+          - 288
+          - 296
+          - 304
+          - 312
+          - 320
+          - 328
+          - 336
+          - 344
+          - 352
+          - 360
+          - 368
+          - 376
+          - 384
+          - 392
+          - 400
+          - 408
+          - 416
+          - 424
+          - 432
+          - 440
+          - 448
+          - 456
+          - 464
+          - 472
+          - 480
+          - 488
+          - 496
+          - 504
+          - 512
+      disable_overlap_scheduler: false
+      print_iter_log: true
+      # Performance tuning
+      stream_interval: 100
+      num_postprocess_workers: 4
+
+# Bench client lives in this repo; mounted into the bench container at
+# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract.
+container_mounts:
+  "$INFMAX_WORKSPACE": "/infmax-workspace"
+
+benchmark:
+  type: "custom"
+  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
+  env:
+    PREFILL_GPUS: "8"
+    DECODE_GPUS: "8"
+    TOTAL_GPUS: "72"
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx.
+
+dynamo:
+  install: false
diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c64_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c64_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml
new file mode 100644
index 000000000..ce67bee55
--- /dev/null
+++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c64_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml
@@ -0,0 +1,153 @@
+name: "c64_ctx1_gen9_tep8_batch256_eplb0_mtp0"
+
+model:
+  path: "dsr1"
+  container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
+  precision: "fp8"
+
+sbatch_directives:
+  cpus-per-gpu: "16"
+
+resources:
+  gpu_type: "h200"
+  prefill_nodes: 1
+  prefill_workers: 1
+
+  decode_workers: 9
+  decode_nodes: 9
+
+  gpus_per_node: 8
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  decode_environment:
+    UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+
+  trtllm_config:
+    prefill:
+      # Prefill Worker Config for Dynamo DSR1 (TEP mode)
+      # ISL/OSL: 1k/1k, TP=8 on H200
+      backend: pytorch
+      trust_remote_code: true
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_chunked_prefill: false
+      max_batch_size: 8
max_num_tokens: 8192 + max_seq_len: 1064 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + decode: + # Decode Worker Config for Dynamo DSR1 (TEP c=16) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + enable_chunked_prefill: false + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2088 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "80" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c8_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c8_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml new file mode 100644 index 000000000..a5522bdad --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/1k1k/disagg/stp/c8_ctx1_gen9_tep8_batch256_eplb0_mtp0.yaml @@ -0,0 +1,153 @@ +name: "c8_ctx1_gen9_tep8_batch256_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 9 + decode_nodes: 9 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (TEP mode) + # ISL/OSL: 1k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 8 + max_num_tokens: 8192 + max_seq_len: 1064 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + decode: + # 
Decode Worker Config for Dynamo DSR1 (TEP c=8) + # ISL/OSL: 1k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + enable_chunked_prefill: false + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2088 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 8192 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "80" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c128_ctx2_gen1_dep8_batch32_eplb0_mtp2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c128_ctx2_gen1_dep8_batch32_eplb0_mtp2.yaml new file mode 100644 index 000000000..1ad52f9f3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c128_ctx2_gen1_dep8_batch32_eplb0_mtp2.yaml @@ -0,0 +1,123 @@ +name: "c128_ctx2_gen1_dep8_batch32_eplb0_mtp2" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 2 + prefill_workers: 2 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=128) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp2.yaml new file mode 100644 index 000000000..23ad0751a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp2.yaml @@ -0,0 +1,123 @@ +name: "c16_ctx1_gen3_tep8_batch32_eplb0_mtp2" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 3 + decode_nodes: 3 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=16) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp3.yaml new file mode 100644 index 000000000..4649032a7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp3.yaml @@ -0,0 +1,123 @@ +name: "c1_ctx1_gen7_tep8_batch1_eplb0_mtp3" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 7 + decode_nodes: 7 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + 
decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=1) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c256_ctx3_gen1_dep8_batch32_eplb0_mtp2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c256_ctx3_gen1_dep8_batch32_eplb0_mtp2.yaml new file mode 100644 index 000000000..92ed944df --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c256_ctx3_gen1_dep8_batch32_eplb0_mtp2.yaml @@ -0,0 +1,123 @@ +name: "c256_ctx3_gen1_dep8_batch32_eplb0_mtp2" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 3 + prefill_workers: 3 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=256) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c32_ctx3_gen5_tep8_batch32_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c32_ctx3_gen5_tep8_batch32_eplb0_mtp3.yaml new file mode 100644 index 000000000..01616d163 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c32_ctx3_gen5_tep8_batch32_eplb0_mtp3.yaml @@ -0,0 +1,123 @@ +name: "c32_ctx3_gen5_tep8_batch32_eplb0_mtp3" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 3 + prefill_workers: 3 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=32) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp3.yaml new file mode 100644 index 000000000..78cc69344 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp3.yaml @@ -0,0 +1,123 @@ +name: "c4_ctx1_gen7_tep8_batch32_eplb0_mtp3" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 7 + decode_nodes: 7 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=4) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c512_ctx3_gen1_dep8_batch64_eplb0_mtp1.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c512_ctx3_gen1_dep8_batch64_eplb0_mtp1.yaml new file mode 100644 index 000000000..607011f5c --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c512_ctx3_gen1_dep8_batch64_eplb0_mtp1.yaml @@ -0,0 +1,123 @@ +name: "c512_ctx3_gen1_dep8_batch64_eplb0_mtp1" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 3 + prefill_workers: 3 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=512) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 64 + max_num_tokens: 256 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32, 64] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c64_ctx1_gen1_dep8_batch32_eplb0_mtp2.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c64_ctx1_gen1_dep8_batch32_eplb0_mtp2.yaml new file mode 100644 index 000000000..02db00cb0 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c64_ctx1_gen1_dep8_batch32_eplb0_mtp2.yaml @@ -0,0 +1,123 @@ +name: "c64_ctx1_gen1_dep8_batch32_eplb0_mtp2" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=64) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c8_ctx1_gen6_tep8_batch32_eplb0_mtp3.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c8_ctx1_gen6_tep8_batch32_eplb0_mtp3.yaml new file mode 100644 index 000000000..89cefb58e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/mtp/c8_ctx1_gen6_tep8_batch32_eplb0_mtp3.yaml @@ -0,0 +1,123 @@ +name: "c8_ctx1_gen6_tep8_batch32_eplb0_mtp3" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 6 + decode_nodes: 6 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (MTP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + decode: + # Decode Worker Config for Dynamo DSR1 (MTP c=8) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c128_ctx1_gen1_dep8_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c128_ctx1_gen1_dep8_batch256_eplb0_mtp0.yaml new file mode 100644 index 000000000..6f9e2c92e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c128_ctx1_gen1_dep8_batch256_eplb0_mtp0.yaml @@ -0,0 +1,120 @@ +name: "c128_ctx1_gen1_dep8_batch256_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (DEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + # Matches E2E standalone ctx_config.yaml + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + 
stream_interval: 100 + num_postprocess_workers: 4 + + decode: + # Decode Worker Config for Dynamo DSR1 (DEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + # Matches E2E standalone gen_config.yaml (DEP c=128) + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..a7cc5137e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c16_ctx1_gen3_tep8_batch32_eplb0_mtp0.yaml @@ -0,0 +1,117 @@ +name: "c16_ctx1_gen3_tep8_batch32_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 3 + decode_nodes: 3 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (TEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + decode: + # 
Decode Worker Config for Dynamo DSR1 (TEP c=16) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp0.yaml new file mode 100644 index 000000000..82064a374 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c1_ctx1_gen7_tep8_batch1_eplb0_mtp0.yaml @@ -0,0 +1,117 @@ +name: "c1_ctx1_gen7_tep8_batch1_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 7 + decode_nodes: 7 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (TEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + decode: + # Decode 
Worker Config for Dynamo DSR1 (TEP c=1) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c256_ctx5_gen3_dep8_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c256_ctx5_gen3_dep8_batch256_eplb0_mtp0.yaml new file mode 100644 index 000000000..da13164cd --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c256_ctx5_gen3_dep8_batch256_eplb0_mtp0.yaml @@ -0,0 +1,117 @@ +name: "c256_ctx5_gen3_dep8_batch256_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 5 + prefill_workers: 5 + + decode_workers: 3 + decode_nodes: 3 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (DEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
decode: + # Decode Worker Config for Dynamo DSR1 (DEP c=256) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c32_ctx2_gen5_tep8_batch128_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c32_ctx2_gen5_tep8_batch128_eplb0_mtp0.yaml new file mode 100644 index 000000000..38d63593a --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c32_ctx2_gen5_tep8_batch128_eplb0_mtp0.yaml @@ -0,0 +1,117 @@ +name: "c32_ctx2_gen5_tep8_batch128_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 2 + prefill_workers: 2 + + decode_workers: 5 + decode_nodes: 5 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (TEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + decode: + 
# Decode Worker Config for Dynamo DSR1 (TEP c=32) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..19ba51ba6 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c4_ctx1_gen7_tep8_batch32_eplb0_mtp0.yaml @@ -0,0 +1,117 @@ +name: "c4_ctx1_gen7_tep8_batch32_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 7 + decode_nodes: 7 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (TEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + decode: + # 
Decode Worker Config for Dynamo DSR1 (TEP c=4) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "64" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c512_ctx3_gen1_dep8_batch512_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c512_ctx3_gen1_dep8_batch512_eplb0_mtp0.yaml new file mode 100644 index 000000000..3b35f1299 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c512_ctx3_gen1_dep8_batch512_eplb0_mtp0.yaml @@ -0,0 +1,117 @@ +name: "c512_ctx3_gen1_dep8_batch512_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 3 + prefill_workers: 3 + + decode_workers: 1 + decode_nodes: 1 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (DEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + 
decode: + # Decode Worker Config for Dynamo DSR1 (DEP c=512) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 512 + max_num_tokens: 512 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c64_ctx2_gen3_dep8_batch128_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c64_ctx2_gen3_dep8_batch128_eplb0_mtp0.yaml new file mode 100644 index 000000000..531f573f3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c64_ctx2_gen3_dep8_batch128_eplb0_mtp0.yaml @@ -0,0 +1,117 @@ +name: "c64_ctx2_gen3_dep8_batch128_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 2 + prefill_workers: 2 + + decode_workers: 3 + decode_nodes: 3 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (DEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + decode: + 
# Decode Worker Config for Dynamo DSR1 (DEP c=64) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_chunked_prefill: false + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. 
+ +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c8_ctx1_gen6_tep8_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c8_ctx1_gen6_tep8_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..c8a885d95 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/h200-fp8/8k1k/disagg/stp/c8_ctx1_gen6_tep8_batch16_eplb0_mtp0.yaml @@ -0,0 +1,117 @@ +name: "c8_ctx1_gen6_tep8_batch16_eplb0_mtp0" + +model: + path: "dsr1" + container: "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1" + precision: "fp8" + +sbatch_directives: + cpus-per-gpu: "16" + +resources: + gpu_type: "h200" + prefill_nodes: 1 + prefill_workers: 1 + + decode_workers: 6 + decode_nodes: 6 + + gpus_per_node: 8 + +backend: + type: trtllm + + prefill_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + decode_environment: + UCX_TLS: "rc,dc,ud,cuda_copy,cuda_ipc,gdr_copy,tcp" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + NCCL_GRAPH_MIXING_SUPPORT: "0" + + trtllm_config: + prefill: + # Prefill Worker Config for Dynamo DSR1 (TEP mode) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 32768 + moe_config: + backend: CUTLASS + cuda_graph_config: null + disable_overlap_scheduler: true + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + decode: + # 
Decode Worker Config for Dynamo DSR1 (TEP c=8) + # ISL/OSL: 8k/1k, TP=8 on H200 + backend: pytorch + trust_remote_code: true + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_chunked_prefill: false + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + kv_cache_config: + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + dtype: fp8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + moe_config: + backend: CUTLASS + use_low_precision_moe_combine: true + cuda_graph_config: + enable_padding: true + batch_sizes: [1, 2, 4, 8, 16] + disable_overlap_scheduler: false + print_iter_log: true + # Performance tuning + stream_interval: 100 + num_postprocess_workers: 4 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "56" + +frontend: + type: "dynamo" + enable_multiple_frontends: false # For some reason, the H200 cluster doesn't like nginx. + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p1d-dep8-tep8.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p1d-dep8-tep8.yaml new file mode 100644 index 000000000..b6eca4631 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p1d-dep8-tep8.yaml @@ -0,0 +1,154 @@ +name: "svf-vllm-disagg-gb200-mid-curve" + +# Mirrored from NVIDIA/srt-slurm aflowers/vllm-gb200-v0.20.0 branch: +# recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-mid-curve.yaml +# +# Topology: 1 prefill (DEP=8) + 1 decode (DEP=8). 
5 nodes total with a +# dedicated NATS/etcd infra node. Mid-curve point at concurrency 256. +# +# Local deltas vs upstream: +# * model.path alias renamed deepseekv4-fp4 -> deepseek-v4-pro to match +# SRT_SLURM_MODEL_PREFIX in runners/launch_gb200-nv.sh. +# * model.container set to vllm/vllm-openai:v0.20.0-ubuntu2404 to +# match nvidia-master.yaml image (which the launch script registers as +# the alias key in srtslurm.yaml). Upstream variants ship either the +# non-dynamo floating tag or a sha256 pin. +# * slurm.time_limit + health_check set to 8h / 1440 attempts to +# absorb cold-cache /mnt/numa1 model loads. +model: + path: "deepseek-v4-pro" + container: "vllm/vllm-openai:v0.20.0-ubuntu2404" + precision: "fp4" + +dynamo: + install: true + wheel: "1.2.0.dev20260426" + +setup_script: vllm-container-deps.sh + +slurm: + time_limit: "8:00:00" + +health_check: + max_attempts: 1440 + interval_seconds: 10 +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 2 + decode_nodes: 2 + prefill_workers: 1 + decode_workers: 1 + gpus_per_prefill: 8 + gpus_per_decode: 8 + +infra: + etcd_nats_dedicated_node: true + +frontend: + type: dynamo + enable_multiple_frontends: false +backend: + type: vllm + connector: null + prefill_environment: + TILELANG_CLEANUP_TEMP_FILES: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + VLLM_SERVER_DEV_MODE: "1" + VLLM_SPARSE_INDEXER_MAX_LOGITS_MB: "1024" + VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE: "2048" + # VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1" + # VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random" + UCX_MEMTYPE_CACHE: "n" + UCX_MEMTYPE_REG_WHOLE: "n" + UCX_TLS: "cuda_copy,cuda_ipc,tcp" + UCX_CUDA_IPC_ENABLE_MNNVL: "y" + NCCL_P2P_LEVEL: NVL + decode_environment: + TILELANG_CLEANUP_TEMP_FILES: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + VLLM_SERVER_DEV_MODE: "1" + # VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1" + # 
VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random" + UCX_MEMTYPE_CACHE: "n" + UCX_MEMTYPE_REG_WHOLE: "n" + UCX_TLS: "cuda_copy,cuda_ipc,tcp" + UCX_CUDA_IPC_ENABLE_MNNVL: "y" + NCCL_P2P_LEVEL: NVL + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 8 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + enforce-eager: true + max-model-len: 9280 + max-num-seqs: 16 + max-num-batched-tokens: 32768 + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-flashinfer-autotune: true + no-async-scheduling: true + block-size: 256 + gpu-memory-utilization: 0.8 + no-disable-hybrid-kv-cache-manager: true + enable-sleep-mode: true + numa-bind: true + offload-group-size: 3 + offload-num-in-group: 1 + offload-prefetch-step: 2 + # offload-params: "w13_weight w2_weight w13_weight_scale w2_weight_scale wq_b wo_a wo_b shared_experts" + tokenizer-mode: deepseek_v4 + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 8 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 9280 + max-num-seqs: 128 + max-cudagraph-capture-size: 128 + max-num-batched-tokens: 128 + trust-remote-code: true + no-enable-prefix-caching: true + block-size: 256 + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + no-disable-hybrid-kv-cache-manager: true + enable-sleep-mode: true + tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
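+# Worked example (per the CONFIGS.md env contract): PREFILL_GPUS/DECODE_GPUS
+# below are per-worker counts and TOTAL_GPUS is the sum across workers, all
+# used only as result-filename components:
+#   TOTAL_GPUS = prefill_workers*gpus_per_prefill + decode_workers*gpus_per_decode
+#              = 1*8 + 1*8 = 16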
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-V4-Pro" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + DSV4: "true" + +identity: + container: + image: "vllm/vllm-openai:v0.20.0-ubuntu2404" + frameworks: + dynamo: "1.2.0.dev20260426" + vllm: "0.20.0" diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-1p1d-dep8-tep8.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-1p1d-dep8-tep8.yaml new file mode 100644 index 000000000..2f0fa98e6 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-1p1d-dep8-tep8.yaml @@ -0,0 +1,150 @@ +name: "dsv4-vllm-disagg-gb200-2p1d-dep8-dep8-offload" + +# Mirrored from NVIDIA/srt-slurm aflowers/gb200-dsv4-recipes branch (PR #77): +# recipes/vllm/deepseek-v4-pro-sa/8k1k/disagg-gb200-2p1d-dep8-dep8-c4096-offload.yaml +# +# Topology: 2 prefill (DEP=8 each) + 1 decode (DEP=8). 6 nodes. +# c4096-tuned variant (decode max-num-seqs=512). +# +# Local deltas vs upstream: +# * model.path alias renamed deepseekv4-fp4 -> deepseek-v4-pro to match +# SRT_SLURM_MODEL_PREFIX in runners/launch_gb200-nv.sh. +# * model.container set to vllm/vllm-openai:v0.20.0-ubuntu2404 to +# match nvidia-master.yaml image (which the launch script registers as +# the alias key in srtslurm.yaml). Upstream variants ship either the +# non-dynamo floating tag or a sha256 pin. +# * slurm.time_limit + health_check set to 8h / 1440 attempts to +# absorb cold-cache /mnt/numa1 model loads. 
+model: + path: "deepseek-v4-pro" + container: "vllm/vllm-openai:v0.20.0-ubuntu2404" + precision: "fp4" + +dynamo: + install: true + wheel: "1.2.0.dev20260426" + +setup_script: vllm-container-deps.sh + +slurm: + time_limit: "8:00:00" + +health_check: + max_attempts: 1440 + interval_seconds: 10 +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 4 + decode_nodes: 2 + prefill_workers: 2 + decode_workers: 1 + gpus_per_prefill: 8 + gpus_per_decode: 8 +frontend: + type: dynamo + enable_multiple_frontends: false +backend: + type: vllm + connector: null + prefill_environment: + TILELANG_CLEANUP_TEMP_FILES: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + VLLM_SERVER_DEV_MODE: "1" + VLLM_SPARSE_INDEXER_MAX_LOGITS_MB: "1024" + VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE: "2048" + # VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1" + # VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random" + UCX_MEMTYPE_CACHE: "n" + UCX_MEMTYPE_REG_WHOLE: "n" + UCX_TLS: "cuda_copy,cuda_ipc,tcp" + UCX_CUDA_IPC_ENABLE_MNNVL: "y" + NCCL_P2P_LEVEL: NVL + decode_environment: + TILELANG_CLEANUP_TEMP_FILES: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + VLLM_SERVER_DEV_MODE: "1" + # VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1" + # VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random" + UCX_MEMTYPE_CACHE: "n" + UCX_MEMTYPE_REG_WHOLE: "n" + UCX_TLS: "cuda_copy,cuda_ipc,tcp" + UCX_CUDA_IPC_ENABLE_MNNVL: "y" + NCCL_P2P_LEVEL: NVL + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 8 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + enforce-eager: true + max-model-len: 16384 + max-num-seqs: 16 + max-num-batched-tokens: 32768 + trust-remote-code: true + no-enable-prefix-caching: 
true + no-enable-flashinfer-autotune: true + no-async-scheduling: true + block-size: 256 + gpu-memory-utilization: 0.8 + no-disable-hybrid-kv-cache-manager: true + enable-sleep-mode: true + numa-bind: true + offload-group-size: 3 + offload-num-in-group: 1 + offload-prefetch-step: 2 + # offload-params: "w13_weight w2_weight w13_weight_scale w2_weight_scale wq_b wo_a wo_b shared_experts" + tokenizer-mode: deepseek_v4 + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 8 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 16384 + max-num-seqs: 512 + max-cudagraph-capture-size: 512 + max-num-batched-tokens: 512 + trust-remote-code: true + no-enable-prefix-caching: true + block-size: 256 + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + no-disable-hybrid-kv-cache-manager: true + enable-sleep-mode: true + tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-V4-Pro" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + DSV4: "true" + +identity: + container: + image: "vllm/vllm-openai:v0.20.0-ubuntu2404" + frameworks: + dynamo: "1.2.0.dev20260426" + vllm: "0.20.0" diff --git a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-2p1d-dep8-dep8-c4096-offload.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-2p1d-dep8-dep8-c4096-offload.yaml similarity index 90% rename from benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-2p1d-dep8-dep8-c4096-offload.yaml rename to benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-2p1d-dep8-dep8-c4096-offload.yaml index 9848edb01..2f0fa98e6 100644 --- a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-2p1d-dep8-dep8-c4096-offload.yaml +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-2p1d-dep8-dep8-c4096-offload.yaml @@ -127,14 +127,20 @@ backend: no-disable-hybrid-kv-cache-manager: true enable-sleep-mode: true tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
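+# MODEL_NAME below is the rare override described in CONFIGS.md: the workers
+# register under served-model-name "deepseek-ai/DeepSeek-V4-Pro", which
+# diverges from the master-yaml `model:` alias, so (assuming srt_bench.sh
+# targets the served name) the bench client is told it explicitly.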
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + benchmark: - type: "sa-bench" - isl: 8192 - osl: 1024 - concurrencies: "4096" - req_rate: "inf" - use_chat_template: true - custom_tokenizer: "sa_bench_tokenizers.vllm_deepseek_v4.VLLMDeepseekV4Tokenizer" + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-V4-Pro" + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "24" + DSV4: "true" identity: container: diff --git a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-3p1d-dep8-dep16-c4096-offload.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep8-dep16-c4096-offload.yaml similarity index 90% rename from benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-3p1d-dep8-dep16-c4096-offload.yaml rename to benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep8-dep16-c4096-offload.yaml index 3f3803d3b..85ff907e3 100644 --- a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-3p1d-dep8-dep16-c4096-offload.yaml +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep8-dep16-c4096-offload.yaml @@ -127,14 +127,20 @@ backend: no-disable-hybrid-kv-cache-manager: true enable-sleep-mode: true tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + benchmark: - type: "sa-bench" - isl: 8192 - osl: 1024 - concurrencies: "4096" - req_rate: "inf" - use_chat_template: true - custom_tokenizer: "sa_bench_tokenizers.vllm_deepseek_v4.VLLMDeepseekV4Tokenizer" + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-V4-Pro" + PREFILL_GPUS: "8" + DECODE_GPUS: "16" + TOTAL_GPUS: "40" + DSV4: "true" identity: container: diff --git a/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep8-dep16.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep8-dep16.yaml new file mode 100644 index 000000000..85ff907e3 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep8-dep16.yaml @@ -0,0 +1,150 @@ +name: "dsv4-vllm-disagg-gb200-3p1d-dep8-dep16-offload" + +# Mirrored from NVIDIA/srt-slurm aflowers/gb200-dsv4-recipes branch (PR #77): +# recipes/vllm/deepseek-v4-pro-sa/8k1k/disagg-gb200-3p1d-dep8-dep16-c4096-offload.yaml +# +# Topology: 3 prefill (DEP=8) + 1 wide decode (DEP=16). 10 nodes. +# c4096-tuned variant. +# +# Local deltas vs upstream: +# * model.path alias renamed deepseekv4-fp4 -> deepseek-v4-pro to match +# SRT_SLURM_MODEL_PREFIX in runners/launch_gb200-nv.sh. +# * model.container set to vllm/vllm-openai:v0.20.0-ubuntu2404 to +# match nvidia-master.yaml image (which the launch script registers as +# the alias key in srtslurm.yaml). Upstream variants ship either the +# non-dynamo floating tag or a sha256 pin. +# * slurm.time_limit + health_check set to 8h / 1440 attempts to +# absorb cold-cache /mnt/numa1 model loads. 
+model: + path: "deepseek-v4-pro" + container: "vllm/vllm-openai:v0.20.0-ubuntu2404" + precision: "fp4" + +dynamo: + install: true + wheel: "1.2.0.dev20260426" + +setup_script: vllm-container-deps.sh + +slurm: + time_limit: "8:00:00" + +health_check: + max_attempts: 1440 + interval_seconds: 10 +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 6 + decode_nodes: 4 + prefill_workers: 3 + decode_workers: 1 + gpus_per_prefill: 8 + gpus_per_decode: 16 +frontend: + type: dynamo + enable_multiple_frontends: false +backend: + type: vllm + connector: null + prefill_environment: + TILELANG_CLEANUP_TEMP_FILES: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + VLLM_SERVER_DEV_MODE: "1" + VLLM_SPARSE_INDEXER_MAX_LOGITS_MB: "1024" + VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE: "2048" + # VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1" + # VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random" + UCX_MEMTYPE_CACHE: "n" + UCX_MEMTYPE_REG_WHOLE: "n" + UCX_TLS: "cuda_copy,cuda_ipc,tcp" + UCX_CUDA_IPC_ENABLE_MNNVL: "y" + NCCL_P2P_LEVEL: NVL + decode_environment: + TILELANG_CLEANUP_TEMP_FILES: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + VLLM_SERVER_DEV_MODE: "1" + # VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1" + # VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random" + UCX_MEMTYPE_CACHE: "n" + UCX_MEMTYPE_REG_WHOLE: "n" + UCX_TLS: "cuda_copy,cuda_ipc,tcp" + UCX_CUDA_IPC_ENABLE_MNNVL: "y" + NCCL_P2P_LEVEL: NVL + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 8 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + enforce-eager: true + max-model-len: 16384 + max-num-seqs: 16 + max-num-batched-tokens: 32768 + trust-remote-code: true + no-enable-prefix-caching: 
true + no-enable-flashinfer-autotune: true + no-async-scheduling: true + block-size: 256 + gpu-memory-utilization: 0.8 + no-disable-hybrid-kv-cache-manager: true + enable-sleep-mode: true + numa-bind: true + offload-group-size: 3 + offload-num-in-group: 1 + offload-prefetch-step: 2 + # offload-params: "w13_weight w2_weight w13_weight_scale w2_weight_scale wq_b wo_a wo_b shared_experts" + tokenizer-mode: deepseek_v4 + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 16 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 16384 + max-num-seqs: 256 + max-cudagraph-capture-size: 256 + max-num-batched-tokens: 256 + trust-remote-code: true + no-enable-prefix-caching: true + block-size: 256 + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + no-disable-hybrid-kv-cache-manager: true + enable-sleep-mode: true + tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + MODEL_NAME: "deepseek-ai/DeepSeek-V4-Pro" + PREFILL_GPUS: "8" + DECODE_GPUS: "16" + TOTAL_GPUS: "40" + DSV4: "true" + +identity: + container: + image: "vllm/vllm-openai:v0.20.0-ubuntu2404" + frameworks: + dynamo: "1.2.0.dev20260426" + vllm: "0.20.0" diff --git a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-latency.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-latency.yaml similarity index 91% rename from benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-latency.yaml rename to benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-latency.yaml index 137e3017a..b6e334b02 100644 --- a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-latency.yaml +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-latency.yaml @@ -131,14 +131,19 @@ backend: no-disable-hybrid-kv-cache-manager: true enable-sleep-mode: true tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + benchmark: - type: "sa-bench" - isl: 8192 - osl: 1024 - concurrencies: "1" - req_rate: "inf" - use_chat_template: true - custom_tokenizer: "sa_bench_tokenizers.vllm_deepseek_v4.VLLMDeepseekV4Tokenizer" + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + DSV4: "true" identity: container: diff --git a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-middle-curve.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-middle-curve.yaml similarity index 91% rename from benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-middle-curve.yaml rename to benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-middle-curve.yaml index 20672bfdf..3d924449d 100644 --- a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-middle-curve.yaml +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-middle-curve.yaml @@ -133,14 +133,19 @@ backend: no-disable-hybrid-kv-cache-manager: true enable-sleep-mode: true tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + benchmark: - type: "sa-bench" - isl: 8192 - osl: 1024 - concurrencies: "256x512" - req_rate: "inf" - use_chat_template: true - custom_tokenizer: "sa_bench_tokenizers.vllm_deepseek_v4.VLLMDeepseekV4Tokenizer" + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "40" + DSV4: "true" identity: container: diff --git a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-max-tpt-megamoe.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt-megamoe.yaml similarity index 92% rename from benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-max-tpt-megamoe.yaml rename to benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt-megamoe.yaml index fe3840109..e749199ed 100644 --- a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-max-tpt-megamoe.yaml +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt-megamoe.yaml @@ -134,14 +134,19 @@ backend: no-disable-hybrid-kv-cache-manager: true enable-sleep-mode: true tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + benchmark: - type: "sa-bench" - isl: 8192 - osl: 1024 - concurrencies: "4096" - req_rate: "inf" - use_chat_template: true - custom_tokenizer: "sa_bench_tokenizers.vllm_deepseek_v4.VLLMDeepseekV4Tokenizer" + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + DSV4: "true" identity: model: diff --git a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-max-tpt.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt.yaml similarity index 91% rename from benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-max-tpt.yaml rename to benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt.yaml index 754d61662..d6b2c11f2 100644 --- a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-max-tpt.yaml +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt.yaml @@ -131,14 +131,19 @@ backend: no-disable-hybrid-kv-cache-manager: true enable-sleep-mode: true tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + benchmark: - type: "sa-bench" - isl: 8192 - osl: 1024 - concurrencies: "4096" - req_rate: "inf" - use_chat_template: true - custom_tokenizer: "sa_bench_tokenizers.vllm_deepseek_v4.VLLMDeepseekV4Tokenizer" + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "32" + DSV4: "true" identity: container: diff --git a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-mid-curve.yaml b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-mid-curve.yaml similarity index 91% rename from benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-mid-curve.yaml rename to benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-mid-curve.yaml index bf8e6c452..0e40d5d40 100644 --- a/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-mid-curve.yaml +++ b/benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-mid-curve.yaml @@ -131,14 +131,19 @@ backend: no-disable-hybrid-kv-cache-manager: true enable-sleep-mode: true tokenizer-mode: deepseek_v4 +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + benchmark: - type: "sa-bench" - isl: 8192 - osl: 1024 - concurrencies: "256" - req_rate: "inf" - use_chat_template: true - custom_tokenizer: "sa_bench_tokenizers.vllm_deepseek_v4.VLLMDeepseekV4Tokenizer" + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "8" + DECODE_GPUS: "8" + TOTAL_GPUS: "16" + DSV4: "true" identity: container: diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..49a38528d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml @@ -0,0 +1,131 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep16_batch32_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32 +# STP (no speculative decoding) +# concurrency: 666 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + 
tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
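+# TOTAL_GPUS below follows the worker-sum rule from CONFIGS.md:
+# 1 prefill worker x 4 GPUs + 1 decode worker x 16 GPUs = 20.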
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "20" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml new file mode 100644 index 000000000..c83b4c67b --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml @@ -0,0 +1,135 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep32_batch64_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=64 +# STP (no speculative decoding) +# concurrency: 2253 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + 
enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
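+# Workload knobs (MODEL, ISL, OSL, CONC_LIST, DISAGG, RANDOM_RANGE_RATIO)
+# are exported by benchmark-multinode-tmpl.yml and reach this container via
+# srtctl -> srun (default --export=ALL) -> pyxis, so benchmark.env only
+# carries the per-recipe filename components.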
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + enable_multiple_frontends: true + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml new file mode 100644 index 000000000..e5a833580 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml @@ -0,0 +1,223 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=768 +# STP (no speculative decoding) +# Covers all dep8 concurrencies: 4301, 6452 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + 
moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 768 + max_num_tokens: 768 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + - 520 + - 528 + - 536 + - 544 + - 552 + - 560 + - 568 + - 576 + - 584 + - 592 + - 600 + - 608 + - 616 + - 624 + - 632 + - 640 + - 648 + - 656 + - 664 + - 672 + - 680 + - 688 + - 696 + - 704 + - 712 + - 720 + - 728 + - 736 + - 744 + - 752 + - 760 + - 768 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench 
container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. +container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "12" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml new file mode 100644 index 000000000..a56150450 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml @@ -0,0 +1,144 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP8/EP8, allreduce_strategy=MNNVL, max_batch=128 +# STP (no speculative decoding) +# Covers all gen4tep8 concurrencies: 4, 192, 360, 668 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + 
TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + allreduce_strategy: MNNVL + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml new file mode 100644 index 000000000..ffb109b8d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml @@ -0,0 +1,128 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 5 decode workers, TP4/EP4, max_batch=8 +# STP (no speculative decoding) +# Covers all gen5tep4 concurrencies: 5, 15, 30, 55 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + 
pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 8 + max_num_tokens: 8 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml new file mode 100644 index 000000000..f75876142 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml @@ -0,0 +1,159 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep16_batch256_eplb0_mtp0" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=256 +# STP (no speculative decoding) +# concurrency: 4301 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 
+ enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml new file mode 100644 index 000000000..7fdf9daea --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/1k1k/disagg/stp/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml @@ -0,0 +1,143 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep32_batch128_eplb0_mtp0" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=128 +# STP (no speculative decoding) +# concurrency: 4301 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 
1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "32" + TOTAL_GPUS: "40" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml new file mode 100644 index 000000000..bbc7627ee --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml @@ -0,0 +1,132 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP4/EP4, max_batch=32 +# Single concurrency point: 156 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 1 worker x TP4 = 4 GPUs = 1 node + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + # Decode: 4 workers x TP4 = 16 GPUs = 4 nodes + decode_workers: 4 + decode_nodes: 4 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + 
tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "20" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml new file mode 100644 index 000000000..5a0b04c91 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml @@ -0,0 +1,129 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP8/EP8, allreduce_strategy=MNNVL, max_batch=1 +# Single concurrency point: 4 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 1 worker x TP4 = 4 GPUs = 1 node + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + # Decode: 4 workers x TP8 = 32 GPUs = 8 nodes + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + 
prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + allreduce_strategy: MNNVL + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "36" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml new file mode 100644 index 000000000..90d294ff5 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml @@ -0,0 +1,132 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 5 decode workers, TP4/EP4, max_batch=16 +# Covers all concurrencies: 5, 15, 30, 60, 105 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 1 worker x TP4 = 4 GPUs = 1 node + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + # Decode: 5 workers x TP4 = 20 GPUs = 5 nodes + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + 
tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + # max_batch_size=16 covers all concs: 5, 15, 30, 60, 105 + # cuda_graph pre-compiles graphs for each batch size up to the max + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml new file mode 100644 index 000000000..8cc508d5e --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml @@ -0,0 +1,130 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx2dep4_gen1dep16_batch16_eplb0_mtp0" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=16 +# concurrency: 333 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 2 workers x TP4 = 8 GPUs = 2 nodes + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + # Decode: 1 worker x TP16 = 16 GPUs = 4 nodes + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 
+ moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "24" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml new file mode 100644 index 000000000..528b0b4f9 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml @@ -0,0 +1,132 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx3dep4_gen1dep16_batch32_eplb0_mtp0" + +# ctx: 3 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32 +# concurrency: 615 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 3 workers x TP4 = 12 GPUs = 3 nodes + prefill_nodes: 3 + prefill_workers: 3 + gpus_per_prefill: 4 + + # Decode: 1 worker x TP16 = 16 GPUs = 4 nodes + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 
4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml new file mode 100644 index 000000000..d0dbf80f0 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml @@ -0,0 +1,161 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0" + +# ctx: 5 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=256 +# Single concurrency point: 2151 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 5 workers x TP4 = 20 GPUs = 5 nodes + prefill_nodes: 5 + prefill_workers: 5 + gpus_per_prefill: 4 + + # Decode: 1 worker x TP8 = 8 GPUs = 2 nodes + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + 
trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + # max_batch_size=256, cuda_graph pre-compiles graphs for all batch sizes up to 256 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml new file mode 100644 index 000000000..6eb391bba --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/trtllm/gb200-fp4/8k1k/disagg/stp/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml @@ -0,0 +1,144 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx7dep4_gen1dep16_batch128_eplb0_mtp0" + +# ctx: 7 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=128 +# concurrency: 2253 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 7 workers x TP4 = 28 GPUs = 7 nodes + prefill_nodes: 7 + prefill_workers: 7 + gpus_per_prefill: 4 + + # Decode: 1 worker x TP16 = 16 GPUs = 4 nodes + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + 
tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "44" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p1d-dep4-dep16.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p1d-dep4-dep16.yaml new file mode 100644 index 000000000..c5230d9e5 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p1d-dep4-dep16.yaml @@ -0,0 +1,107 @@ +name: "kimi-vllm-disagg-gb200-1p1d-dep4-dep16" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 1 + decode_nodes: 4 + prefill_workers: 1 + decode_workers: 1 + gpus_per_prefill: 4 + gpus_per_decode: 16 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + 
max-model-len: 3072 + max-num-seqs: 4096 + enforce-eager: true + compilation-config: '{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 16 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 3072 + max-num-seqs: 4096 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + all2all-backend: "flashinfer_nvlink_one_sided" + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 512 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "20" diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p4d-dep4-tep4.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p4d-dep4-tep4.yaml new file mode 100644 index 000000000..0992a5091 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p4d-dep4-tep4.yaml @@ -0,0 +1,104 @@ +name: "kimi-vllm-disagg-gb200-1p4d-dep4-tep4" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 1 + decode_nodes: 4 + prefill_workers: 1 + decode_workers: 4 + gpus_per_prefill: 4 + gpus_per_decode: 4 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 3072 + max-num-seqs: 1024 + enforce-eager: true + compilation-config: 
'{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 4 + pipeline-parallel-size: 1 + enable-expert-parallel: true + max-model-len: 3072 + max-num-seqs: 1024 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 1024 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "20" diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-1p4d-dep4-tep4.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-1p4d-dep4-tep4.yaml new file mode 100644 index 000000000..5670a9d54 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-1p4d-dep4-tep4.yaml @@ -0,0 +1,104 @@ +name: "kimi-vllm-disagg-gb200-1p4d-dep4-tep4" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 1 + decode_nodes: 4 + prefill_workers: 1 + decode_workers: 4 + gpus_per_prefill: 4 + gpus_per_decode: 4 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 64 + enforce-eager: true + compilation-config: 
'{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 4 + pipeline-parallel-size: 1 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 16 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 16 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "4" + TOTAL_GPUS: "20" diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep4-dep16.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep4-dep16.yaml new file mode 100644 index 000000000..cecacdfd7 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep4-dep16.yaml @@ -0,0 +1,107 @@ +name: "kimi-vllm-disagg-gb200-3p1d-dep4-dep16" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 3 + decode_nodes: 4 + prefill_workers: 3 + decode_workers: 1 + gpus_per_prefill: 4 + gpus_per_decode: 16 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 64 + enforce-eager: true + compilation-config: 
'{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 16 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 256 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + all2all-backend: "flashinfer_nvlink_one_sided" + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 256 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "28" diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-5p1d-dep4-dep8.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-5p1d-dep4-dep8.yaml new file mode 100644 index 000000000..259db9436 --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-5p1d-dep4-dep8.yaml @@ -0,0 +1,107 @@ +name: "kimi-vllm-disagg-gb200-5p1d-dep4-dep8" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 5 + decode_nodes: 2 + prefill_workers: 5 + decode_workers: 1 + gpus_per_prefill: 4 + gpus_per_decode: 8 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 64 + enforce-eager: true + compilation-config: 
'{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 8 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 512 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + all2all-backend: "flashinfer_nvlink_one_sided" + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 512 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "8" + TOTAL_GPUS: "28" diff --git a/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-6p1d-dep4-dep16.yaml b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-6p1d-dep4-dep16.yaml new file mode 100644 index 000000000..0a26d118d --- /dev/null +++ b/benchmarks/multi_node/srt-slurm-recipes/kimik2.5/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-6p1d-dep4-dep16.yaml @@ -0,0 +1,107 @@ +name: "kimi-vllm-disagg-gb200-6p1d-dep4-dep16" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 6 + decode_nodes: 4 + prefill_workers: 6 + decode_workers: 1 + gpus_per_prefill: 4 + gpus_per_decode: 16 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 64 + enforce-eager: true + compilation-config: 
'{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 16 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 512 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + all2all-backend: "flashinfer_nvlink_one_sided" + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 512 + +# Bench client lives in this repo; mounted into the bench container at +# /infmax-workspace. See benchmarks/multi_node/srt_bench.sh for the env contract. 
+container_mounts: + "$INFMAX_WORKSPACE": "/infmax-workspace" + +benchmark: + type: "custom" + command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh" + env: + PREFILL_GPUS: "4" + DECODE_GPUS: "16" + TOTAL_GPUS: "40" diff --git a/benchmarks/multi_node/srt_bench.sh b/benchmarks/multi_node/srt_bench.sh new file mode 100755 index 000000000..aeb1ef502 --- /dev/null +++ b/benchmarks/multi_node/srt_bench.sh @@ -0,0 +1,127 @@ +#!/usr/bin/env bash +# Multi-node bench-serving wrapper invoked by srt-slurm via +# `benchmark.type: custom`. srt-slurm owns server bring-up; this script runs +# inside the same job's benchmark container against the already-ready +# frontend on the head node, then writes one results JSON per concurrency to +# /logs/sa-bench_isl_${ISL}_osl_${OSL}/ — the same path the launcher's existing +# result-harvesters glob. +# +# This is a thin loop on top of run_benchmark_serving() in benchmark_lib.sh +# (the same shim every single-node bench script uses), so any future change +# to bench-serving CLI conventions, profiling, server-health monitoring, etc. +# applies here automatically. +# +# Reads from env. 
Most of these are *already* exported by +# .github/workflows/benchmark-multinode-tmpl.yml at the workflow step level +# and propagate down through the launcher → srtctl → srun (default +# --export=ALL) → pyxis → bench container, so recipes do not need to +# re-declare them in `benchmark.env`: +# +# $MODEL served-model-name; matches workflow `inputs.model` +# $ISL $OSL sequence lengths +# $CONC_LIST space-separated concurrency list +# $DISAGG "true" / "false" — disagg vs aggregated +# $RANDOM_RANGE_RATIO 0.8 (workflow default) +# +# Per-recipe knobs that *do* live in `benchmark.env` (no workflow equivalent): +# PREFILL_GPUS per-prefill-worker GPU count (filename component) +# DECODE_GPUS per-decode-worker GPU count (filename component) +# TOTAL_GPUS sum across all workers (filename component) +# +# Optional per-recipe overrides (defaults shown): +# MODEL_NAME=$MODEL override when server's served-model-name differs +# from the master-yaml `model:` field +# PORT=8000 frontend port reachable at localhost +# BACKEND=openai generic OpenAI-API; works against the dynamo frontend +# ENDPOINT= empty -> bench_serving.py default (/v1/completions) +# NUM_PROMPTS_MULT=10 prompts per conc = NUM_PROMPTS_MULT * conc +# USE_CHAT_TEMPLATE=true +# DSV4=false sets the --dsv4 flag (auto-enables chat template) +# TRUST_REMOTE_CODE=true +# +# The InferenceX repo is bind-mounted at /infmax-workspace via each recipe's +# `container_mounts` block. Model files are auto-mounted at /model by srtctl +# (RuntimeContext.create unconditionally adds the mount when model.path is a +# local path), so we point --tokenizer at /model to load the tokenizer from +# the same files the engine is serving — no HF Hub dependency. 
+set -euo pipefail + +INFMAX_WS="${INFMAX_CONTAINER_WORKSPACE:-/infmax-workspace}" +# shellcheck disable=SC1091 +source "$INFMAX_WS/benchmarks/benchmark_lib.sh" + +check_env_vars MODEL ISL OSL CONC_LIST DISAGG \ + PREFILL_GPUS DECODE_GPUS TOTAL_GPUS + +MODEL_NAME="${MODEL_NAME:-$MODEL}" +PORT="${PORT:-8000}" +# `openai` matches every dynamo frontend (frontend exposes a generic OpenAI- +# compatible API regardless of the underlying engine). Recipes that need +# /v1/chat/completions can override ENDPOINT. +BACKEND="${BACKEND:-openai}" +ENDPOINT="${ENDPOINT:-}" +RANDOM_RANGE_RATIO="${RANDOM_RANGE_RATIO:-0.8}" +NUM_PROMPTS_MULT="${NUM_PROMPTS_MULT:-10}" +USE_CHAT_TEMPLATE="${USE_CHAT_TEMPLATE:-true}" +DSV4="${DSV4:-false}" +TRUST_REMOTE_CODE="${TRUST_REMOTE_CODE:-true}" + +RESULT_DIR="/logs/sa-bench_isl_${ISL}_osl_${OSL}" +mkdir -p "$RESULT_DIR" + +# srt-slurm worker containers don't always ship bench_serving.py's runtime +# deps (datasets in particular). Install missing ones into a system-site- +# packages venv so we don't perturb the framework's own packages. +ensure_bench_serving_deps() { + local deps=(aiohttp numpy pandas datasets Pillow tqdm transformers huggingface_hub) + if python3 -c "import aiohttp, numpy, pandas, datasets, PIL, tqdm, transformers, huggingface_hub" 2>/dev/null; then + return + fi + local venv="/tmp/srt-bench-venv" + [[ -d "$venv" ]] || python3 -m venv --system-site-packages "$venv" + # shellcheck disable=SC1091 + source "$venv/bin/activate" + pip install --quiet "${deps[@]}" +} +ensure_bench_serving_deps + +curl -fsS "http://localhost:${PORT}/v1/models" >/dev/null || { + echo "ERROR: frontend at http://localhost:${PORT} did not respond on /v1/models" >&2 + exit 66 +} +ulimit -n 65536 2>/dev/null || true + +# CONC_LIST from the workflow is space-separated; bench loops one run per value. 
+read -r -a CONC_LIST_ARR <<< "$CONC_LIST" + +for conc in "${CONC_LIST_ARR[@]}"; do + if [[ "$DISAGG" == "true" ]]; then + result_filename="results_concurrency_${conc}_gpus_${TOTAL_GPUS}_ctx_${PREFILL_GPUS}_gen_${DECODE_GPUS}" + else + result_filename="results_concurrency_${conc}_gpus_${TOTAL_GPUS}" + fi + echo "=== conc=$conc → $RESULT_DIR/${result_filename}.json ===" + + args=( + --model "$MODEL_NAME" + --tokenizer /model + --port "$PORT" + --backend "$BACKEND" + --input-len "$ISL" + --output-len "$OSL" + --random-range-ratio "$RANDOM_RANGE_RATIO" + --num-prompts "$((conc * NUM_PROMPTS_MULT))" + --max-concurrency "$conc" + --result-filename "$result_filename" + --result-dir "$RESULT_DIR" + --bench-serving-dir "$INFMAX_WS" + ) + [[ -n "$ENDPOINT" ]] && args+=(--endpoint "$ENDPOINT") + [[ "$USE_CHAT_TEMPLATE" == "true" ]] && args+=(--use-chat-template) + [[ "$DSV4" == "true" ]] && args+=(--dsv4) + [[ "$TRUST_REMOTE_CODE" == "true" ]] && args+=(--trust-remote-code) + + run_benchmark_serving "${args[@]}" +done + +echo "Done. Results in $RESULT_DIR." diff --git a/runners/launch_b200-cw.sh b/runners/launch_b200-cw.sh index 0b2dbf305..fbdd60554 100644 --- a/runners/launch_b200-cw.sh +++ b/runners/launch_b200-cw.sh @@ -1,5 +1,7 @@ #!/usr/bin/env bash +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + export HF_HUB_CACHE_MOUNT="/tmp/gharunner/hf-hub-cache" export PORT=8888 @@ -16,7 +18,7 @@ if [[ ! 
-f "$BENCH_SCRIPT" ]]; then fi PARTITION="b200" -SQUASH_FILE="/tmp/gharunner/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" +SQUASH_FILE="/tmp/gharunner/squash/$(sanitize_image_filename "$IMAGE").sqsh" LOCK_FILE="${SQUASH_FILE}.lock" # TODO(Cam): lmsysorg/sglang:deepseek-v4-blackwell installs sglang editable at diff --git a/runners/launch_b200-dgxc.sh b/runners/launch_b200-dgxc.sh index edf5db957..3e294f859 100644 --- a/runners/launch_b200-dgxc.sh +++ b/runners/launch_b200-dgxc.sh @@ -4,6 +4,8 @@ SLURM_PARTITION="gpu" SLURM_ACCOUNT="benchmark" +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + set -x if [[ "$IS_MULTINODE" == "true" ]]; then @@ -29,35 +31,14 @@ if [[ "$IS_MULTINODE" == "true" ]]; then fi export SERVED_MODEL_NAME=$MODEL - echo "Cloning srt-slurm repository..." - SRT_REPO_DIR="srt-slurm" - if [ -d "$SRT_REPO_DIR" ]; then - echo "Removing existing $SRT_REPO_DIR..." - rm -rf "$SRT_REPO_DIR" - fi - - git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR" - cd "$SRT_REPO_DIR" || exit 1 - git checkout sa-submission-q2-2026 - - echo "Installing srtctl..." - export UV_INSTALL_DIR="$GITHUB_WORKSPACE/.local/bin" - curl -LsSf https://astral.sh/uv/install.sh | sh - export PATH="$UV_INSTALL_DIR:$PATH" - - uv venv "$GITHUB_WORKSPACE/.venv" - source "$GITHUB_WORKSPACE/.venv/bin/activate" - uv pip install -e . - - if ! 
command -v srtctl &> /dev/null; then - echo "Error: Failed to install srtctl" - exit 1 - fi + UV_INSTALL_DIR="$GITHUB_WORKSPACE/.local/bin" \ + UV_VENV_DIR="$GITHUB_WORKSPACE/.venv" \ + clone_and_install_srtctl || exit 1 # Map container images to local squash files NGINX_IMAGE="nginx:1.27.4" - SQUASH_FILE="/home/sa-shared/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" - NGINX_SQUASH_FILE="/home/sa-shared/containers/$(echo "$NGINX_IMAGE" | sed 's/[\/:@#]/_/g').sqsh" + SQUASH_FILE="/home/sa-shared/containers/$(sanitize_image_filename "$IMAGE").sqsh" + NGINX_SQUASH_FILE="/home/sa-shared/containers/$(sanitize_image_filename "$NGINX_IMAGE").sqsh" # Import containers via enroot enroot import -o $SQUASH_FILE docker://$IMAGE @@ -105,7 +86,7 @@ EOF echo "Submitting job with srtctl..." if [[ -z "$CONFIG_FILE" ]]; then - echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a CONFIG_FILE in additional-settings." >&2 + echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a 'recipe:' field on the search-space entry (resolved by benchmark-multinode-tmpl.yml)." >&2 echo "Config: MODEL_PREFIX=${MODEL_PREFIX} PRECISION=${PRECISION} FRAMEWORK=${FRAMEWORK}" >&2 exit 1 fi @@ -250,7 +231,7 @@ EOF else HF_HUB_CACHE_MOUNT="/scratch/fsw/gharunners/hf-hub-cache" - SQUASH_FILE="/home/sa-shared/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" + SQUASH_FILE="/home/sa-shared/containers/$(sanitize_image_filename "$IMAGE").sqsh" FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '') SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '') # Prefer a framework-tagged script (e.g. 
dsv4_fp4_b200_vllm.sh) so models diff --git a/runners/launch_b300-nv.sh b/runners/launch_b300-nv.sh index 3c855e805..23f75ac80 100644 --- a/runners/launch_b300-nv.sh +++ b/runners/launch_b300-nv.sh @@ -4,6 +4,8 @@ SLURM_PARTITION="batch_1" SLURM_ACCOUNT="benchmark" +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + set -x if [[ "$IS_MULTINODE" == "true" ]]; then @@ -30,35 +32,14 @@ else exit 1 fi -echo "Cloning srt-slurm repository..." -SRT_REPO_DIR="srt-slurm" -if [ -d "$SRT_REPO_DIR" ]; then - echo "Removing existing $SRT_REPO_DIR..." - rm -rf "$SRT_REPO_DIR" -fi - -git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR" -cd "$SRT_REPO_DIR" || exit 1 -git checkout sa-submission-q2-2026 - -echo "Installing srtctl..." -export UV_INSTALL_DIR="$GITHUB_WORKSPACE/.local/bin" -curl -LsSf https://astral.sh/uv/install.sh | sh -export PATH="$UV_INSTALL_DIR:$PATH" - -uv venv "$GITHUB_WORKSPACE/.venv" -source "$GITHUB_WORKSPACE/.venv/bin/activate" -uv pip install -e . - -if ! command -v srtctl &> /dev/null; then - echo "Error: Failed to install srtctl" - exit 1 -fi +UV_INSTALL_DIR="$GITHUB_WORKSPACE/.local/bin" \ +UV_VENV_DIR="$GITHUB_WORKSPACE/.venv" \ + clone_and_install_srtctl || exit 1 # Map container images to local squash files NGINX_IMAGE="nginx:1.27.4" -SQUASH_FILE="/data/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" -NGINX_SQUASH_FILE="/data/squash/$(echo "$NGINX_IMAGE" | sed 's/[\/:@#]/_/g').sqsh" +SQUASH_FILE="/data/squash/$(sanitize_image_filename "$IMAGE").sqsh" +NGINX_SQUASH_FILE="/data/squash/$(sanitize_image_filename "$NGINX_IMAGE").sqsh" # Import containers via enroot srun -N 1 -A $SLURM_ACCOUNT -p $SLURM_PARTITION bash -c "enroot import -o $SQUASH_FILE docker://$IMAGE" @@ -108,7 +89,7 @@ export INFMAX_WORKSPACE="$GITHUB_WORKSPACE" echo "Submitting job with srtctl..." if [[ -z "$CONFIG_FILE" ]]; then - echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a CONFIG_FILE in additional-settings." 
>&2 + echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a 'recipe:' field on the search-space entry (resolved by benchmark-multinode-tmpl.yml)." >&2 echo "Config: MODEL_PREFIX=${MODEL_PREFIX} PRECISION=${PRECISION} FRAMEWORK=${FRAMEWORK}" >&2 exit 1 fi @@ -258,7 +239,7 @@ else elif [[ "$MODEL_PREFIX" == "dsv4" ]]; then export MODEL="$HF_HUB_CACHE_MOUNT/dsv4-pro" fi - SQUASH_FILE="/data/home/sa-shared/gharunners/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" + SQUASH_FILE="/data/home/sa-shared/gharunners/squash/$(sanitize_image_filename "$IMAGE").sqsh" SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '') # Prefer a framework-tagged script (e.g. dsv4_fp4_b300_sglang.sh) so models # with multiple inference engines can coexist; fall back to the historical diff --git a/runners/launch_gb200-nv.sh b/runners/launch_gb200-nv.sh index 333e94359..c8c822c6f 100755 --- a/runners/launch_gb200-nv.sh +++ b/runners/launch_gb200-nv.sh @@ -2,6 +2,8 @@ # This script sets up the environment and launches multi-node benchmarks +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + set -x # MODEL_PATH: Override with pre-downloaded paths on GB200 runner @@ -62,8 +64,8 @@ export SLURM_ACCOUNT="benchmark" NGINX_IMAGE="nginx:1.27.4" -SQUASH_FILE="/mnt/lustre01/users-public/sa-shared/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" -NGINX_SQUASH_FILE="/mnt/lustre01/users-public/sa-shared/$(echo "$NGINX_IMAGE" | sed 's/[\/:@#]/_/g').sqsh" +SQUASH_FILE="/mnt/lustre01/users-public/sa-shared/$(sanitize_image_filename "$IMAGE").sqsh" +NGINX_SQUASH_FILE="/mnt/lustre01/users-public/sa-shared/$(sanitize_image_filename "$NGINX_IMAGE").sqsh" enroot import -o $SQUASH_FILE docker://$IMAGE enroot import -o $NGINX_SQUASH_FILE docker://$NGINX_IMAGE @@ -125,57 +127,19 @@ PY fi -# srt-slurm path requires a CONFIG_FILE pointing to a recipe YAML. -# Without it, srtctl apply scans every YAML in the repo and submits hundreds of jobs. 
+# srt-slurm path requires CONFIG_FILE (set by benchmark-multinode-tmpl.yml from +# the search-space `recipe:` field). Without it, srtctl apply scans every YAML +# in the repo and submits hundreds of jobs. if [[ -z "$CONFIG_FILE" ]]; then - echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a CONFIG_FILE in additional-settings." >&2 + echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a 'recipe:' field on the search-space entry (resolved by benchmark-multinode-tmpl.yml)." >&2 echo "Config: MODEL_PREFIX=${MODEL_PREFIX} PRECISION=${PRECISION} FRAMEWORK=${FRAMEWORK}" >&2 exit 1 fi -echo "Cloning srt-slurm repository..." -SRT_REPO_DIR="srt-slurm" -if [ -d "$SRT_REPO_DIR" ]; then - echo "Removing existing $SRT_REPO_DIR..." - rm -rf "$SRT_REPO_DIR" -fi - -if [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "dsv4" ]]; then - git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR" - cd "$SRT_REPO_DIR" - git checkout aflowers/vllm-gb200-v0.20.0 - # Use `cp -rT` so if the upstream branch ever ships a stub - # `recipes/vllm/deepseek-v4/` directory, we overlay our recipes onto - # it rather than nesting (`cp -r src dst` would create - # `recipes/vllm/deepseek-v4/deepseek-v4/...` in that case). - mkdir -p recipes/vllm/deepseek-v4 - cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4" recipes/vllm/deepseek-v4 -elif [[ $FRAMEWORK == "dynamo-vllm" ]]; then - git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR" - cd "$SRT_REPO_DIR" - git checkout sa-submission-q2-2026 -elif [[ $FRAMEWORK == "dynamo-trt" && $MODEL_PREFIX == "kimik2.5" ]]; then - git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR" - cd "$SRT_REPO_DIR" - git checkout sa-submission-q2-2026 -else - git clone https://github.com/ishandhanani/srt-slurm.git "$SRT_REPO_DIR" - cd "$SRT_REPO_DIR" - git checkout sa-submission-q1-2026 -fi - -echo "Installing srtctl..." 
-curl -LsSf https://astral.sh/uv/install.sh | sh -source $HOME/.local/bin/env - -uv venv -source .venv/bin/activate -uv pip install -e . - -if ! command -v srtctl &> /dev/null; then - echo "Error: Failed to install srtctl" - exit 1 -fi +# We only clone srt-slurm to install srtctl + pick up its sibling configs +# (configs/, expert-distributions/, etc). The recipe itself is supplied as an +# absolute CONFIG_FILE pointing at benchmarks/multi_node/srt-slurm-recipes/. +clone_and_install_srtctl || exit 1 echo "Configs available at: $SRT_REPO_DIR/" diff --git a/runners/launch_gb300-nv.sh b/runners/launch_gb300-nv.sh index 5f48ddcec..a0790260e 100644 --- a/runners/launch_gb300-nv.sh +++ b/runners/launch_gb300-nv.sh @@ -2,6 +2,8 @@ # This script sets up the environment and launches multi-node benchmarks +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + set -x export SLURM_PARTITION="batch" @@ -25,8 +27,8 @@ fi NGINX_IMAGE="nginx:1.27.4" -SQUASH_FILE="/home/sa-shared/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" -NGINX_SQUASH_FILE="/home/sa-shared/squash/$(echo "$NGINX_IMAGE" | sed 's/[\/:@#]/_/g').sqsh" +SQUASH_FILE="/home/sa-shared/squash/$(sanitize_image_filename "$IMAGE").sqsh" +NGINX_SQUASH_FILE="/home/sa-shared/squash/$(sanitize_image_filename "$NGINX_IMAGE").sqsh" srun --partition=$SLURM_PARTITION --exclusive --time=180 bash -c "enroot import -o $SQUASH_FILE docker://$IMAGE" srun --partition=$SLURM_PARTITION --exclusive --time=180 bash -c "enroot import -o $NGINX_SQUASH_FILE docker://$NGINX_IMAGE" @@ -36,30 +38,9 @@ export EVAL_ONLY="${EVAL_ONLY:-false}" export ISL="$ISL" export OSL="$OSL" -echo "Cloning srt-slurm repository..." -SRT_REPO_DIR="srt-slurm" -if [ -d "$SRT_REPO_DIR" ]; then - echo "Removing existing $SRT_REPO_DIR..." - rm -rf "$SRT_REPO_DIR" -fi - -git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR" -cd "$SRT_REPO_DIR" -git checkout sa-submission-q2-2026 - -echo "Installing srtctl..." 
-export UV_INSTALL_DIR="$GITHUB_WORKSPACE/.local/bin" -curl -LsSf https://astral.sh/uv/install.sh | sh -export PATH="$UV_INSTALL_DIR:$PATH" - -uv venv "$GITHUB_WORKSPACE/.venv" -source "$GITHUB_WORKSPACE/.venv/bin/activate" -uv pip install -e . - -if ! command -v srtctl &> /dev/null; then - echo "Error: Failed to install srtctl" - exit 1 -fi +UV_INSTALL_DIR="$GITHUB_WORKSPACE/.local/bin" \ +UV_VENV_DIR="$GITHUB_WORKSPACE/.venv" \ + clone_and_install_srtctl || exit 1 echo "Configs available at: $SRT_REPO_DIR/" @@ -103,7 +84,7 @@ export INFMAX_WORKSPACE="$GITHUB_WORKSPACE" echo "Submitting job with srtctl..." if [[ -z "$CONFIG_FILE" ]]; then - echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a CONFIG_FILE in additional-settings." >&2 + echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a 'recipe:' field on the search-space entry (resolved by benchmark-multinode-tmpl.yml)." >&2 echo "Config: MODEL_PREFIX=${MODEL_PREFIX} PRECISION=${PRECISION} FRAMEWORK=${FRAMEWORK}" >&2 exit 1 fi diff --git a/runners/launch_h100-cw.sh b/runners/launch_h100-cw.sh index f3198ca8c..e036e6219 100644 --- a/runners/launch_h100-cw.sh +++ b/runners/launch_h100-cw.sh @@ -1,8 +1,10 @@ #!/usr/bin/env bash +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + export HF_HUB_CACHE_MOUNT="/mnt/vast/gharunner/hf-hub-cache" PARTITION="h100" -SQUASH_FILE="/mnt/vast/gharunner/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" +SQUASH_FILE="/mnt/vast/gharunner/squash/$(sanitize_image_filename "$IMAGE").sqsh" LOCK_FILE="${SQUASH_FILE}.lock" set -x diff --git a/runners/launch_h100-dgxc-slurm.sh b/runners/launch_h100-dgxc-slurm.sh index 5a2ab64d2..f95816448 100644 --- a/runners/launch_h100-dgxc-slurm.sh +++ b/runners/launch_h100-dgxc-slurm.sh @@ -5,6 +5,8 @@ SLURM_PARTITION="hpc-gpu-1" SLURM_ACCOUNT="customer" SLURM_EXCLUDED_NODELIST="hpc-gpu-1-7" +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + set -x if [[ "$IS_MULTINODE" == "true" ]]; then @@ -34,36 +36,13 @@ 
if [[ "$IS_MULTINODE" == "true" ]]; then exit 1 fi - echo "Cloning srt-slurm repository..." - SRT_REPO_DIR="srt-slurm" - if [ -d "$SRT_REPO_DIR" ]; then - echo "Removing existing $SRT_REPO_DIR..." - rm -rf "$SRT_REPO_DIR" - fi - - git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR" - cd "$SRT_REPO_DIR" - git checkout sa-submission-q2-2026 - - echo "Installing srtctl..." - export UV_INSTALL_DIR="/mnt/nfs/sa-shared/.uv/bin" + # Pin uv state onto the NFS-shared volume so cluster nodes share a single + # cached install, and so the binary persists across runner workspaces. export UV_CACHE_DIR="/mnt/nfs/sa-shared/.uv/cache" export UV_PYTHON_INSTALL_DIR="/mnt/nfs/sa-shared/.uv/python" - mkdir -p "$UV_INSTALL_DIR" "$UV_CACHE_DIR" "$UV_PYTHON_INSTALL_DIR" - if ! [ -x "$UV_INSTALL_DIR/uv" ]; then - curl -LsSf https://astral.sh/uv/install.sh | sh - fi - export PATH="$UV_INSTALL_DIR:$PATH" - source $UV_INSTALL_DIR/env - - uv venv - source .venv/bin/activate - uv pip install -e . - - if ! command -v srtctl &> /dev/null; then - echo "Error: Failed to install srtctl" - exit 1 - fi + mkdir -p "$UV_CACHE_DIR" "$UV_PYTHON_INSTALL_DIR" + UV_INSTALL_DIR="/mnt/nfs/sa-shared/.uv/bin" \ + clone_and_install_srtctl || exit 1 echo "Configs available at: $SRT_REPO_DIR/" @@ -77,7 +56,7 @@ if [[ "$IS_MULTINODE" == "true" ]]; then elif [[ $FRAMEWORK == "dynamo-trt" ]]; then # TRT-LLM container mapping - convert IMAGE to srt-slurm format (nvcr.io/ -> nvcr.io#) CONTAINER_KEY=$(echo "$IMAGE" | sed 's|nvcr.io/|nvcr.io#|') - SQUASH_FILE="/mnt/nfs/sa-shared/containers/$(echo "$IMAGE" | sed 's|nvcr.io/||' | sed 's/[\/:@#]/+/g').sqsh" + SQUASH_FILE="/mnt/nfs/sa-shared/containers/$(sanitize_image_filename "${IMAGE#nvcr.io/}" +).sqsh" fi export ISL="$ISL" @@ -126,7 +105,7 @@ EOF echo "Submitting job with srtctl..." if [[ -z "$CONFIG_FILE" ]]; then - echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a CONFIG_FILE in additional-settings." 
>&2 + echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a 'recipe:' field on the search-space entry (resolved by benchmark-multinode-tmpl.yml)." >&2 echo "Config: MODEL_PREFIX=${MODEL_PREFIX} PRECISION=${PRECISION} FRAMEWORK=${FRAMEWORK}" >&2 exit 1 fi @@ -270,7 +249,7 @@ EOF else HF_HUB_CACHE_MOUNT="/mnt/nfs/sa-shared/gharunners/hf-hub-cache/" - SQUASH_FILE="/mnt/nfs/lustre/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" + SQUASH_FILE="/mnt/nfs/lustre/containers/$(sanitize_image_filename "$IMAGE").sqsh" salloc --exclude="$SLURM_EXCLUDED_NODELIST" --partition=$SLURM_PARTITION --account=$SLURM_ACCOUNT --gres=gpu:$TP --exclusive --time=180 --no-shell --job-name="$RUNNER_NAME" JOB_ID=$(squeue --name="$RUNNER_NAME" -u "$USER" -h -o %A | head -n1) diff --git a/runners/launch_h200-cw.sh b/runners/launch_h200-cw.sh index 84b40480c..08bbbc757 100644 --- a/runners/launch_h200-cw.sh +++ b/runners/launch_h200-cw.sh @@ -1,5 +1,7 @@ #!/usr/bin/env bash +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + export HF_HUB_CACHE_MOUNT="/mnt/vast/gharunner/hf-hub-cache" export PORT=8888 @@ -8,7 +10,7 @@ FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '') SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '') PARTITION="h200" -SQUASH_FILE="/mnt/vast/gharunner/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" +SQUASH_FILE="/mnt/vast/gharunner/squash/$(sanitize_image_filename "$IMAGE").sqsh" LOCK_FILE="${SQUASH_FILE}.lock" set -x diff --git a/runners/launch_h200-dgxc-slurm.sh b/runners/launch_h200-dgxc-slurm.sh index e11ca7b20..71a64025f 100755 --- a/runners/launch_h200-dgxc-slurm.sh +++ b/runners/launch_h200-dgxc-slurm.sh @@ -4,6 +4,8 @@ SLURM_PARTITION="main" SLURM_ACCOUNT="sa-shared" +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + set -x if [[ "$IS_MULTINODE" == "true" ]]; then @@ -33,29 +35,7 @@ if [[ "$IS_MULTINODE" == "true" ]]; then exit 1 fi - echo "Cloning srt-slurm repository..." 
- SRT_REPO_DIR="srt-slurm" - if [ -d "$SRT_REPO_DIR" ]; then - echo "Removing existing $SRT_REPO_DIR..." - rm -rf "$SRT_REPO_DIR" - fi - - git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR" - cd "$SRT_REPO_DIR" - git checkout sa-submission-q2-2026 - - echo "Installing srtctl..." - curl -LsSf https://astral.sh/uv/install.sh | sh - source $HOME/.local/bin/env - - uv venv - source .venv/bin/activate - uv pip install -e . - - if ! command -v srtctl &> /dev/null; then - echo "Error: Failed to install srtctl" - exit 1 - fi + clone_and_install_srtctl || exit 1 echo "Configs available at: $SRT_REPO_DIR/" @@ -64,12 +44,12 @@ if [[ "$IS_MULTINODE" == "true" ]]; then if [[ $FRAMEWORK == "dynamo-sglang" ]]; then # SGLang container mapping - SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/+/g').sqsh" + SQUASH_FILE="/data/containers/$(sanitize_image_filename "$IMAGE" +).sqsh" CONTAINER_KEY="$IMAGE" elif [[ $FRAMEWORK == "dynamo-trt" ]]; then # TRT-LLM container mapping - convert IMAGE to srt-slurm format (nvcr.io/ -> nvcr.io#) CONTAINER_KEY=$(echo "$IMAGE" | sed 's|nvcr.io/|nvcr.io#|') - SQUASH_FILE="/data/containers/$(echo "$IMAGE" | sed 's|nvcr.io/||' | sed 's/[\/:@#]/+/g').sqsh" + SQUASH_FILE="/data/containers/$(sanitize_image_filename "${IMAGE#nvcr.io/}" +).sqsh" fi export ISL="$ISL" @@ -119,7 +99,7 @@ EOF echo "Submitting job with srtctl..." if [[ -z "$CONFIG_FILE" ]]; then - echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a CONFIG_FILE in additional-settings." >&2 + echo "Error: CONFIG_FILE is not set. The srt-slurm path requires a 'recipe:' field on the search-space entry (resolved by benchmark-multinode-tmpl.yml)." 
>&2 echo "Config: MODEL_PREFIX=${MODEL_PREFIX} PRECISION=${PRECISION} FRAMEWORK=${FRAMEWORK}" >&2 exit 1 fi @@ -262,7 +242,7 @@ EOF else HF_HUB_CACHE_MOUNT="/models/gharunners/hf-hub-cache" - SQUASH_FILE="/data/gharunners/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" + SQUASH_FILE="/data/gharunners/containers/$(sanitize_image_filename "$IMAGE").sqsh" # Convert pyxis image format (nvcr.io#path) to docker format (nvcr.io/path) for enroot import DOCKER_IMAGE=$(echo "$IMAGE" | sed 's/#/\//g') diff --git a/runners/launch_h200-nb.sh b/runners/launch_h200-nb.sh index 9d157a858..849f73699 100644 --- a/runners/launch_h200-nb.sh +++ b/runners/launch_h200-nb.sh @@ -1,5 +1,7 @@ #!/usr/bin/bash +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + export HF_HUB_CACHE_MOUNT="/mnt/data/gharunners/hf-hub-cache/" export PORT=8888 @@ -12,7 +14,7 @@ PARTITION="main" set -x srun --partition=$PARTITION --gres=gpu:$TP --exclusive --job-name="$RUNNER_NAME" \ --container-image=$IMAGE \ ---container-name=$(echo "$IMAGE" | sed 's/[\/:@#]/_/g')-${USER} \ +--container-name=$(sanitize_image_filename "$IMAGE")-${USER} \ --container-mounts=$GITHUB_WORKSPACE:/workspace/,$HF_HUB_CACHE_MOUNT:$HF_HUB_CACHE \ --container-remap-root \ --container-writable \ diff --git a/runners/launch_mi300x-amds.sh b/runners/launch_mi300x-amds.sh index b654c515a..da98f3015 100644 --- a/runners/launch_mi300x-amds.sh +++ b/runners/launch_mi300x-amds.sh @@ -1,10 +1,12 @@ #!/usr/bin/env bash +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + export HF_HUB_CACHE_MOUNT="/raid/hf-hub-cache/" export PORT=8888 PARTITION="compute" -SQUASH_FILE="/home/gharunner/gharunners/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" +SQUASH_FILE="/home/gharunner/gharunners/squash/$(sanitize_image_filename "$IMAGE").sqsh" LOCK_FILE="${SQUASH_FILE}.lock" set -x diff --git a/runners/launch_mi325x-amds.sh b/runners/launch_mi325x-amds.sh index 67f93a309..200b46838 100644 --- a/runners/launch_mi325x-amds.sh +++ 
b/runners/launch_mi325x-amds.sh @@ -1,10 +1,12 @@ #!/usr/bin/env bash +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + export HF_HUB_CACHE_MOUNT="/nfsdata/sa/gharunner/gharunners/hf-hub-cache/" export PORT=8888 PARTITION="compute" -SQUASH_FILE="/nfsdata/sa/gharunner/gharunners/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" +SQUASH_FILE="/nfsdata/sa/gharunner/gharunners/squash/$(sanitize_image_filename "$IMAGE").sqsh" LOCK_FILE="${SQUASH_FILE}.lock" set -x diff --git a/runners/launch_mi355x-amds.sh b/runners/launch_mi355x-amds.sh index 152745d4e..a14cfdb2c 100644 --- a/runners/launch_mi355x-amds.sh +++ b/runners/launch_mi355x-amds.sh @@ -1,5 +1,7 @@ #!/usr/bin/env bash +source "$(dirname "$0")/../benchmarks/benchmark_lib.sh" + scancel_sync() { local jobid=$1 local timeout=${2:-600} @@ -182,7 +184,7 @@ else SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '') PARTITION="compute" - SQUASH_FILE="/var/lib/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" + SQUASH_FILE="/var/lib/squash/$(sanitize_image_filename "$IMAGE").sqsh" LOCK_FILE="${SQUASH_FILE}.lock" set -x diff --git a/utils/matrix_logic/generate_sweep_configs.py b/utils/matrix_logic/generate_sweep_configs.py index e543bb4af..44613e8eb 100644 --- a/utils/matrix_logic/generate_sweep_configs.py +++ b/utils/matrix_logic/generate_sweep_configs.py @@ -267,6 +267,8 @@ def generate_full_sweep(args, all_config_data, runner_data): seq_len_str = seq_len_to_str(isl, osl) runners_for_entry = runner_nodes_to_use if runner_nodes_to_use else [runner] + recipe = bmk.get(Fields.RECIPE.value) + for runner_value in runners_for_entry: entry = { Fields.IMAGE.value: image, @@ -285,6 +287,7 @@ def generate_full_sweep(args, all_config_data, runner_data): Fields.EXP_NAME.value: f"{model_code}_{seq_len_str}", Fields.DISAGG.value: disagg, Fields.RUN_EVAL.value: False, # Default, may be overridden by mark_eval_entries + Fields.RECIPE.value: recipe, } validate_matrix_entry(entry, is_multinode) @@ 
-463,6 +466,7 @@ def get_lowest_conc(search_space_entry): Fields.SPEC_DECODING.value, "none") prefill_config = lowest_conc_entry[Fields.PREFILL.value] decode_config = lowest_conc_entry[Fields.DECODE.value] + recipe = lowest_conc_entry.get(Fields.RECIPE.value) for node in runner_nodes: entry = { @@ -494,6 +498,7 @@ def get_lowest_conc(search_space_entry): Fields.EXP_NAME.value: f"{model_code}_test", Fields.DISAGG.value: disagg, Fields.RUN_EVAL.value: False, + Fields.RECIPE.value: recipe, } matrix_values.append(validate_matrix_entry(entry, is_multinode=True)) else: @@ -620,6 +625,7 @@ def generate_test_config_sweep(args, all_config_data): Fields.EXP_NAME.value: f"{model_code}_{seq_len_str}", Fields.DISAGG.value: disagg, Fields.RUN_EVAL.value: False, + Fields.RECIPE.value: bmk.get(Fields.RECIPE.value), } matrix_values.append(validate_matrix_entry(entry, is_multinode=True)) else: diff --git a/utils/matrix_logic/validation.py b/utils/matrix_logic/validation.py index ce10840b5..7f1fa3326 100644 --- a/utils/matrix_logic/validation.py +++ b/utils/matrix_logic/validation.py @@ -1,3 +1,5 @@ +from pathlib import Path + from pydantic import BaseModel, Field, ValidationError, ConfigDict, model_validator from typing import List, Optional, Union, Literal from enum import Enum @@ -5,6 +7,11 @@ import pprint import yaml +# Repo-relative root for first-class srt-slurm recipes referenced by the +# `recipe:` field on multi-node search-space entries. Resolved against the +# repository root (parent of utils/) so callers can run from any cwd. +RECIPES_ROOT = Path(__file__).resolve().parents[2] / "benchmarks" / "multi_node" / "srt-slurm-recipes" + """ The below class defines the field names expected to be present in the JSON entries for both single-node and multi-node configurations. 
@@ -44,6 +51,7 @@ class Fields(Enum): BATCH_SIZE = 'batch-size' MAX_NUM_TOKENS = 'max-num-tokens' ADDITIONAL_SETTINGS = 'additional-settings' + RECIPE = 'recipe' # Matrix entry fields CONC = 'conc' @@ -131,6 +139,11 @@ class MultiNodeMatrixEntry(BaseModel): run_eval: bool = Field(alias=Fields.RUN_EVAL.value) eval_only: bool = Field(alias=Fields.EVAL_ONLY.value, default=False) eval_conc: Optional[int] = Field(default=None, alias=Fields.EVAL_CONC.value) + # Path under benchmarks/multi_node/srt-slurm-recipes/ identifying the + # srt-slurm recipe to dispatch. May carry an `:override[N]` suffix that the + # launcher strips before resolving the file on disk. Optional because not + # every multi-node config uses srt-slurm. + recipe: Optional[str] = None def validate_matrix_entry(entry: dict, is_multinode: bool) -> dict: @@ -234,11 +247,31 @@ class MultiNodeSearchSpaceEntry(BaseModel): default=None, alias=Fields.CONC_END.value) conc_list: Optional[List[int]] = Field( default=None, alias=Fields.CONC_LIST.value) + # First-class srt-slurm recipe reference. Path is relative to + # benchmarks/multi_node/srt-slurm-recipes/ and may carry an + # `:override[N]` suffix to select an in-yaml override section. + recipe: Optional[str] = None @model_validator(mode='after') def validate_conc_fields(self): return _validate_conc_fields(self) + @model_validator(mode='after') + def validate_recipe_exists(self): + if self.recipe is None: + return self + # Strip `:override[...]` suffix used by sglang-style recipes that + # carry multiple override sections in one file. + recipe_path = self.recipe.split(':', 1)[0] + full_path = RECIPES_ROOT / recipe_path + if not full_path.is_file(): + raise ValueError( + f"Recipe file not found: '{self.recipe}' " + f"(resolved to '{full_path}'). " + f"Recipes must live under benchmarks/multi_node/srt-slurm-recipes/." + ) + return self + class SingleNodeSeqLenConfig(BaseModel): """Single node sequence length configuration."""
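Appended for illustration: the launchers in this diff all swap inline `sed` pipelines for `sanitize_image_filename` from `benchmarks/benchmark_lib.sh`, whose definition is outside the patch. Below is a minimal sketch inferred purely from the call sites (default `_` separator; optional second argument such as `+` for pyxis-style names) — the real helper may differ.

```shell
#!/usr/bin/env bash
# Hypothetical reimplementation of sanitize_image_filename, inferred from the
# call sites in this diff; the actual definition lives in benchmark_lib.sh.
sanitize_image_filename() {
  local image="$1"
  local sep="${2:-_}"  # some launchers pass `+` for pyxis-style squash names
  # Mirror the inline pipelines this diff removes: sed 's/[\/:@#]/_/g'
  echo "$image" | sed "s/[\/:@#]/${sep}/g"
}

# e.g. builds the same squash-file stem the old sed one-liners produced:
sanitize_image_filename "nvcr.io/nvidia/tensorrt-llm:24.10"
sanitize_image_filename "lmsysorg/sglang:latest" +
```

Note the second argument reproduces the `+`-separated variants used on the dgxc/h200 paths (`sanitize_image_filename "${IMAGE#nvcr.io/}" +`), where the registry prefix is stripped by the caller before sanitizing.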